The News corpus developed by LIACC in JSON format was complemented with POS and keyword topics annotation. POS-tagging =========== The POS-tagging used the tagger described in Généreux et al. (2012) The title and text body were extracted, tokenized and pos-tagged. Two new fields were added...
CINTIL-USuite is a corpus of Portuguese that is annotated with lemmas, the Universal Part-of-Speech tagset (UPOS) and Universal feature bundles, related to the Universal Dependency framework, and that contains around 1 million annotated tokens. It is described in this article: António Branc...
This is a corpus for multi-document summarization for European Portuguese. It contains 80 topics, each of which has 10 documents, for a total of 800 documents. Each topic contains two human summaries. The summaries are compressive: they are the result of a compression of the sentences in the orig...
The CRPC Discourse Bank is labeled for discourse relations (also referred to as rhetorical relations or coher- ence relations), such as cause and condition, that hold between two spans of text and contribute to ensure the overall cohesion and coherence of the text. The scheme follows the principl...
This is a data set of Portuguese tweets labeled with the emotion conveyed in the tweet. It was gathered using a methodology similar to the one used for building the Affect in Tweets data set used in the SemEval-2018 Task 1. The data set contains 11219 tweets, each labeled with an emotion (anger,...
Bilingual (EN-PT) corpus acquired from the website (https://www.europarl.europa.eu/) of the European Parliament (9th May 2020)
The CORDIAL-SIN–TreeBank is a collection of 177596 syntactic parse trees of the Syntax-oriented Corpus of Portuguese Dialects. CORDIAL-SIN is a corpus of spoken dialectal European Portuguese, developed at Centro de Linguística da Universidade de Lisboa, that compiles excerpts of spontaneous and s...