ExtraGLUE is a Portuguese dataset obtained by automatically translating some of the tasks in the GLUE and SuperGLUE benchmarks. Two variants of Portuguese are covered, namely European Portuguese and American Portuguese. The 14 tasks in ExtraGLUE cover different aspects of language unders...
ExtraGLUE-instruct is a dataset of task examples, instructions, and prompts that combine instructions with examples, for both the European variant of Portuguese, spoken in Portugal, and the American variant of Portuguese, spoken in Brazil. For each variant, it contains over 170...
This is a corpus for multi-document summarization for European Portuguese. It contains 80 topics, each of which has 10 documents, for a total of 800 documents. Each topic contains two human summaries. The summaries are compressive: they are the result of a compression of the sentences in the orig...
Porttinari-base (Duran et al., 2023) is the journalistic portion of Porttinari (which stands for "PORTuguese Treebank"), a planned large multigenre treebank for Portuguese (Pardo et al., 2021) that follows the Universal Dependencies international grammar framework (de Marneffe et al., 2021...
News articles collected from Portuguese newspapers.
The CRPC Discourse Bank is labeled for discourse relations (also referred to as rhetorical relations or coherence relations), such as cause and condition, that hold between two spans of text and help ensure the overall cohesion and coherence of the text. The scheme follows the principl...
The DepBankPT (Branco et al., 2011a) is a corpus annotated with grammatical dependencies, composed of 3,406 sentences and 44,598 tokens of translated news taken from the Wall Street Journal. The DepBankPT is aligned to a constituency bank, the TreeBankPT (see Branco et al., 2011b). The key bridging eleme...
CINTIL-DeepBank (Branco et al., 2010) is a corpus of Portuguese texts annotated with deep grammatical information. This document refers to version 1.4 of the corpus, from January 2016, which adds over 15,400 annotated sentences to the previous version from September 2015. The current version i...