The BDCamões Corpus (Collection of Portuguese Literary Documents from the Digital Library of Camões I.P.) is a collection of literary documents written in Portuguese, in plain-text .txt format, with close to 4 million words from over 200 complete documents by 83 authors in 14 genres, covering a ti...
News articles collected from Portuguese newspapers.
Multilingual corpora with coreferential annotation of person entities. In-progress corpora with coreferential annotation of person entities. Sources: journals and Wikipedia. Languages: * Portuguese: varieties from Portugal, Brazi...
Corpus of transcriptions of syllogistic reasoning protocols. Written transcriptions: verbal data (30 hours) elicited during an experiment on syllogistic reasoning (each of 27 participants x the 64 syllogistic problems): thinking-aloud task; reflexive conversation. Performance data: La...
CINTIL-DeepBank (Branco et al., 2010) is a corpus of Portuguese texts annotated with deep grammatical information. This document refers to version 1.4 of the corpus, from January 2016, which adds over 15,400 annotated sentences to the previous version from September 2015. The current version i...
CINTIL-USuite is a corpus of Portuguese that is annotated with lemmas, the Universal Part-of-Speech tagset (UPOS) and Universal feature bundles, related to the Universal Dependency framework, and that contains around 1 million annotated tokens. It is described in this article: António Branc...
The CORDIAL-SIN–TreeBank is a collection of 177,596 syntactic parse trees from the Syntax-oriented Corpus of Portuguese Dialects. CORDIAL-SIN is a corpus of spoken dialectal European Portuguese, developed at Centro de Linguística da Universidade de Lisboa, that compiles excerpts of spontaneous and s...
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Polish-English parallel corpus from the website of the I...
A burst-annotated co-occurrence network on the Arab Spring topic, built on top of New York Times article snapshots from the years 2010-2013.
This is a data set of Portuguese tweets, each labeled with the emotion it conveys. It was gathered using a methodology similar to the one used for building the Affect in Tweets data set used in SemEval-2018 Task 1. The data set contains 11,219 tweets, each labeled with an emotion (anger,...