CINTIL-USuite is a corpus of Portuguese that is annotated with lemmas, the Universal Part-of-Speech tagset (UPOS) and Universal feature bundles, related to the Universal Dependency framework, and that contains around 1 million annotated tokens. It is described in this article: António Branc...
This corpus is a collection of different governmental resources, containing two types of documents: minutes, which were taken during local council meetings (covering the years from 2007 till 2010) and memorandums (covering from 2008 till 2011). This corpus, consisting of raw text files and comma...
Multilingual corpora with coreferential annotation of person entities ===================================================================== In-progress corpora with coreferent annotation of person entities. Sources: journals and Wikipedia. Languages: * Portuguese: varieties from Portugal, Brazi...
Audio corpus: 8 subfolders with .wav files Each containing : • 2 sound files containing a read story (“The sun and the wind”, each by speaker A and speaker B) • 2 sound files containing each 30 read sentences (each by speaker A and speaker B) • 2 x each of the 30 sentences as a single sound f...
108 WAV files of spoken Maltese newspaper texts, subdivided into 12 directories with a variable number of sentences (sometimes: clauses) each. They come together with transcriptions and tables of phoneme durations.
Tweets annotated with geographic coordinates
Text corpus for bilingual concordancing, single- and multi-word translation extraction, machine translation. Languages: cs-pt, de-pt, en-pt, es-pt, fr-pt, it-pt, and pt-sk. Size: 1 G per language (phrases aligned). Domain: Law and Health.
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Parallel texts in pdf file format. Original in Swedish, ...
This research proposes a corpus of popular Brazilian Portuguese, called CorPop, with texts selected based on the average level of literacy of the country's readers. CorPop’s theoretical and methodological bases are interdisciplinary and fall within the scope of Language Studies and related discip...
Collection of dialogues extracted from subreddits related to Information Technology (IT) and extracted with RDET (Reddit Dataset Extraction Tool). It is composed of 61,842,638 tokens in 179,358 dialogues.