This research proposes a corpus of popular Brazilian Portuguese, called CorPop, with texts selected based on the average level of literacy of the country's readers. CorPop’s theoretical and methodological bases are interdisciplinary and fall within the scope of Language Studies and related discip...
Datasets in ARFF format (for the Weka machine learning software) are made available to reproduce the validation experiments presented in the paper.
BDCamões Corpus - Collection of Portuguese Literary Documents from the Digital Library of Camões I.P., is a collection of literary documents written in Portuguese, in plain text .txt format, with close to 4 million words from over 200 complete documents from 83 authors in 14 genres, covering a ti...
QTLeap WSD/NED corpus. This corpus is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu). The texts are Q&A interactions from the real-user scenario (batches 1 and 2). The interactions in this corpus are available in Basque, Bulgar...
The Complex Word (CW) Corpus contains 731 sentences, each with one annotated CW. These simplifications were mined from Simple Wikipedia edit histories. Each entry gives an example of a sentence requiring simplification by means of a single lexical edit. This resource is primarily designed for t...
This corpus was created from documents in the translation memories of the Elhuyar Foundation (obtained via Eleka, a member of the Advisory Board of Potential Users).
Perfil Sociolinguístico da Fala Bracarense - POS is a manually verified part-of-speech annotation of the EXMARaLDA transcriptions in "Perfil Sociolinguístico da Fala Bracarense", a Portuguese speech corpus with 90 hours of recorded spontaneous speech, aligned with its transcription in EXMARaLDA f...
The corpus contains the Laws of Malta in Maltese, taken from the official government website. The unannotated raw text files were extracted from the PDF files available on that website.
CIPM-POS is a set of historical, religious, notarial, and literary texts in prose and verse, written in medieval Portuguese. It contains around 88,000 words.
Dundee GCG-Bank contains hand-corrected deep syntactic annotations for the Dundee eye-tracking corpus (Kennedy et al., 2003). The annotations are designed to support psycholinguistic investigation into the structural determinants of sentence processing effort. Dundee GCG-Bank is distributed as a ...