CINTIL-UPos is a corpus of Portuguese that is annotated with the Universal Part-of-Speech tagset (UPOS), related to the Universal Dependency framework, and that contains around 1 million annotated tokens. It is described in this article: António Branco, João Ricardo Silva, Luís Gomes and Jo...
The PAROLE Portuguese Corpus – tagged subset contains 250.000 tokens and is a subset of the PAROLE Portuguese Corpus of 3 million running words of European Portuguese. The corpus was classified and encoded according to the common core parole encoding standard. The tagged subset reproduces appro...
This resource includes a spoken Portuguese corpus - with aligned sound and orthographic transcription -, collected among sociolinguistically diverse speakers. It consists of recordings from informal conversations.
The Ontology for the area of Nanoscience and Nanotechnology (Ontologia para a área de Nanociência e Nanotecnologia) is constituted by 511 terms of this field of knowledge. It was extracted from a corpus collected from the Web, with a total of 2.570.792 words
The LX-Rare Word Similarity Data set was created from Stanford Rare Word (RW) Similarity data set (Luong et al., 2013). This list contains 2 034 words (1 017 pairs of words). All the words were extracted from Wikipedia and from WordNet (Miller, 1995), a lexical database where the concepts are gro...
The LX-ESSLLI 2008 data set was created from the ESSLLI 2008 Distributional Semantic Workshop shared-task set, made of 44 concrete nouns grouped in 6 semantic categories (4 animate and 2 inanimate). The grouping is done in an hierarchical way following the top 10 properties from the McRae (2005) ...
LX-Abbreviations resource is a collection of abbreviations of different types from European Portuguese composed by 208 words. Each type of abbreviation is manually divided and annotated with grammatical categories, gender and number, and, finally, with the respective abbreviations.
«The Memórias Paroquiais (Parish Memories) are an essential source for obtaining a radiography of Portugal in 1758-1761. They correspond to a survey, organized in 3 major parts (the locality itself, the mountain and the river), which was printed and sent to those responsible for the dioceses of t...
LX-UTagger is a POS tagger for Portuguese that adopts the Universal Part-of-Speech tagset (UPOS), related to the Universal Dependency framework, with an initial performance of 99.06% under a ten-fold cross validation scheme. It is described in this article: António Branco, João Ricardo Silv...
A Portuguese as a non-native language learners' corpus of written texts with three independent subcorpora: - Portuguese as a Foreign Language: Subcorpus Português Língua Estrangeira (PEAPL2_PLE) http://teitok2.iltec.pt/peapl2-ple/index.php?action=home - East Timorese Portuguese: Subcorpus T...