The News corpus developed by LIACC in JSON format was complemented with POS and keyword topics annotation. POS-tagging =========== The POS-tagging used the tagger described in Généreux et al. (2012). The title and text body were extracted, tokenized and POS-tagged. Two new fields were added...
This research proposes a corpus of popular Brazilian Portuguese, called CorPop, with texts selected based on the average level of literacy of the country's readers. CorPop’s theoretical and methodological bases are interdisciplinary and fall within the scope of Language Studies and related discip...
This corpus was run through BiRoamer https://github.com/bitextor/biroamer to anonymise the Portuguese-English parallel data from release 7 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with ...
The texts are sentences from the Europarl parallel corpus (Koehn, 2005). The texts contain the monolingual sentences from parallel corpora for the following pairs: Bulgarian-English, Czech-English, Portuguese-English and Spanish-English. The English corpus comprises the English side of th...
The LX-Rare Word Similarity Data set was created from Stanford Rare Word (RW) Similarity data set (Luong et al., 2013). This list contains 2 034 words (1 017 pairs of words). All the words were extracted from Wikipedia and from WordNet (Miller, 1995), a lexical database where the concepts are gro...
Multilingual corpora with coreferential annotation of person entities ===================================================================== In-progress corpora with coreferential annotation of person entities. Sources: journals and Wikipedia. Languages: * Portuguese: varieties from Portugal, Brazi...
The Portuguese Parliamentary Corpus is part of the Multilingual ParlaMint Corpus, a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions. The Portuguese corpus (ParlaMint-PT) comprises transcripts of sessions in the time pe...
277780 sentence pairs (in 23 EN-X language pairs in total) extracted from the Publications Office of the EU on the medical domain. These are sourced from laws, studies, EC announcements, etc. labelled with concepts like epidemiology, epidemic, disease surveillance, health control, public hygiene,...
The CINTIL-WordSenses corpus, built upon the CINTIL International Corpus of Portuguese (Barreto et al., 2006), is composed of 23,825 sentences of written Portuguese with open-class terms manually disambiguated and annotated with synset identifiers from the Portuguese MultiWordNet (MWNPT) (Pianti ...
«The Memórias Paroquiais (Parish Memories) are an essential source for obtaining a radiography of Portugal in 1758-1761. They correspond to a survey, organized in 3 major parts (the locality itself, the mountain and the river), which was printed and sent to those responsible for the dioceses of t...