The LX-SimLex-999 was created from SimLex-999 (Hill et al., 2015), which, in turn, was based on the University of South Florida Free Association Database (USF) (Nelson et al., 2014). SimLex-999 was created following strict guidelines. Both words in each pair have the same morphosyntactic category ...
The test set described in was used as the basis for the assessment of word embeddings. An example entry in this data set would read: ‘Berlin Germany Lisbon Portugal’. With four related words, as in this example, one can test semantic analogies by using any of the possible combinations o...
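As an illustration of how such an entry can be used, the sketch below checks one analogy with the common vector-offset approach (b - a + c, then nearest neighbour by cosine similarity). This is a minimal sketch under stated assumptions, not the procedure of the data set's authors: the three-dimensional vectors, the extra word ‘France’ and the names `TOY_VECTORS`, `solve_analogy` and `check_entry` are invented here for illustration.

```python
import math

# Toy 3-d vectors, invented for illustration only (not real embeddings).
# Dimensions loosely stand for: German-ness, Portuguese-ness, country-vs-capital.
TOY_VECTORS = {
    "Berlin":   [1.0, 0.0, 0.0],
    "Germany":  [1.0, 0.0, 1.0],
    "Lisbon":   [0.0, 1.0, 0.0],
    "Portugal": [0.0, 1.0, 1.0],
    "France":   [0.7, 0.7, 1.0],  # distractor word, also invented
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the vector-offset method."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def check_entry(entry, vectors):
    """Return True if the first three words of an entry predict the fourth."""
    a, b, c, d = entry.split()
    return solve_analogy(a, b, c, vectors) == d


print(check_entry("Berlin Germany Lisbon Portugal", TOY_VECTORS))
```

Other combinations of the same four words can be queried the same way, for example `solve_analogy("Lisbon", "Portugal", "Berlin", TOY_VECTORS)` to recover ‘Germany’.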
The LX-Battig was created from the Battig test set (Baroni et al., 2010). This data set has 83 concrete concepts in the following 10 categories: mammals, birds, fish, vegetables, fruit, trees, vehicles, clothes, tools and kitchenware. The category names and the concepts were translated by two trans...
BDCamões Corpus - Collection of Portuguese Literary Documents from the Digital Library of Camões I.P., is a collection of literary documents written in Portuguese, in plain text .txt format, with close to 4 million words from over 200 complete documents from 83 authors in 14 genres, covering a ti...
The corpus was developed as a linguistic resource for research on Automatic Summarization and its relation to other issues in discourse processing. Summ-it consists of fifty texts from the science domain, extracted from the Science section of the Brazilian daily newspaper Folha de Sã...
The LX-WordSim-353 was created from WordSim-353 (Agirre et al., 2009). As the name suggests, this data set contains 353 pairs of words. Both words in each pair can have different morphosyntactic categories. The data set is made of nouns, adjectives, verbs and named entities, and has no multiwords...
«The Memórias Paroquiais (Parish Memories) are an essential source for obtaining a radiography of Portugal in 1758-1761. They correspond to a survey, organized in 3 major parts (the locality itself, the mountain and the river), which was printed and sent to those responsible for the dioceses of t...
A learner corpus of written texts in Portuguese as a non-native language, with three independent subcorpora: - Portuguese as a Foreign Language: Subcorpus Português Língua Estrangeira (PEAPL2_PLE) http://teitok2.iltec.pt/peapl2-ple/index.php?action=home - East Timorese Portuguese: Subcorpus T...