The test set described in was used as the basis for assessing word embeddings. An example entry in this data set reads: ‘Berlin Germany Lisbon Portugal’. With four-word relations such as this one, semantic analogies can be tested by using any of the possible combinations o...
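As a rough illustration of how an entry such as ‘Berlin Germany Lisbon Portugal’ can be turned into an analogy test, the sketch below applies the common vector-offset approach: given the first three words, the fourth is predicted as the vocabulary word closest to b - a + c. This is an assumption about the evaluation procedure, not a description of the method actually used with this test set. The vectors in `TOY_VECTORS` are invented for the example, and the names `cosine` and `solve_analogy` are hypothetical.

```python
import math

# Toy 4-dimensional vectors, invented for illustration only.
# Real word embeddings are learned from a corpus and have many more dimensions.
TOY_VECTORS = {
    "Berlin":   [1.0, 0.0, 0.0, 1.0],
    "Germany":  [1.0, 0.0, 0.0, 0.0],
    "Lisbon":   [0.0, 1.0, 0.0, 1.0],
    "Portugal": [0.0, 1.0, 0.0, 0.0],
    "Madrid":   [0.0, 0.0, 1.0, 1.0],
    "Spain":    [0.0, 0.0, 1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors=TOY_VECTORS):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    The target vector is b - a + c; the three query words are excluded
    from the candidates, as is usual in analogy evaluation.
    """
    target = [bi - ai + ci for ai, bi, ci in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# One test-set entry yields several queries by reordering its four words.
print(solve_analogy("Berlin", "Germany", "Lisbon"))    # Portugal
print(solve_analogy("Germany", "Berlin", "Portugal"))  # Lisbon
```

An embedding model would count as correct on a query when the predicted word matches the held-out fourth word of the entry; accuracy over all entries then gives a single score for the model.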
The LT Corpus (Literary Corpus) contains approximately 1,781,083 running words of European and Brazilian Portuguese. It includes 70 copyright-free classics (61 from Portugal and 9 from Brazil) published before 1940.
The LogicalFormBankPT (Branco, 2009, and Branco et al., 2011) is a corpus of semantic dependencies of translated texts composed of 3,406 sentences and 44,598 tokens taken from the Wall Street Journal. The LogicalFormBankPT is composed of MRS representations of each sentence’s semantic relation...
The corpus contains the Laws of Malta in Maltese from the official government website. The unannotated raw text files were extracted from the PDF files available on the website.
The corpus contains the Laws of Malta in English from the official government website. The unannotated raw text files were extracted from the PDF files available on the website.
Royal inquiries of 1258 (primarily published in the Portugaliae Monumenta Historica).
The full editions of ILLUM from 12/11/2006 to 30/05/2010 (185 issues).
The HIMERA annotated corpus contains a set of published historical medical documents that have been manually annotated with semantic information that is relevant to the study of medical history and public health. Specifically, annotations correspond to seven different entity types and two differe...
A corpus of manually annotated event hierarchies in news stories.
Hesita-POS is an annotated corpus of TV news.