Freely available large dataset, manually annotated for German NER. Includes nested span annotations. Source text from German Wikipedia and news. This data set does not contain the test data, which is used for the GermEval 2014 NER task at KONVENS. Test data will be available from September 2014.
The LX-WordSim-353 was created from WordSim-353 (Agirre et al., 2009). As the name suggests, this data set contains 353 pairs of words. Both words in each pair can have different morphosyntactic categories. The data set is made of nouns, adjectives, verbs and named entities, and has no multiwords...
BDCamões Corpus - Collection of Portuguese Literary Documents from the Digital Library of Camões I.P., is a collection of literary documents written in Portuguese, in plain text .txt format, with close to 4 million words from over 200 complete documents from 83 authors in 14 genres, covering a ti...
TimeBankPT, a TimeML annotated corpus of Portuguese, is the first corpus of Portuguese with rich temporal annotations (i.e. it includes annotations not only of temporal expressions but also about events and temporal relations). The annotation scheme used is similar to TimeML. TimeBankPT is the...
ixa-pipe-ned-ukb is a multilingual Named Entity Disambiguation tool. It is based on UKB (http://ixa2.si.ehu.es/ukb/), a graph-based Word Sense Disambiguation tool. The Wikipedia graph built from the hyperlinks between Wikipedia articles is used for the processing. The input of the tool is ...
CIPM is a set of historical, religious, notarial, literary texts in prose and verse, written in medieval portuguese. It has around 3.5 million words.
This lexicon is a speech lexicon, exported from Crimsonwing’s text-to-speech (TTS) database into a .txt file. In its original form and together with the Maltese Speech Engine Diphone repository, it was used for building Crimsonwing’s text-to-speech system. The file is in txt format, with each ...
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Contents of http://casopis-gradjevinar.hr were crawled, ...
News articles collected from Portuguese newspapers.
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Parallel texts in pdf file format. Original in Swedish, ...