The text of the novel Sania (Eng. The Sledge) served as a training corpus. It was written in 1955 by Ion Druță and originally printed in Cyrillic script. We applied a previously developed recognition technology together with specialized lexicons. In this way, we obtained the electr...
Treebanks and semantic lexicons for Basque, Bulgarian, Dutch, German and Portuguese, created within the European project QTLeap.
«The Memórias Paroquiais (Parish Memories) are an essential source for obtaining a radiography of Portugal in 1758-1761. They correspond to a survey, organized in 3 major parts (the locality itself, the mountain and the river), which was printed and sent to those responsible for the dioceses of t...
This is a corpus for multi-document summarization for European Portuguese. It contains 80 topics, each of which has 10 documents, for a total of 800 documents. Each topic contains two human summaries. The summaries are compressive: they are the result of a compression of the sentences in the orig...
In the period since 2004, many novel sophisticated approaches for generic multi-document summarization have been developed. Intuitive simple approaches have also been shown to perform unexpectedly well for the task. Yet it is practically impossible to compare the existing approaches directly, bec...
The LX-Rare Word Similarity Data set was created from the Stanford Rare Word (RW) Similarity data set (Luong et al., 2013). This list contains 2 034 words (1 017 pairs of words). All the words were extracted from Wikipedia and from WordNet (Miller, 1995), a lexical database where the concepts are gro...
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Bilingual Croatian-English Parallel Corpus of 21340 tran...
EmoVoicePort, Emotional Vocalization Corpus (see Lima, Castro, & Scott, 2013) is a validated set of nonverbal vocalizations that portray four positive emotions (achievement/triumph, amusement, sensual pleasure, relief) and four negative ones (anger, disgust, fear, sadness). The vocalizations (n =...
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. English-Slovak corpus of annual reports on immigration a...
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Dataset of various English-Slovak legal texts within age...