Web service created by exporting a UIMA-based workflow from the U-Compare text mining system. Functionality: Identifies and categorises syntactic chunks in plain text. Tools in workflow: Freeling shallow parser web service (service provided by the PANACEA project). NOTE: The licence provided cove...
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Parallel texts Danish-English from the Danish Ministry o...
The CINTIL-LogicalFormBank (Branco, 2009, and Branco et al., 2011) is a corpus of semantic dependencies of sentences from Portuguese texts composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), novels (399 sentences; 3,082...
This research proposes a corpus of popular Brazilian Portuguese, called CorPop, with texts selected based on the average level of literacy of the country's readers. CorPop’s theoretical and methodological bases are interdisciplinary and fall within the scope of Language Studies and related discip...
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Estonian-English translations of the Acts of Estonian la...
This is a set of 11,361 biographies of Portuguese people. The data was compiled by collecting biographies from Wikipedia and converting them. Several filters were applied to remove entries that were mostly empty or contained non-applicable content. Format: JSON (conversion from HTML) ...
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Parallel Global Voices EL-EN is a parallel corpus genera...
The AuCoPro-Splitting dataset contains compounds annotated with their compound boundaries and linking morphemes. The dataset consists of two files, one for Afrikaans and one for Dutch. The annotation was performed according to annotation guidelines as described in Verhoeven, van Zaanen, van Huyss...
CINTIL-DeepBank (Branco et al., 2010) is a corpus of Portuguese texts annotated with deep grammatical information. This document refers to version 1.4 of the corpus, from January 2016, which adds over 15,400 annotated sentences to the previous version from September 2015. The current version i...
Web service created by exporting a UIMA-based workflow from the U-Compare text mining system. Functionality: Identifies sentences and tokens in plain text. Parts of speech and lemmas are assigned to tokens. Language is automatically identified amongst the supported languages and language-specific ...