Datasets for classification experiments IS-pros

Datasets is arff format (for Weka machine learning software) are made available to reproduce the validation experiments presented in the paper.

Resource Type:Corpus
Media Type:Text
Language:English
EUROPARL Corpus Parallel Corpora: Portuguese-English

The EUROPARL Corpus (subpart Portuguese-English of the parallel corpora), available at http://www.statmt.org/europarl/, was extracted from the proceedings of the European Parliament (Koehn, 2005). It contains transcriptions of sessions dating back from 1996 to 2011, in a total of approximately 58...

Resource Type:Corpus
Media Type:Text
Languages:English
Portuguese
COVID-19 EU presscorner v2 dataset. Multilingual (CEF languages)

Multilingual (CEF languages) corpus acquired from website (https://ec.europa.eu/commission/presscorner/) of the EU portal (8th July 2020). It contains 23 TMX files (EN-X, where X is a CEF language) with 151895 TUs in total.

Resource Type:Corpus
Media Type:Text
Languages:Bulgarian
Croatian
Czech
Danish
Dutch; Flemish
English
Estonian
Finnish
French
German
Greek, Modern (1453-)
Hungarian
Irish
Italian
Latvian
Lithuanian
Maltese
Moldavian; Moldovan
Polish
Portuguese
Romanian
Slovak
Slovenian
Spanish; Castilian
Swedish
COVID-19 EU presscorner v1 dataset. Multilingual (CEF languages)

Multilingual (CEF languages) corpus acquired from website (https://ec.europa.eu/commission/presscorner/) of the EU portal (14th May 2020). It contains 23 TMX files (EN-X, where X is a CEF language) with 83217 TUs in total.

Resource Type:Corpus
Media Type:Text
Languages:Bulgarian
Croatian
Czech
Danish
Dutch; Flemish
English
Estonian
Finnish
French
German
Greek, Modern (1453-)
Hungarian
Irish
Italian
Latvian
Lithuanian
Maltese
Moldavian; Moldovan
Polish
Portuguese
Romanian
Slovak
Slovenian
Spanish; Castilian
Swedish
Code-switched English-Spanish Tweets

This package contains the collection of tweets described in the LREC 2018 paper: "Collecting Code-Switched Data from Social Media", Gideon Mendels, Victor Soto, Aaron Jaech and Julia Hirschberg, LREC 2018. Please remember to cite this paper if you use this resource. The tagged_tweets_ids file con...

Resource Type:Corpus
Media Type:Text
Languages:English
Spanish; Castilian
Memorias de traducción Portal oficial de turismo de España www.spain.info

Memoria de traducción Portal oficial de turismo de España www.spain.info

Resource Type:Corpus
Media Type:Text
Languages:English
French
German
Italian
Portuguese
Spanish; Castilian
Anonymised ParaCrawl release 7 Portuguese-English

This corpus was run through BiRoamer https://github.com/bitextor/biroamer to anonymise the Portuguese-English parallel data from release 7 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with ...

Resource Type:Corpus
Media Type:Text
Languages:English
Portuguese
Multilingual corpus from the Publications Office of the EU on the medical domain v.2

277780 sentence pairs (in 23 EN-X language pairs in total) extracted from the Publications Office of the EU on the medical domain. These are sourced from laws, studies, EC announcements, etc. labelled with concepts like epidemiology, epidemic, disease surveillance, health control, public hygiene,...

Resource Type:Corpus
Media Type:Text
Languages:Bulgarian
Croatian
Czech
Danish
Dutch; Flemish
English
Estonian
Finnish
French
German
Greek, Modern (1453-)
Hungarian
Irish
Italian
Latvian
Lithuanian
Maltese
Polish
Portuguese
Romanian
Slovak
Slovenian
Spanish; Castilian
Swedish
COVID-19 EC-EUROPA v1 dataset. Multilingual (CEF languages)

Multilingual (CEF languages) corpus acquired from website (https://ec.europa.eu/*coronavirus-response) of the EU portal (20th May 2020). It contains 23 TMX files (EN-X, where X is a CEF language) with 53311 TUs in total.

Resource Type:Corpus
Media Type:Text
Languages:Bulgarian
Croatian
Czech
Danish
Dutch; Flemish
English
Estonian
Finnish
French
German
Greek, Modern (1453-)
Hungarian
Irish
Italian
Latvian
Lithuanian
Maltese
Moldavian; Moldovan
Polish
Portuguese
Romanian
Slovak
Slovenian
Spanish; Castilian
Swedish
Parallel texts from Swedish Social Security Authority (Processed)

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Parallel texts, email templates and forms in pdf file fo...

Resource Type:Corpus
Media Type:Text
Languages:Croatian
English
Finnish
French
German
Italian
Polish
Romanian
Spanish; Castilian
Swedish

Order by:

Filter by: