Spanish-English website parallel corpus (Processed)

Handle:	https://hdl.handle.net/21.11129/0000-000D-F8E2-C (persistent URL to this page)
ELRA ID:	ELRA-W0248

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu.
This is a parallel corpus of bilingual texts crawled from multilingual websites. It contains 21,007 translation units (TUs).
Period of crawling: 15/11/2016 - 23/01/2017
A strict validation process has been followed, which resulted in discarding:
- TUs from crawled websites that do not comply with the PSI directive,
- TUs in which more than 99% of tokens are misspelled,
- TUs identified during the manual validation process, and all the TUs from websites whose error rate in the sample extracted for manual validation is strictly above any of the following thresholds:
  - 50% of TUs with language identification errors,
  - 50% of TUs with alignment errors,
  - 50% of TUs with tokenization errors,
  - 20% of TUs identified as machine translated content,
  - 50% of TUs with translation errors.
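
The website-level rule above can be sketched in a few lines of Python. This is only an illustration of the "strictly above" comparison, not the validation tooling actually used by the project: the category keys, the function name `website_is_discarded`, and the shape of the input (error counts per category for one manually validated sample) are assumptions made for the example.

```python
# Thresholds from the list above, expressed as fractions of the validated sample.
# The category keys are hypothetical labels chosen for this sketch.
THRESHOLDS = {
    "language_identification": 0.50,
    "alignment": 0.50,
    "tokenization": 0.50,
    "machine_translated": 0.20,
    "translation": 0.50,
}


def website_is_discarded(sample_size, error_counts):
    """Return True if any error rate in the sample is strictly above its threshold.

    sample_size  -- number of TUs in the manually validated sample
    error_counts -- dict mapping a category key to the number of TUs with that error
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return any(
        error_counts.get(category, 0) / sample_size > limit
        for category, limit in THRESHOLDS.items()
    )
```

Because the comparison is strict, a website whose sample sits exactly on a threshold (for example 20 machine translated TUs out of 100) is kept under this reading, while 21 out of 100 would cause all of its TUs to be discarded.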

Distribution

Licence

Open Under - PSI

Distribution Access/Medium: Downloadable

Contact Person

Valérie Mapelli Female

http://www.elda.org

55-57 rue Brillat-Savarin

75013 Paris

France

Tel.: +1 43 13 33 33

Fax: +1 43 14 33 30

text

Bilingual text corpus

Languages

English, Spanish (Castilian)

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Text Format

application/x-tmx+xml

Size

21,007 Units

Character encoding

UTF-8
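
Since the corpus is distributed as UTF-8 encoded TMX, it can be read with a standard XML parser. The following is a minimal sketch using only the Python standard library. It assumes the usual TMX layout (`tu` elements containing `tuv` elements, each with a `seg` child and an `xml:lang` or legacy `lang` attribute); the function name `read_tmx` and the inline sample are illustrative and are not part of the distribution.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(source):
    """Yield one {language code: segment text} dict per <tu> element.

    source may be a file path or a binary file object.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            unit = {}
            for tuv in elem.iter("tuv"):
                # Fall back to the plain "lang" attribute used by older TMX files.
                lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
                seg = tuv.find("seg")
                if seg is not None:
                    # itertext() also collects text nested inside inline markup.
                    unit[lang.lower()] = "".join(seg.itertext()).strip()
            yield unit
            elem.clear()  # release the processed unit to keep memory use low


# Tiny made-up sample, used here only to show the expected structure.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="en"><seg>Good morning</seg></tuv>
  <tuv xml:lang="es"><seg>Buenos días</seg></tuv>
</tu>
</body></tmx>""".encode("utf-8")

units = list(read_tmx(io.BytesIO(SAMPLE)))
```

Passing the path of the downloaded TMX file to `read_tmx` instead of the sample and counting the yielded units should, if the file matches this record, give 21,007.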

Modalities

Written Language

Resource Creation

Funding Project

European Language Resource Coordination LOT3 (ELRC Data - Tools and Resources for CEF Automated Translation - LOT3 (SMART 2015/1091 - 30-CE-0816766/00-92))

URL: http://www.lr-coordination.eu/

Funding Type: EU Funds

Project duration: 12/13/2016 - 02/12/2020

Metadata

Created: 05/25/2020

Last Updated: 08/05/2021

Metadata Creator

Andrea Teixeira Female

University of Lisbon

Campo Grande

1749-016 Lisboa