EUROPARL Corpus Parallel Corpora: Portuguese-English

EUROPARL

Handle:	https://hdl.handle.net/21.11129/0000-000B-D33E-2 (persistent URL to this page)
URL:	http://www.statmt.org/europarl/

The EUROPARL Corpus (Portuguese-English subpart of the parallel corpora), available at http://www.statmt.org/europarl/, was extracted from the proceedings of the European Parliament (Koehn, 2005). It contains transcriptions of sessions from 1996 to 2011, totalling approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896 tokens of English (translation).
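As a rough illustration of how a parallel corpus of this kind is typically consumed, the sketch below pairs Portuguese and English sentences from two line-aligned sources and counts whitespace-separated tokens on each side. It assumes one sentence per line with matching line numbers across the two sides; the function names (`read_parallel`, `token_counts`, `count_from_files`) and the file paths mentioned in the comments are hypothetical and are not part of this record. Whitespace splitting is only a stand-in for the corpus's own tokenization, so the totals will not match the token counts reported here.

```python
from typing import Iterable, Iterator, Tuple


def read_parallel(pt_lines: Iterable[str],
                  en_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (Portuguese, English) pairs from two line-aligned sources."""
    for pt, en in zip(pt_lines, en_lines):
        pt, en = pt.strip(), en.strip()
        # Skip pairs where either side is empty (no usable alignment).
        if pt and en:
            yield pt, en


def token_counts(pairs: Iterable[Tuple[str, str]]) -> Tuple[int, int]:
    """Return (Portuguese, English) whitespace-token totals."""
    pt_total = en_total = 0
    for pt, en in pairs:
        pt_total += len(pt.split())
        en_total += len(en.split())
    return pt_total, en_total


def count_from_files(pt_path: str, en_path: str) -> Tuple[int, int]:
    """Count tokens in two aligned UTF-8 files.

    The paths are supplied by the caller; any concrete file name
    (e.g. "corpus.pt" / "corpus.en") is an assumption, not taken
    from this record.
    """
    with open(pt_path, encoding="utf-8") as f_pt, \
         open(en_path, encoding="utf-8") as f_en:
        return token_counts(read_parallel(f_pt, f_en))
```

Both functions stream line by line, so a corpus of this size does not need to be held in memory at once.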

Distribution

Licence

CC BY-NC-SA

Restrictions: Academic - Non Commercial Use

Distribution Access/Medium: Accessible Through Interface

Licensors:

Amália Mendes

http://www.clul.ul.pt/en/researcher/146-amalia-mendes

Faculdade de Letras da Universidade de Lisboa

CLUL

Alameda da Universidade

1600-214 Lisbon

Tel.: 00351217904961

Fax: 00351217965622

Centro de Linguística da Universidade de Lisboa

http://www.clul.ul.pt

CLUL

Av. Prof. Gama Pinto, 2

1649-003 Lisbon

Tel.: +351217920000

Fax: +351217965622

IPR Holder

Amália Mendes

http://www.clul.ul.pt/en/researcher/146-amalia-mendes

Faculdade de Letras da Universidade de Lisboa

CLUL

Alameda da Universidade

1600-214 Lisbon

Tel.: 00351217904961

Fax: 00351217965622

Centro de Linguística da Universidade de Lisboa

http://www.clul.ul.pt

CLUL

Av. Prof. Gama Pinto, 2

1649-003 Lisbon

Tel.: +351217920000

Fax: +351217965622

Contact Person

Amália Mendes

http://www.clul.ul.pt/en/researcher/146-amalia-mendes

Faculdade de Letras da Universidade de Lisboa

CLUL

Alameda da Universidade

1600-214 Lisbon

Tel.: 00351217904961

Fax: 00351217965622

Centro de Linguística da Universidade de Lisboa

http://www.clul.ul.pt

CLUL

Av. Prof. Gama Pinto, 2

1649-003 Lisbon

Tel.: +351217920000

Fax: +351217965622

text

Bilingual text corpus

Languages

English (49,216,896 Tokens)

Portuguese (58,324,562 Tokens)

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Size

49,216,896 Tokens (English)

58,324,562 Tokens (Portuguese)

Character encoding

UTF-8 (58,324,562 Tokens)

Domains

Political (58,324,562 Tokens)

Modalities

Written Language

Classification

(58,324,562 Tokens)

Text type: Parliament sessions

Text genre: Political

Annotation

Morphosyntactic Annotation - PoS Tagging

Tagset: http://alfclul.clul.ul.pt/CQPweb/doc/CRPCmanual.v1_en.pdf (English)

Segmentation level: Word

Annotation Mode: Automatic

Annotation Tools:

http://ilk.uvt.nl/mbt/

Size: 58,324,562 Tokens

Annotation Manual:

Document Type: In Proceedings

Généreux, M., I. Hendrickx, A. Mendes, A Large Portuguese Corpus On-Line: Cleaning and Preprocessing, http://www.propor2012.org/, pp. 113-120, 10th International Conference PROPOR2012, 2012

Editor: Caseli, H. et al. (eds.)

Publisher: Heidelberg: Springer-Verlag

Keywords: Corpus cleaning, PoS Tagging, Lemmatization

Document Language: English

Lemmatization

Segmentation level: Word

Annotation Mode: Automatic

Annotation Tools:

http://ilk.uvt.nl/mbma/

Size: 58,324,562 Tokens

Annotation Manual:

Document Type: In Proceedings

Généreux, M., I. Hendrickx, A. Mendes, A Large Portuguese Corpus On-Line: Cleaning and Preprocessing, http://www.propor2012.org/, pp. 113-120, 10th International Conference PROPOR2012, 2012

Editor: Caseli, H. et al. (eds.)

Publisher: Heidelberg: Springer-Verlag

Keywords: Corpus cleaning, PoS Tagging, Lemmatization

Document Language: English

Time Coverage

1996-2011 (58,324,562 Tokens)

Geographic coverage

Portugal (58,324,562 Tokens)

United Kingdom (49,216,896 Tokens)

Creation

Creation mode: Automatic

Resource Creation

Resource Creator

Amália Mendes

http://www.clul.ul.pt/en/researcher/146-amalia-mendes

Faculdade de Letras da Universidade de Lisboa

CLUL

Alameda da Universidade

1600-214 Lisbon

Tel.: 00351217904961

Fax: 00351217965622

Centro de Linguística da Universidade de Lisboa

http://www.clul.ul.pt

CLUL

Av. Prof. Gama Pinto, 2

1649-003 Lisbon

Tel.: +351217920000

Fax: +351217965622

Funding Project

METANET4U - Enhancing the Linguistic Infrastructure of Europe (METANET4U)

Funding Type: EU Funds

Metadata

Created: 12/17/2012

Last Updated: 01/21/2013

Metadata Creator

Amália Mendes

http://www.clul.ul.pt/en/researcher/146-amalia-mendes

Faculdade de Letras da Universidade de Lisboa

CLUL

Alameda da Universidade

1600-214 Lisbon

Tel.: 00351217904961

Fax: 00351217965622

Centro de Linguística da Universidade de Lisboa

http://www.clul.ul.pt

CLUL

Av. Prof. Gama Pinto, 2

1649-003 Lisbon

Tel.: +351217920000

Fax: +351217965622

Version

Version: 1.0

Usage

Foreseen Use - NLP Applications

Use NLP Specific: Information Extraction, Lemmatization, Lexicon Access, Machine Translation, Morphosyntactic Tagging, PoS Tagging, Word Sense Disambiguation

Foreseen Use - Human Use

Use NLP Specific: Linguistic Research

Actual Use - NLP Applications

Use NLP Specific: Information Extraction, Lemmatization, Lexicon Access, Machine Translation, Morphosyntactic Tagging, PoS Tagging, Word Sense Disambiguation

Actual Use - Human Use

Use NLP Specific: Linguistic Research

Documentation

Document Type: Article

Koehn, P., “EUROPARL: A Parallel Corpus for Statistical Machine Translation”, pp. 79-86, Tenth Machine Translation Summit, Phuket, Thailand, 2005

Book Title: Proceedings of the Tenth Machine Translation Summit, Phuket, Thailand

Document Type: Other

Sandra Antunes, EUROPARL Corpus Parallel Corpora: Portuguese-English Narrative Description, http://portulanclarin.net/repository/extradocs/EUROPARL.pdf

Document Type: Proceedings

Généreux, M., I. Hendrickx and A. Mendes (2012), “A Large Portuguese Corpus On-Line: Cleaning and Preprocessing”, pp. 113-120, 10th International Conference PROPOR2012, 2012

Publisher: Berlin, Heidelberg: Springer-Verlag

Book Title: Proceedings of the 10th International Conference PROPOR2012
