CINTIL-Definitions

The corpus presented here is a collection of several tutorials and scientific papers in the field of Information Technology with 603 annotated definitions from Portuguese. The texts were collected from the Web at the beginning of the 2006 and they are organised in 32 files of three different sub-...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
Bilingual corpus made out of PDF documents from the European Medicines Agency, (EMEA), https://www.ema.europa.eu, (February 2020) (EN-PT)

EN-PT Bilingual corpus made out of PDF documents from the European Medicines Agency, (EMEA), https://www.ema.europa.eu, (February 2020).

Resource Type:Corpus
Media Type:Text
Languages:English
Portuguese
COVID-19 EU presscorner v1 dataset. Bilingual (EN-PT)

Bilingual (EN-PT) corpus acquired from website (https://ec.europa.eu/commission/presscorner/) of the EU portal (14th May 2020).

Resource Type:Corpus
Media Type:Text
Languages:English
Portuguese
COVID-19 EUROPARL dataset v1. Bilingual (EN-PT)

Bilingual (EN-PT) corpus acquired from the website (https://www.europarl.europa.eu/) of the European Parliament (25th April 2020)

Resource Type:Corpus
Media Type:Text
Languages:English
Portuguese
COVID-19 EUROPARL v2 dataset. Bilingual (EN-PT)

Bilingual (EN-PT) corpus acquired from the website (https://www.europarl.europa.eu/) of the European Parliament (9th May 2020)

Resource Type:Corpus
Media Type:Text
Languages:English
Portuguese
COVID-19 Parallel Global Voices dataset. Bilingual (EN-PT)

EN-PT Bilingual COVID-19-related corpus acquired from the website (https://globalvoices.org/) of GlobalVoices (28th April 2020)

Resource Type:Corpus
Media Type:Text
Languages:English
Portuguese
COVID-19 EU presscorner v2 dataset. Bilingual (EN-PT)

Bilingual (EN-PT) corpus acquired from website (https://ec.europa.eu/commission/presscorner/) of the EU portal (8th July 2020).

Resource Type:Corpus
Media Type:Text
Languages:English
Portuguese
News corpus categorised

The News corpus developed by LIACC in JSON format was complemented with POS and keyword topics annotation. POS-tagging =========== The POS-tagging used the tagger described in Généreux et al. (2012) The title and text body were extracted, tokenized and pos-tagged. Two new fields were added...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
HESITA database

The HESITA database is a corpus consisting of television daily news collected over a month and was annotated regarding to hesitation events, acoustical environments, speaking styles, speaker characteristics and respiratory events, among other characteristic sounds.

Resource Type:Corpus
Media Types:Text
Audio
Language:Portuguese
ParaCrawl release 7 Portuguese-English

Portuguese-English parallel from release 7 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice....

Resource Type:Corpus
Media Type:Text
Languages:English
Portuguese

Order by:

Filter by:

Portuguese (111)
English (31)
Czech (14)
Bulgarian (13)
German (12)
French (10)
Italian (9)
Polish (7)
Slovak (7)
Basque (6)
Danish (6)
Finnish (6)
Irish (6)
Latvian (6)
Maltese (6)
Swedish (6)
Chinese (3)
Arabic (2)
Hindi (1)
Russian (1)
Swahili (1)
Thai (1)
Turkish (1)
Urdu (1)
TMX (18)
Text/xml (11)
Wav (3)
Sgml (1)
Xml (1)