LT Corpus

The LT Corpus (Literary Corpus) contains approximately 1,781,083 running words of European and Brazilian Portuguese. It includes 70 copyright-free classics (61 from Portugal and 9 from Brazil) published before 1940.

Resource Type:Corpus
Media Type:Text
Language:Portuguese
LX-Battig

The LX-Battig was created from Battig test.set (Baroni et al., 2010). This data set has 83 concrete concepts in the following 10 categories: mammals, birds, fish, vegetables, fruit, trees, vehicles, clothes, tools and kitchenware. The category names and the concepts were translated by two trans...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
COVID-19 ANTIBIOTIC dataset. Multilingual (CEF languages)

Multilingual (CEF languages) corpus acquired from the website https://antibiotic.ecdc.europa.eu/. It contains 20,981 translation units (TUs) in total for EN-X language pairs, where X is a CEF language.

Resource Type:Corpus
Media Type:Text
Languages:Bokmål, Norwegian; Norwegian Bokmål
Bulgarian
Croatian
Czech
Danish
Dutch; Flemish
English
Estonian
Finnish
French
German
Greek, Modern (1453-)
Hungarian
Icelandic
Irish
Italian
Latvian
Lithuanian
Maltese
Moldavian; Moldovan
Polish
Portuguese
Romanian
Slovak
Slovenian
Spanish; Castilian
Swedish
CINTIL-Definitions

The corpus presented here is a collection of several tutorials and scientific papers in the field of Information Technology, with 603 annotated definitions in Portuguese. The texts were collected from the Web at the beginning of 2006 and they are organised in 32 files of three different sub-...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
COVID-19 EUROPARL v2 dataset. Bilingual (EN-PT)

Bilingual (EN-PT) corpus acquired from the website (https://www.europarl.europa.eu/) of the European Parliament (9th May 2020).

Resource Type:Corpus
Media Type:Text
Languages:English
Portuguese
Portuguese Parliamentary Corpus 4.0

The Portuguese Parliamentary Corpus is part of the Multilingual ParlaMint Corpus, a set of comparable corpora containing transcriptions of parliamentary debates from 29 European countries and autonomous regions. The Portuguese corpus (ParlaMint-PT) comprises transcripts of sessions in the time pe...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
CINTIL-LogicalFormBank

The CINTIL-LogicalFormBank (Branco, 2009, and Branco et al., 2011) is a corpus of semantic dependencies of sentences from Portuguese texts composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), novels (399 sentences; 3,082...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
Parallel Global Voices (English - Polish) (Processed)

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Parallel Global Voices EN-PL is a parallel corpus genera...

Resource Type:Corpus
Media Type:Text
Languages:English
Polish
Priberam Compressive Summarization Corpus

This is a corpus for multi-document summarization for European Portuguese. It contains 80 topics, each of which has 10 documents, for a total of 800 documents. Each topic contains two human summaries. The summaries are compressive: they are the result of a compression of the sentences in the orig...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
CINTIL-PropBank

The CINTIL-PropBank (Branco et al., 2012) is a corpus of sentences annotated with their constituency structure and semantic role tags, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
