RedditPT Dataset

This dataset is a collection of dialogues extracted from the Portugal subreddit with RDET (Reddit Dataset Extraction Tool). It contains 58,964,715 tokens in 218,550 dialogues.

Resource Type:Corpus
Media Type:Text
Language:Portuguese
BDCamões DependencyBank (Part I)

BDCamões Corpus - Collection of Portuguese Literary Documents from the Digital Library of Camões I.P., is a collection of literary documents written in Portuguese, in plain text .txt format, with close to 4 million words from over 200 complete documents from 83 authors in 14 genres, covering a ti...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
Romanian-English corpus with studies, reports and statistical data in the field of culture from the National Institute for Cultural Research and Training website (Processed)

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Romanian-English corpus with studies, reports and statis...

Resource Type:Corpus
Media Type:Text
Languages:Moldavian; Moldovan, Romanian
Perfil Sociolinguístico da Fala Bracarense

Perfil Sociolinguístico da Fala Bracarense is a Portuguese speech corpus with 90 hours of recorded spontaneous speech, aligned with its transcription in EXMARaLDA format. The corpus is composed of one-hour interviews with speakers from the same area (around Braga, Portugal), stratified according to sex,...

Resource Type:Corpus
Media Types:Text, Audio
Language:Portuguese
Corpus de Produções Escritas de Aprendentes de PL2 (PEAPL2)

A learner corpus of written texts in Portuguese as a non-native language, with three independent subcorpora: - Portuguese as a Foreign Language: Subcorpus Português Língua Estrangeira (PEAPL2_PLE) http://teitok2.iltec.pt/peapl2-ple/index.php?action=home - East Timorese Portuguese: Subcorpus T...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
CIPM

CIPM is a set of historical, religious, notarial, and literary texts in prose and verse, written in Medieval Portuguese. It has around 3.5 million words.

Resource Type:Corpus
Media Type:Text
Language:Portuguese
Portuguese Parliamentary Corpus 4.0

The Portuguese Parliamentary Corpus is part of the Multilingual ParlaMint Corpus, a set of comparable corpora containing transcriptions of parliamentary debates from 29 European countries and autonomous regions. The Portuguese corpus (ParlaMint-PT) comprises transcripts of sessions in the time pe...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
CINTIL-Definitions

The corpus presented here is a collection of tutorials and scientific papers in the field of Information Technology, with 603 annotated definitions in Portuguese. The texts were collected from the Web at the beginning of 2006 and they are organised in 32 files of three different sub-...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
Perfil Sociolinguístico da Fala Bracarense - POS

Perfil Sociolinguístico da Fala Bracarense - POS is a manually verified part-of-speech annotation of the EXMARaLDA transcriptions in "Perfil Sociolinguístico da Fala Bracarense", a Portuguese speech corpus with 90 hours of recorded spontaneous speech, aligned with its transcription in EXMARaLDA f...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
Carolina: General Corpus of Contemporary Brazilian Portuguese with provenance and typology information

Carolina is an open corpus for Linguistics and Artificial Intelligence with a robust volume of texts of varied typology in contemporary Brazilian Portuguese (1970-2021).

Resource Type:Corpus
Media Type:Text
Language:Brazilian Portuguese
