RedditPT Dataset

This dataset is a collection of dialogues extracted from the Portugal subreddit with RDET (Reddit Dataset Extraction Tool). It comprises 58,964,715 tokens across 218,550 dialogues.

Resource Type:Corpus
Media Type:Text
Language:Portuguese
Hesita-POS

Hesita-POS is an annotated corpus of TV news.

Resource Type:Corpus
Media Type:Text
Language:Portuguese
EmoVoicePort

EmoVoicePort, Emotional Vocalization Corpus (see Lima, Castro, & Scott, 2013) is a validated set of nonverbal vocalizations that portray four positive emotions (achievement/triumph, amusement, sensual pleasure, relief) and four negative ones (anger, disgust, fear, sadness). The vocalizations (n =...

Resource Type:Corpus
Media Type:Audio
Language:Portuguese
Arquivo Dialetal CLUP - POS

Arquivo Dialetal CLUP - POS is a speech corpus of approximately 40,000 tokens (utterances of spontaneous speech, mainly from Northern Portugal), with orthographic transcription and POS annotation.

Resource Type:Corpus
Media Type:Audio
Language:Portuguese
CIPM-POS

CIPM-POS is a set of historical, religious, notarial, and literary texts in prose and verse, written in medieval Portuguese. It contains around 88,000 words.

Resource Type:Corpus
Media Type:Text
Language:Portuguese
Arquivo Dialetal CLUP - Áudio

Arquivo Dialetal CLUP - Áudio is an audio corpus of spontaneous speech, mainly from Northern Portugal.

Resource Type:Corpus
Media Type:Audio
Language:Portuguese
News corpus categorised

The News corpus developed by LIACC in JSON format was complemented with POS and keyword-topic annotation. POS-tagging: the tagger described in Généreux et al. (2012) was used. The title and text body were extracted, tokenized, and POS-tagged. Two new fields were added...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
CINTIL-UPos

CINTIL-UPos is a corpus of Portuguese that is annotated with the Universal Part-of-Speech tagset (UPOS), related to the Universal Dependency framework, and that contains around 1 million annotated tokens. It is described in this article: António Branco, João Ricardo Silva, Luís Gomes and Jo...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
Chancelaria de D. Afonso III: documentos em português

The Portuguese-language documents of the Chancery of King Afonso III constitute the first significant set of texts in Portuguese (34 documents covering a period of 24 years: 1255-1279); it is only from 1279, with King Dinis (1261-1325), that the systematic use of Portuguese begins...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
LX-LR4DistSemEval

A collection of language resources for the evaluation of distributional semantic models of Portuguese: LX-SimLex-999: http://metashare.metanet4u.eu/go2/lx-simlex-999 LX-Rare Word Similarity Data set: http://metashare.metanet4u.eu/go2/lx-rare-word-similarity-dataset LX-WordSim-353: h...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
