Search and Browse – PORTULAN CLARIN

Tweet corpus

Resource Type:	Corpus
Media Type:	Text
Language:	English

C-ORAL-ROM_EXM

This resource includes a spoken corpus with approximately 300,000 words, covering both formal (152,755 words) and informal (165,838 words) speech, with aligned sound and orthographic transcription and POS-tag information.

Resource Type:	Corpus
Media Types:	Text
Media Types:	Audio
Language:	Portuguese

Fundamental Portuguese

This resource includes a spoken Portuguese corpus, with aligned sound and orthographic transcription, collected among sociolinguistically diverse speakers. It consists of recordings of informal conversations.

Resource Type:	Corpus
Media Types:	Text
Media Types:	Audio
Language:	Portuguese

Spoken Corpus Mozambique

The Spoken Corpus Mozambique contains 121,958 running words of spoken Portuguese from Mozambique. It includes 40 transcriptions of spoken recordings (40 hours in total) made between 1986 and 1987.

Resource Type:	Corpus
Media Type:	Text
Language:	Portuguese

SpeakerID

SpeakerID is a corpus of 100 spoken sentences and pseudosentences in European Portuguese (PT) and Mandarin Chinese (CH) designed to enable research on speaker identity. The utterances were recorded by five male speakers of European Portuguese (Speakers A-E) and five male speakers of Mandarin Chi...

Resource Type:	Corpus
Media Types:	Text
Media Types:	Audio
Languages:	Chinese
Languages:	Portuguese

Perfil Sociolinguístico da Fala Bracarense

Perfil Sociolinguístico da Fala Bracarense is a Portuguese speech corpus with 90 hours of recorded spontaneous speech, aligned with its transcription in EXMARaLDA format. The corpus is composed of 1h interviews with speakers of the same area (around Braga, Portugal), stratified according to sex,...

Resource Type:	Corpus
Media Types:	Text
Media Types:	Audio
Language:	Portuguese

Carolina: General Corpus of Contemporary Brazilian Portuguese with provenance and typology information

Carolina is an open corpus for Linguistics and Artificial Intelligence with a robust volume of texts of varied typology in contemporary Brazilian Portuguese (1970-2021).

Resource Type:	Corpus
Media Type:	Text
Language:	Brazilian Portuguese

HESITA database

The HESITA database is a corpus of daily television news collected over a month and annotated for hesitation events, acoustic environments, speaking styles, speaker characteristics, and respiratory events, among other characteristic sounds.

Resource Type:	Corpus
Media Types:	Text
Media Types:	Audio
Language:	Portuguese

EmoProsodyPort

EmoProsodyPort (see Castro & Lima, 2010) is a speech database with 368 short sentences and pseudosentences with neutral emotional content. It also includes acoustic measurements and behavioral data.

Resource Type:	Corpus
Media Type:	Audio
Language:	Portuguese

Georeferenced Tweets

Tweets annotated with geographic coordinates

Resource Type:	Corpus
Media Type:	Text
Language:	English
