Tweet corpus
This resource includes a spoken corpus with approximately 300.000 words, covering both formal (152.755 words) and informal (165.838 words) speech, with aligned sound and orthographic transcription and POS-tag information.
This resource includes a spoken Portuguese corpus - with aligned sound and orthographic transcription -, collected among sociolinguistically diverse speakers. It consists of recordings from informal conversations.
The Spoken Corpus Mozambique contains approximately 121,958 running words of spoken Portuguese from Mozambique. It includes 40 transcriptions of spoken recordings (in a total of 40 hours of recordings) that were recorded between 1986 and 1987.
SpeakerID is a corpus of 100 spoken sentences and pseudosentences in European Portuguese (PT) and Mandarin Chinese (CH) designed to enable research on speaker identity. The utterances were recorded by five male speakers of European Portuguese (Speakers A-E) and five male speakers of Mandarin Chi...
Perfil Sociolinguístico da Fala Bracarense is a Portuguese speech corpus with 90 hours of recorded spontaneous speech, aligned with its transcription in EXMARaLDA format. The corpus is composed by 1h interviews with speakers of the same area (around Braga, Portugal), stratified according to sex,...
Carolina is an open corpus for Linguistics and Artificial Intelligence with a robust volume of texts of varied typology in contemporary Brazilian Portuguese (1970-2021).
The HESITA database is a corpus consisting of television daily news collected over a month and was annotated regarding to hesitation events, acoustical environments, speaking styles, speaker characteristics and respiratory events, among other characteristic sounds.
EmoProsodyPort (see Castro & Lima, 2010) is a speech database with 368 short sentences and pseudosentences with neutral emotional content. Acoustic measurements and behavioral data.
Tweets annotated with geographic coordinates