CIPM-POS is a set of historical, religious, notarial, literary texts in prose and verse, written is medieval portuguese. It contains around 88000 words.
The HESITA database is a corpus consisting of television daily news collected over a month and was annotated regarding to hesitation events, acoustical environments, speaking styles, speaker characteristics and respiratory events, among other characteristic sounds.
DVPM-EtyMor is a lexical database. Etymological, morphological and textual exemplification. Around 3000 verbs. Language: Medieval portuguese.
Arquivo Dialetal CLUP - Áudio is an audio corpus of spontaneous speech, mainly from Northern Portugal.
CORAA NURC-SP Minimal Corpus is a manually annotated corpus of Brazilian Portuguese spontaneous speech (São Paulo variety). The corpus is a subset of NURC (‘Cultured Linguistic Urban Norm’) project collection, one of the most influential in Brazilian Linguistics. The corpus was brought to digital...
A corpus of opinion articles annotated with arguments, following a claim-premise model.
Perfil Sociolinguístico da Fala Bracarense is a Portuguese speech corpus with 90 hours of recorded spontaneous speech, aligned with its transcription in EXMARaLDA format. The corpus is composed by 1h interviews with speakers of the same area (around Braga, Portugal), stratified according to sex,...
Embeddings used in: Branco, António, João Rodrigues, Małgorzata Salawa, Ruben Branco and Chakaveh Saedi, 2020. Comparative Probing of Lexical Semantics Theories for Cognitive Plausibility and Technological Usefulness. In Proceedings of the International Conference on Computational Linguistics (C...
Based on the MXPOST part of speech tagger and UNITEX dictionaries for Portuguese, this tool produces the lemmas of the words of a text stored in a plain text file. The source code is also provided.
Grafone-LEX is a lexical database for conversion from graphemes to phonemes