PS Corpus (Post-Scriptum)-ES is a corpus of 2368 informal letters written in Spanish during the Early Modern period (from the 16th century to the beginning of the 19th century). Each letter is available as a semi-palaeographic transcription, a modernized transcription, and with part-of-speech a...
Transcriptions in the CIEMPIESS-PNPD are based on a phonetic alphabet called Mexbet. Mexbet was designed for the Spanish of Central Mexico and has several levels of granularity. The CIEMPIESS-PNPD comes in two versions: Mexbet T29 and Mexbet T66. Level T29 of Mexbet means that transcriptions ...
The text of the novel Sania (Eng. The Sledge) served as a training corpus. It was written in 1955 by Ion Druță and originally printed in Cyrillic script. We applied a previously developed recognition technology together with specialized lexicons. In this way, we have obtained the electr...
Albertina PT-* is a large foundation language model for the Portuguese language. It is an encoder of the BERT family, based on the Transformer neural architecture and developed over the DeBERTa model, with highly competitive performance for this language. It has different versions that were...
The resource consists of a Portuguese frequency lexicon based on a 16-million-word corpus of written and spoken texts from different genres. The lexicon contains 26,443 entries (lemmas) and 140
The CINTIL-WordSenses corpus, built upon the CINTIL International Corpus of Portuguese (Barreto et al., 2006), is composed of 23,825 sentences of written Portuguese with open-class terms manually disambiguated and annotated with synset identifiers from the Portuguese MultiWordNet (MWNPT) (Pianti ...
The PAROLE Portuguese Corpus – tagged subset contains 250,000 tokens and is a subset of the PAROLE Portuguese Corpus of 3 million running words of European Portuguese. The corpus was classified and encoded according to the common core PAROLE encoding standard. The tagged subset reproduces appro...
The test set described in was used as the basis for the assessment of word embeddings. An example entry in this data set would read: ‘Berlin Germany Lisbon Portugal’. With such four-word relations, as in this example, one can test semantic analogies by using any of the possible combinations o...
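As an illustration of how a four-word entry such as ‘Berlin Germany Lisbon Portugal’ can be used to test semantic analogies, the following is a minimal sketch and not the evaluation procedure of the resource itself. It assumes the common vector-offset method (predict the fourth word as the nearest neighbour of b - a + c by cosine similarity). The toy two-dimensional vectors, the distractor word "Madrid", and the function names `solve_analogy` and `analogy_accuracy` are hypothetical and serve only to make the example self-contained.

```python
import math

# Hypothetical toy embeddings (assumption): dimension 0 loosely encodes the
# country, dimension 1 encodes "capital-ness". Real embeddings are learned.
TOY_EMBEDDINGS = {
    "Berlin": [1.0, 1.0],
    "Germany": [1.0, 0.0],
    "Lisbon": [-1.0, 1.0],
    "Portugal": [-1.0, 0.0],
    "Madrid": [-0.5, 1.0],  # distractor word
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(embeddings, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    # The three query words are excluded from the candidates, as is customary.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


def analogy_accuracy(embeddings, entries):
    """Fraction of 'a b c d' entries whose fourth word is predicted correctly."""
    correct = 0
    for entry in entries:
        a, b, c, d = entry.split()
        if solve_analogy(embeddings, a, b, c) == d:
            correct += 1
    return correct / len(entries)


if __name__ == "__main__":
    entries = ["Berlin Germany Lisbon Portugal"]
    print(solve_analogy(TOY_EMBEDDINGS, "Berlin", "Germany", "Lisbon"))
    print(analogy_accuracy(TOY_EMBEDDINGS, entries))
```

Only one ordering of each entry is evaluated here; other combinations of the four words can be tested by passing reordered entries, for example "Lisbon Portugal Berlin Germany".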