PS corpus (Post-Scriptum)-ES

PS Corpus (Post-Scriptum)-ES is a corpus of 2368 informal mail letters written in Spanish during the Modern Ages (from the XVIth century to the beginning of the XIXth century). Each letter is available as a semi-palaeographic transcription, a modernized transcription, and with part-of-speech a...

Resource Type:Corpus
Media Type:Text
Language:Spanish; Castilian
The CIEMPIESS Proper-Names Pronouncing Dictionary

Transcriptions in the CIEMPIESS-PNPD are based on a phonetic alphabet called Mexbet. Mexbet was design for the Spanish of Central Mexico and it has several levels of granularity. The CIEMPIESS-PNPD comes in two versions: Mexbet T29 and Mexbet T66. Level T29 of Mexbet means that transcriptions ...

Resource Type:Corpus
Media Type:Text
Language:Spanish; Castilian
Alignment of Parallel Texts from Cyrillic to Latin

The text of the novel Sania (eng. The Sledge) served as a training corpus. It was written in 1955 by Ion Druță and printed originally in Cyrillic scripts. We have followed a special previously developed technology of recognition and specialized lexicons. In such a way, we have obtained the electr...

Resource Type:Corpus
Media Type:Text
Language:Romanian
Albertina PT-BR No-brWaC

Albertina PT-* is a foundation, large language model for the Portuguese language. It is an encoder of the BERT family, based on the neural architecture Transformer and developed over the DeBERTa model, and with most competitive performance for this language. It has different versions that were...

Resource Type:Language Description
Media Type:Text
Language:Portuguese
Multifunctional Computational Lexicon of Contemporary Portuguese

The resource consists of a Portuguese frequency lexicon based on a 16 million words corpus of written and spoken texts from different genres. The lexicon contains 26.443 entries (lemma) and 140

Resource Type:Lexical / Conceptual
Media Type:Text
Language:Portuguese
CINTIL-WordSenses

The CINTIL-WordSenses corpus, built upon the CINTIL International Corpus of Portuguese (Barreto et al., 2006), is composed of 23,825 sentences of written Portuguese with open-class terms manually disambiguated and annotated with synset identifiers from the Portuguese MultiWordNet (MWNPT) (Pianti ...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
PAROLE Portuguese Annotated Corpus

The PAROLE Portuguese Corpus – tagged subset contains 250.000 tokens and is a subset of the PAROLE Portuguese Corpus of 3 million running words of European Portuguese. The corpus was classified and encoded according to the common core parole encoding standard. The tagged subset reproduces appro...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
LX-4WAnalogies

The test set described in was used as the basis for the assessment of word embeddings. An example entry in this data set would read: ‘Berlin Germany Lisbon Portugal’. With these four words relations – as in this example – one can test semantic analogies by using any of the possible combinations o...

Resource Type:Corpus
Media Type:Text
Language:Portuguese

Order by:

Filter by: