PORTULAN CLARIN

Albertina PT-* is a foundation large language model for the Portuguese language.

It is an encoder of the BERT family, based on the Transformer neural architecture and developed over the DeBERTa model, with highly competitive performance for this language. It comes in versions trained for different variants of Portuguese (PT), namely the European variant from Portugal (PT-PT) and the American variant from Brazil (PT-BR), and in different model sizes. It is distributed free of charge under a most permissive license.
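As a rough illustration of how an encoder of this kind is typically queried, the sketch below runs a masked-token prediction through the Hugging Face `transformers` fill-mask pipeline. The model identifier, the `{mask}` placeholder convention and the helper names are assumptions made for this example and are not taken from this page; the model's own HuggingFace page remains the authoritative source for usage instructions.

```python
# Assumption: this model identifier is illustrative. Check the HuggingFace
# page of the Albertina version and size you want for the exact name.
MODEL_ID = "PORTULAN/albertina-100m-portuguese-ptpt-encoder"


def build_masked_sentence(template: str, mask_token: str) -> str:
    """Replace the {mask} placeholder with the tokenizer's own mask token."""
    if "{mask}" not in template:
        raise ValueError("template must contain a {mask} placeholder")
    return template.replace("{mask}", mask_token)


def top_predictions(template: str, model_id: str = MODEL_ID, k: int = 5):
    """Return the k most likely fillers for the masked slot as (token, score) pairs."""
    # Imported lazily so build_masked_sentence works without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id)
    sentence = build_masked_sentence(template, unmasker.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in unmasker(sentence, top_k=k)]
```

Calling `top_predictions("A culinária portuguesa é rica em sabores e {mask}.")` downloads the model on first use and returns the candidate tokens with their scores. Reading the mask token from the tokenizer avoids hard-coding a string that may differ between model families.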

Albertina PT-PT

Albertina PT-PT is the version for European Portuguese from Portugal. To the best of our knowledge, it is an encoder specifically for this language and variant that, at the time of its initial distribution, set a new state of the art for it, and it is publicly available and distributed for reuse.

This model is available in three sizes: 1.5 billion, 900 million and 100 million parameters. Visit the respective HuggingFace pages for each of these models for instructions on how to use them in your experiments.

Albertina PT-BR (no brWaC)

Albertina PT-BR (no brWaC) is a version for American Portuguese from Brazil trained on data sets other than brWaC, and is therefore distributed under a most permissive license.

This model is available in three sizes: 1.5 billion, 900 million and 100 million parameters. Visit the respective HuggingFace pages for each of these models for instructions on how to use them in your experiments.

Albertina PT-BR (brWaC)

Albertina PT-BR is the version for American Portuguese from Brazil, trained on the brWaC data set.

Visit the HuggingFace page for the Albertina PT-BR (brWaC) model for instructions on how to use this model in your experiments.

BERTimbau is a pretrained BERT model for American Portuguese from Brazil that achieves state-of-the-art performance on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment. It is available in two sizes: Base and Large.

BERTimbau Base

Visit the HuggingFace page for the BERTimbau Base model for instructions on how to use this model in your experiments.

BERTimbau Large

Visit the HuggingFace page for the BERTimbau Large model for instructions on how to use this model in your experiments.

Gervásio PT-* is a foundation large language model for the Portuguese language.

It is a decoder of the LLaMA family, based on the Transformer neural architecture and developed over the LLaMA-3 8B model. It was further improved through additional training on language resources that include new instruction data sets for Portuguese prepared for this purpose (extraGLUE-Instruct). All versions of Gervásio are distributed for free under an open license, for both research and commercial purposes, and, given their size, can be run on consumer-grade hardware.
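The remark about consumer-grade hardware can be made concrete with back-of-the-envelope arithmetic. The sketch below estimates the memory needed to hold the weights of a nominal 8-billion-parameter model at several numeric precisions. The precisions listed and the function names are assumptions chosen for illustration, and the figures cover weights only, leaving out activations, the attention cache and framework overhead.

```python
# Assumption: "8B" is treated as exactly 8 billion parameters; the real
# parameter count of the underlying model may differ slightly.
NOMINAL_PARAMETERS = 8_000_000_000


def weight_memory_gb(num_parameters: int, bits_per_parameter: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    if num_parameters <= 0 or bits_per_parameter <= 0:
        raise ValueError("both arguments must be positive")
    return num_parameters * bits_per_parameter / 8 / 1e9


def footprint_by_precision(num_parameters: int = NOMINAL_PARAMETERS) -> dict:
    """Map bits per parameter (32, 16, 8, 4) to the estimated weight memory in GB."""
    return {bits: weight_memory_gb(num_parameters, bits) for bits in (32, 16, 8, 4)}
```

Under these assumptions the weights take about 16 GB at 16-bit precision and about 4 GB at 4-bit precision, which is why running the model on a single consumer GPU usually relies on reduced precision or quantization.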

Gervásio PT-PT is the version for European Portuguese from Portugal. Visit the HuggingFace page for the Gervásio PT-PT model for instructions on how to use this model in your experiments.

Serafim PT-* is a foundation large language model for the Portuguese language.

It is a sentence encoder (embedder) based on the Albertina and BERTimbau encoders, with highly competitive performance for this language. It is distributed free of charge under a most permissive license.
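To illustrate what a sentence encoder is used for, the sketch below embeds sentences with the `sentence-transformers` library and compares two embeddings by cosine similarity. The model identifier and the helper names are assumptions made for this example; the HuggingFace page of the chosen Serafim model gives the exact identifier and recommended usage.

```python
import math
from typing import List, Sequence

# Assumption: this model identifier is illustrative. Check the HuggingFace
# page of the Serafim size you want for the exact name.
MODEL_ID = "PORTULAN/serafim-100m-portuguese-pt-sentence-encoder"


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def embed(sentences: List[str], model_id: str = MODEL_ID) -> List[List[float]]:
    """Encode sentences into fixed-size vectors, one vector per sentence."""
    # Imported lazily so cosine_similarity works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    return model.encode(sentences).tolist()
```

A typical use is `a, b = embed(["O gato dorme no sofá.", "Um felino descansa no sofá."])` followed by `cosine_similarity(a, b)`; sentences with similar meaning are expected to score closer to 1 than unrelated ones.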

This model is available in three sizes: 900 million, 335 million and 100 million parameters. Visit the respective HuggingFace pages for each of these models for instructions on how to use them in your experiments.
