Search and Browse – PORTULAN CLARIN

UIMA/U-Compare GENIA Tokeniser (GENIA Tagger)

Tokenisation is one of the functionalities of the GENIA tagger, which additionally outputs the base forms, part-of-speech tags, chunk tags, and named entity tags. The tagger is specifically tuned for biomedical text such as MEDLINE abstracts. The tool is a UIMA component, which forms part of th...

Resource Type:	Tool / Service
Language:	English

U-Compare Apertium Part-of-Speech Tagging Workflow

This is a workflow that is designed especially for use in the UIMA-based U-Compare workbench (see separate META-SHARE record). The workflow is in "ucz" format (specific to U-Compare) and can be imported via the "Import Workflow" item in the "Workflows" menu of the U-Compare interface. It include...

Resource Type:	Tool / Service
Languages:	Basque
	Catalan; Valencian
	English
	Galician
	Portuguese
	Spanish; Castilian

ExtraGLUE

ExtraGLUE is a Portuguese dataset obtained by the automatic translation of some of the tasks in the GLUE and SuperGLUE benchmarks. Two variants of Portuguese are considered, namely European Portuguese and American Portuguese. The 14 tasks in extraGLUE cover different aspects of language unders...

Resource Type:	Corpus
Media Type:	Text
Language:	Portuguese

ExtraGLUE-instruct

ExtraGLUE-instruct is a data set with examples from tasks, with instructions and with prompts that integrate instructions and examples, for both the European variant of Portuguese, spoken in Portugal, and the American variant of Portuguese, spoken in Brazil. For each variant, it contains over 170...

Resource Type:	Corpus
Media Type:	Text
Language:	Portuguese

AuCoPro - Splitting

The AuCoPro-Splitting dataset contains compounds annotated with their compound boundaries and linking morphemes. The dataset consists of two files, one for Afrikaans and one for Dutch. The annotation was performed according to annotation guidelines as described in Verhoeven, van Zaanen, van Huyss...

Resource Type:	Lexical / Conceptual
Media Type:	Text
Languages:	Afrikaans
	Dutch; Flemish

DVPM-browser

DVPM-browser is a browser for the DVPM lexical database of medieval Portuguese.

Resource Type:	Tool / Service

COVID-19 EU presscorner v1 dataset. Multilingual (CEF languages)

Multilingual (CEF languages) corpus acquired from the website (https://ec.europa.eu/commission/presscorner/) of the EU portal (14th May 2020). It contains 23 TMX files (EN-X, where X is a CEF language) with 83217 translation units (TUs) in total.

Resource Type:	Corpus
Media Type:	Text
Languages:	Bulgarian
	Croatian
	Czech
	Danish
	Dutch; Flemish
	English
	Estonian
	Finnish
	French
	German
	Greek, Modern (1453-)
	Hungarian
	Irish
	Italian
	Latvian
	Lithuanian
	Maltese
	Moldavian; Moldovan
	Polish
	Portuguese
	Romanian
	Slovak
	Slovenian
	Spanish; Castilian
	Swedish

COVID-19 EC-EUROPA v1 dataset. Multilingual (CEF languages)

Multilingual (CEF languages) corpus acquired from the website (https://ec.europa.eu/*coronavirus-response) of the EU portal (20th May 2020). It contains 23 TMX files (EN-X, where X is a CEF language) with 53311 translation units (TUs) in total.

Resource Type:	Corpus
Media Type:	Text
Languages:	Bulgarian
	Croatian
	Czech
	Danish
	Dutch; Flemish
	English
	Estonian
	Finnish
	French
	German
	Greek, Modern (1453-)
	Hungarian
	Irish
	Italian
	Latvian
	Lithuanian
	Maltese
	Moldavian; Moldovan
	Polish
	Portuguese
	Romanian
	Slovak
	Slovenian
	Spanish; Castilian
	Swedish

Hontology

Hontology (H stands for hotel, hostal and hostel) (available at http://ontolp.inf.pucrs.br/Recursos/downloads-Hontology.php) is a new, freely available multilingual ontology for the accommodation sector, containing 282 concepts categorized into 16 top-level concepts. The concepts of other voca...

Resource Type:	Lexical / Conceptual
Media Type:	Text
Languages:	Portuguese
	English
	Spanish
	French

FLY corpus - morpho

The FLY Corpus is composed of 2000 informal letters written in Portuguese between 1900 and 1974, in the context of war, migration, imprisonment and exile. Each letter is in an XML file with two main parts: (a) the header, which contains metadata about the document (the ...

Resource Type:	Corpus
Media Type:	Text
Language:	Portuguese
