GistSumm (GIST SUMMarizer) is a summarization tool for Portuguese. It uses the gist of a text as a guideline to identify and select the text segments to include in the final extract. Automatically produced extracts have been evaluated in terms of gist preservation and textuality.
Gervásio PT-* is a large foundation language model for the Portuguese language. It is a decoder of the GPT family, based on the Transformer neural architecture and developed over the Pythia model, with competitive performance for this language. It has different versions that were trained for ...
Tweets annotated with geographic coordinates
Geo-Net-PT 02 is a public Geospatial Ontology of Portugal (see Chaves et al., 2007), a computational resource (see Rodrigues et al., 2006 and Rodrigues, 2009) for applications demanding geographic information about Portugal, and contains 701,209 concepts stored in a GKB system, most of them admin...
The GENIA tagger analyzes English sentences and outputs the base forms, part-of-speech tags, chunk tags, and named entity tags. The tagger is specifically tuned for biomedical text such as MEDLINE abstracts.
A corpus of 2,000 MEDLINE abstracts, collected using the three MeSH terms human, blood cells and transcription factors. The corpus is available in three formats: 1) A text file containing part-of-speech (POS) annotation, based on the Penn Treebank format, 2) An XML file containing inline POS anno...
The corpus consists of 1,000 MEDLINE abstracts. It is a subset of the original GENIA POS & term corpus, which was selected using the three MeSH terms human, blood cells and transcription factors. In each sentence, three types of information are annotated: 1) biomedical terms are identified and assi...
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Romanian-English corpus built from a Wikipedia dump.
This resource includes a spoken Portuguese corpus, with aligned sound and orthographic transcription, collected among sociolinguistically diverse speakers. It consists of recordings of informal conversations.