VIDiom-PT

VIDiom-PT is a European Portuguese corpus annotated for verbal idioms, designed to support NLP applications in idiom processing. The resulting corpus comprises 5,178 annotated instances covering 747 distinct verbal idioms. The annotation process was validated through an inter-annotator agreement ...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
Anonymised ParaCrawl release 7 Portuguese-English

This corpus was produced by running BiRoamer (https://github.com/bitextor/biroamer) to anonymise the Portuguese-English parallel data from release 7 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with ...

Resource Type:Corpus
Media Type:Text
Languages:English, Portuguese
BioLexicon

The BioLexicon is a large-scale, wide-coverage computational lexicon covering the biomedical domain. A large part of the lexicon is concerned with covering biomedical terms and their variants. Entries for domain-specific verbs include syntactic and semantic information. The lexicon includes entri...

Resource Type:Corpus
Media Type:Text
Language:English
Portuguese Parish Memories (1758)

«The Memórias Paroquiais (Parish Memories) are an essential source for obtaining a radiography of Portugal in 1758-1761. They correspond to a survey, organized in 3 major parts (the locality itself, the mountain and the river), which was printed and sent to those responsible for the dioceses of t...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
LX-4WAnalogiesBR

The test set described in was used as the basis for the assessment of word embeddings. An example entry in this data set would read: ‘Berlin Germany Lisbon Portugal’. With these four-word relations – as in this example – one can test semantic analogies by using any of the possible combinations o...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
LX-4WAnalogies

The test set described in was used as the basis for the assessment of word embeddings. An example entry in this data set would read: ‘Berlin Germany Lisbon Portugal’. With these four-word relations – as in this example – one can test semantic analogies by using any of the possible combinations o...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
COVID-19 ANTIBIOTIC dataset. Bilingual (EN-PT)

Bilingual (EN-PT) corpus acquired from the website https://antibiotic.ecdc.europa.eu/

Resource Type:Corpus
Media Type:Text
Languages:English, Portuguese
CORP-ORAL

CORP-ORAL is a spontaneous speech corpus for European Portuguese. It is the main output of two R&D projects: CORP-ORAL and ORAL-PHON. The data consist of unscripted and unprompted face-to-face dialogues between family, friends, colleagues and unacquainted participants. All recordings are orthogra...

Resource Type:Corpus
Media Type:Audio
Language:Portuguese
FEUP Tweets

Tweet corpus

Resource Type:Corpus
Media Type:Text
Language:English
Georeferenced Tweets

Tweets annotated with geographic coordinates

Resource Type:Corpus
Media Type:Text
Language:English
