CINTIL-Corpus Internacional do Português

CINTIL-Corpus Internacional do Português is a linguistically interpreted corpus of Portuguese. At present it is composed of 1 million annotated tokens, verified by human expert annotators. The annotation comprises information on part-of-speech, open classes lemma and inflection, multi-word expres...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
LT Corpus

The LT Corpus (Literary Corpus) contains approximately 1,781,083 running words of European and Brazilian Portuguese. It includes 70 copyright-free classics (61 from Portugal and 9 from Brazil) published before 1940.

Resource Type:Corpus
Media Type:Text
Language:Portuguese
Spoken Corpus Mozambique

The Spoken Corpus Mozambique contains approximately 121,958 running words of spoken Portuguese from Mozambique. It includes 40 transcriptions of recordings, totaling 40 hours, made between 1986 and 1987.

Resource Type:Corpus
Media Type:Text
Language:Portuguese
EUROPARL Corpus Parallel Corpora: Portuguese-English

The EUROPARL Corpus (Portuguese-English subpart of the parallel corpora), available at http://www.statmt.org/europarl/, was extracted from the proceedings of the European Parliament (Koehn, 2005). It contains transcriptions of sessions from 1996 to 2011, totaling approximately 58...

Resource Type:Corpus
Media Type:Text
Languages:English, Portuguese
PTPARL Corpus

The PTPARL Corpus contains approximately 975,806 running words of European Portuguese. It includes 1,076 texts consisting of adapted transcriptions of Portuguese parliament sessions, which were made available in 2004.

Resource Type:Corpus
Media Type:Text
Language:Portuguese
SIMPLE Portuguese Lexicon

The SIMPLE Portuguese Lexicon comprises 10,438 semantically encoded entries, following the PAROLE common encoding standards.

Resource Type:Lexical / Conceptual
Media Type:Text
Language:Portuguese
Uplug

Uplug (see Tiedemann, 2003a) is a collection of tools and scripts for processing text corpora, for automatic alignment, and for term extraction from parallel corpora. Several tools have been integrated in Uplug. Pre-processing tools include a sentence splitter, a general tokenizer and wrappers a...

Resource Type:Tool / Service
CINTIL-Definitions

The corpus presented here is a collection of tutorials and scientific papers in the field of Information Technology, with 603 annotated definitions in Portuguese. The texts were collected from the Web at the beginning of 2006 and are organised in 32 files of three different sub-...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
MSTParser

MSTParser is a non-projective dependency parser (see McDonald et al., 2005a, 2006) that searches for maximum spanning trees over directed graphs. Models of dependency structure are based on large-margin discriminative training methods (see McDonald et al., 2005b). Projective parsing is also suppo...

Resource Type:Tool / Service
PAROLE Portuguese Annotated Corpus

The PAROLE Portuguese Corpus – tagged subset contains 250,000 tokens and is a subset of the PAROLE Portuguese Corpus of 3 million running words of European Portuguese. The corpus was classified and encoded according to the common core PAROLE encoding standard. The tagged subset reproduces appro...

Resource Type:Corpus
Media Type:Text
Language:Portuguese
