The NPChunks training corpus contains approximately 1,000 sentences, in a total of 24,243 tokens, selected randomly from the written part of the CINTIL corpus (Barreto et al, 2006). The CINTIL corpus is a linguistically interpreted corpus of Portuguese composed of 1 Million annotated tokens from ...
CORDIAL-SIN is a corpus of spoken dialectal European Portuguese developed at Centro de Linguística da Universidade de Lisboa (CLUL). The materials for this corpus were drawn from the recordings of dialect speech collected by the CLUL ATLAS team as fieldwork interviews for linguistic atlases betwe...
A publicação Arquivo dos Açores, consagrada como obra de referência para a investigação histórica sobre o arquipélago dos Açores, conta com duas séries, num total de 20 volumes. A primeira série do Arquivo dos Açores, composta por 15 volumes, decorreu entre 1878 e 1959, com grandes interrupções r...
The corpus was developed as a linguistic resource for Automatic Summarization research and his relation with different issues to engage studies on the discourse treatment. Summ-it consists of fifty texts from Science domain extracted from Science section of Brazilian daily newspaper Folha de Sã...
BDCamões Corpus - Collection of Portuguese Literary Documents from the Digital Library of Camões I.P., is a collection of literary documents written in Portuguese, in plain text .txt format, with close to 4 million words from over 200 complete documents from 83 authors in 14 genres, covering a ti...
CINTIL DependencyBank PREMIUM is a corpus of Portuguese utterances manually annotated with the representation of grammatical dependency relations and the information of part-of-speech, inflection and lemmas. It is being developed and maintained at the University of Lisbon. The current version is ...
Arquivo Dialetal CLUP - ORTH is a speech corpus approximately with 40 000 tokens (Utterances; spontaneous speech, mainly from Northern Portugal). Orthographic and phonetic transcription.
The present tool, that was built to deal with specific issues concerning orthographic conventions adopted for Portuguese, marks sentence boundaries with <s>…</s>, and paragraph boundaries with <p>…</p>. Unwraps sentences split over different lines. A f-score of 99.94% was obtained when testing o...
Based on the MXPOST part of speech tagger and UNITEX dictionaries for Portuguese, this tool produces the lemmas of the words of a text stored in a plain text file. The source code is also provided.
The Spoken Corpus Mozambique contains approximately 121,958 running words of spoken Portuguese from Mozambique. It includes 40 transcriptions of spoken recordings (in a total of 40 hours of recordings) that were recorded between 1986 and 1987.