CORDIAL-SIN – Syntax-oriented Corpus of Portuguese Dialects


CORDIAL-SIN is a corpus of spoken dialectal European Portuguese developed at the Centro de Linguística da Universidade de Lisboa (CLUL). The materials for this corpus were drawn from recordings of dialect speech collected by the CLUL ATLAS team during fieldwork interviews for linguistic atlases between 1974 and 2004. The corpus amounts to c. 650,000 words collected from 42 locations in continental Portugal and the archipelagos of Madeira and the Azores.

The data are linguistically annotated at both the morphosyntactic and the syntactic levels and include mark-up of spoken language phenomena. The original transcription and textual mark-up conventions were based on the scheme designed for the CORAL – Corpus de Diálogo Etiquetado project (cf. CORDIAL-SIN transcription conventions). Part-of-speech tagging and syntactic annotation follow the system originally developed for the Penn Parsed Corpora of Historical English. The annotation guidelines for Portuguese were established in close cooperation with the Tycho Brahe Parsed Corpus of Historical Portuguese team (cf. CORDIAL-SIN POS annotation manual and Syntactic annotation manual).

An XML-TEI edition of the CORDIAL-SIN corpus was recently prepared, in which all the data (transcription, textual mark-up, POS annotation and lemma) are stored in full-fledged XML files that comply with the standards defined by the Text Encoding Initiative. Syntactic annotation adopts a standoff annotation format (see also CORDIAL-SIN treebank in this repository).
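
As a rough illustration of how token-level POS and lemma information might be read from a TEI-encoded file, the Python sketch below parses a small invented sample. The element and attribute names used here (`u`, `w`, `lemma`, `pos`), the tag values and the sample utterance are assumptions made for the example and are not taken from the CORDIAL-SIN files; the actual encoding should be checked against the corpus documentation.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample: the structure, attribute names and tag values are
# hypothetical and only stand in for a real corpus file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <u who="#INF1">
      <w lemma="o" pos="D">o</w>
      <w lemma="pão" pos="N">pão</w>
    </u>
  </body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return (form, lemma, pos) triples for every <w> element, in document order."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        # Attributes that are absent come back as None.
        tokens.append((w.text or "", w.get("lemma"), w.get("pos")))
    return tokens


if __name__ == "__main__":
    for form, lemma, pos in read_tokens(SAMPLE):
        print(form, lemma, pos, sep="\t")
```

Because the syntactic annotation is described as standoff, a reader of the real files would presumably also need to resolve references from the separate syntactic layer to token identifiers; that step is not shown here.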


