Portuguese Parliamentary Corpus 4.0

ParlaMint-PT 4.0

The Portuguese Parliamentary Corpus is part of the Multilingual ParlaMint Corpus, a set of comparable corpora containing transcriptions of parliamentary debates from 29 European countries and autonomous regions. The Portuguese corpus (ParlaMint-PT) comprises transcripts of sessions held between 1 January 2015 and 22 March 2022. It is divided into two subcorpora according to the period covered: (i) the reference subcorpus covers sessions from 1 January 2015 to 31 October 2019; (ii) the COVID subcorpus covers sessions from 1 November 2019 to 22 March 2022. The time periods, and the division into two subcorpora based on the start of media coverage of COVID, follow the general ParlaCLARIN guidelines and procedures for parliamentary corpora (Erjavec and Pančur, 2019). The corpus has approximately 17M words.

For each speaker, the corpus provides the ID, name and surname(s), birth date, death date, gender, political affiliation (only for MPs, not for occasional speakers), and status (role and role description). For political parties it records the abbreviation, the full name (in Portuguese), and the party ID (which is the same as the abbreviation). The session-file metadata covers date-stamped mandates, sessions and speeches. Each session contains the transcripts of the speeches, divided into utterances and paragraphs. The transcripts also contain the transcribers’ commentary, which was retained and encoded. Each speech turn (i.e. utterance) is accompanied by the date, speaker ID, and role of the speaker (chair, regular or guest).

POS tagging was performed with the MBT tagger (Daelemans et al., 1996) trained on the CINTIL corpus (Barreto et al., 2006). We adapted the tagset to conform to the UD POS tags used in ParlaMint.
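
The split into reference and COVID subcorpora described above depends only on the session date, so it can be illustrated with a short Python sketch. The dates are the ones given in this description; the function name `subcorpus_for` and the labels "reference" and "covid" are hypothetical and are not taken from the corpus encoding.

```python
from datetime import date

# Period boundaries as stated in the corpus description.
CORPUS_START = date(2015, 1, 1)
COVID_START = date(2019, 11, 1)
CORPUS_END = date(2022, 3, 22)


def subcorpus_for(session_date: date) -> str:
    """Return a hypothetical subcorpus label for a session date.

    The labels "reference" and "covid" are illustrative assumptions,
    not identifiers used in the distributed files.
    """
    if not (CORPUS_START <= session_date <= CORPUS_END):
        raise ValueError(f"{session_date} is outside the period covered by the corpus")
    # Sessions from 1 November 2019 onwards belong to the COVID subcorpus.
    return "covid" if session_date >= COVID_START else "reference"
```

For example, a session held on 31 October 2019 falls in the reference subcorpus, while one held on the following day falls in the COVID subcorpus.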
The CINTIL corpus includes NER annotation. We lemmatized the corpus with MBLEM (van den Bosch and Daelemans, 1999), which combines dictionary lookup with a machine learning algorithm to produce lemmas. The dictionary is based on an in-house list of wordform and POS-tag combinations mapped to lemmas. It contains 102,196 word forms and 27,860 lemmas, yielding 120,768 wordform-lemma combinations. The adaptation of the MBT tagger and the MBLEM lemmatizer is described in Généreux et al. (2012). The UD relations were assigned with the LX-UD dependency parser, adapted to the set of POS tags and relation types used in ParlaMint.
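
To make the dictionary-lookup part of the lemmatization concrete, here is a minimal Python sketch of a lexicon keyed on wordform and POS tag. It is not MBLEM's actual interface or algorithm: the function `lemmatize`, the toy lexicon entries, the lowercasing of word forms, and the optional `fallback` callable (a stand-in for the machine learning component) are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical lexicon type: (lowercased wordform, UD POS tag) -> lemma.
Lexicon = Dict[Tuple[str, str], str]

# Toy entries for illustration only; not taken from the in-house list.
TOY_LEXICON: Lexicon = {
    ("deputados", "NOUN"): "deputado",
    ("como", "VERB"): "comer",   # "(eu) como" = "I eat"
    ("como", "SCONJ"): "como",   # "como" = "as"
}


def lemmatize(
    wordform: str,
    pos: str,
    lexicon: Lexicon,
    fallback: Optional[Callable[[str, str], str]] = None,
) -> str:
    """Look up a lemma by (wordform, POS); defer to `fallback` if unknown.

    `fallback` is a placeholder for a learned lemmatizer. If none is
    given, the lowercased wordform is returned unchanged.
    """
    key = (wordform.lower(), pos)  # lowercasing is an assumption of this sketch
    if key in lexicon:
        return lexicon[key]
    if fallback is not None:
        return fallback(wordform, pos)
    return wordform.lower()
```

Keying on the POS tag as well as the wordform is what lets the same word form map to more than one lemma, which is consistent with the dictionary holding more wordform-lemma combinations than word forms.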
