Manually annotated corpora for teaching and learning purposes of Brazilian Portuguese, Dutch, Estonian, and Slovene
Handle: | https://hdl.handle.net/21.11129/0000-0010-05DA-3 (persistent URL to this page) |
---|---|
URL: | https://www.uc.pt/celga-iltec/crowll/ |
These are manually annotated corpora for teaching and learning purposes of Brazilian Portuguese, Dutch, Estonian, and Slovene, contributed to the Manually Annotated Corpora Family available in CLARIN. Each sentence is labelled “problematic” or “non-problematic” from the point of view of its use for pedagogical purposes. Sentences labelled as problematic are further annotated with the category of the problem: offensive, vulgar, sensitive content, grammar and/or spelling problems, or incomprehensible and/or lacking context. Each corpus consists of 10,000 sentences, which were annotated by language experts.
These corpora were compiled as part of a larger project to develop the CrowLL game, a gamified solution for further corpus growth based on crowdsourcing techniques. They are used as “seed” corpora in the game, i.e., as the starting point for the crowdsourced development of larger corpora.
Annotation guidelines for each of the languages are included. A general annotation guideline in English is also included, so that non-speakers of those languages can learn how the annotation was carried out.
These corpora will allow language teachers, materials developers and lexicographers to use the annotation to (de)select content/structure that is considered inappropriate or not (yet) suitable for the category of language learners involved. In addition to pedagogical goals, these corpora can also be used within NLP as datasets to train machine learning algorithms.
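As an illustration of the (de)selection use described above, the following is a minimal Python sketch, not part of the resource itself. The record layout, the `AnnotatedSentence` class, the `select_for_learners` function, and the category strings are all hypothetical: they paraphrase the annotation scheme described on this page, and the actual file format and label names in the corpora may differ.

```python
from dataclasses import dataclass

# Hypothetical category names paraphrasing the annotation scheme described
# on this page; the label strings used in the actual corpora may differ.
PROBLEM_CATEGORIES = frozenset({
    "offensive",
    "vulgar",
    "sensitive",
    "grammar_spelling",
    "incomprehensible_no_context",
})


@dataclass(frozen=True)
class AnnotatedSentence:
    """One sentence with its (assumed) annotation fields."""
    text: str
    problematic: bool
    categories: frozenset = frozenset()


def select_for_learners(sentences, excluded=PROBLEM_CATEGORIES):
    """Keep non-problematic sentences, plus problematic ones whose
    categories are all outside the excluded set."""
    excluded = frozenset(excluded)
    kept = []
    for s in sentences:
        if not s.problematic:
            kept.append(s)
        elif s.categories and s.categories.isdisjoint(excluded):
            kept.append(s)
    return kept


# Invented example sentences, for illustration only.
sample = [
    AnnotatedSentence("The cat sleeps on the sofa.", False),
    AnnotatedSentence("He go to school yesterday.", True,
                      frozenset({"grammar_spelling"})),
    AnnotatedSentence("And then it did.", True,
                      frozenset({"incomprehensible_no_context"})),
]

strict = select_for_learners(sample)
lenient = select_for_learners(sample, excluded={"incomprehensible_no_context"})
```

With the default settings only the non-problematic sentence is kept. Passing a narrower `excluded` set lets, for example, a teacher of advanced learners retain sentences with grammar or spelling problems for error-correction exercises while still dropping sentences that lack context.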
This project received funding from the CLARIN Resource Families Project (CRF), the Portuguese national funding agency, FCT – Foundation for Science and Technology, I.P. (grant number UIDP/04887/2020), and the Slovenian Research Agency, research core funding No. P6-0411, Language Resources and Technologies for Slovene. The project also acknowledges the support of the Dutch Language Institute, the Institute of the Estonian Language through the Estonian Research Council grant (PRG 1978), and Ruppin Academic Center.
Arhar Holdt, Špela; Kosem, Iztok (2023). Manually annotated corpora for teaching and learning purposes of Slovene.
Koppel, Kristina (2023). Manually annotated corpora for teaching and learning purposes of Estonian.
Tiberius, Carole (2023). Manually annotated corpora for teaching and learning purposes of Dutch.
Zingano Kuhn, Tanara (2023). Manually annotated corpora for teaching and learning purposes of Brazilian Portuguese.
Disclaimer:
This resource contains content that may be considered offensive or sensitive, such as racism, sexism, homophobia, pornography, death, illness or otherwise upsetting material. This content does not reflect the opinions and beliefs of the researchers involved in this project or their institutions.