The Wixarika-Spanish Parallel Corpus
Handle: | https://hdl.handle.net/21.11129/0000-000D-FA88-0 (persistent URL to this page) |
---|---|
URL: | https://opencor.gitlab.io/corpora/mager18wixarika/ |
Wixarika is an indigenous language spoken in west-central Mexico by approximately fifty thousand people. For indigenous languages like Wixarika, digital resources are generally scarce, since native speakers do not necessarily leave a digital footprint on public forums.
The lack of resources is even more noticeable for NLP-related tasks. The corpus presented here aims to be the seed of a larger future effort to close this gap, especially for data-driven machine translation (MT). With only 8,967 parallel phrases, the collection is a low-resource corpus, which may limit its use for certain research purposes. Even so, it is of interest for several reasons:
- Wixarika has inherent linguistic properties that make it interesting to study for the sake of understanding the inner workings of languages.
- Low-resource scenarios offer an opportunity to imagine and create new tools for transferring or exploiting knowledge from other languages.
- It requires defining new methodologies for collecting corpora within native speaker communities.
Wixarika belongs to the Coracholan subgroup of the Uto-Aztecan language family. It has a subject-object-verb (SOV) structure, and its morphological typology is polysynthetic. This means that it has a high morpheme-to-word ratio and, consequently, a large number of distinct word forms, since a great amount of information can be incorporated at the morphological level. Native speakers use 18 symbols Σwixarika = {a,e,h,i,+,k,m,n,p,r,t,s,u,w,x,y,’}, of which five denote vowels: {a,e,i,u,+}, with long and short variants. Although most linguists prefer a barred i (ɨ) to denote the fourth vowel, in practice native speakers use a plus symbol (+). This corpus uses the latter in its orthographic transcription of Wixarika.
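As a minimal sketch of how the alphabet above could be used when cleaning corpus text, the following Python snippet flags characters that fall outside the listed symbol set. The names `WIXARIKA_ALPHABET`, `VOWELS`, and `out_of_alphabet` are hypothetical and chosen for illustration; the snippet assumes the apostrophe is encoded as the typographic character ’ exactly as printed above, which may not match every file in the corpus.

```python
# Symbols exactly as listed in the text; "+" stands in for the barred-i vowel.
# Assumption: the apostrophe is the typographic "’" shown in the description.
WIXARIKA_ALPHABET = set("aehi+kmnprtsuwxy’")
VOWELS = set("aeiu+")


def out_of_alphabet(word):
    """Return the set of characters in `word` that are not in the listed alphabet."""
    return set(word.lower()) - WIXARIKA_ALPHABET


# A Wixarika word from the description uses only listed symbols.
print(out_of_alphabet("nep+ka’ukats+k+"))  # set()

# A Spanish word such as "perro" contains "o", which is not in the set.
print(out_of_alphabet("perro"))  # {'o'}
```

A check like this is only a rough filter: it can help spot Spanish loanwords, encoding variants of the apostrophe, or stray punctuation, but it says nothing about whether a string is a valid Wixarika word.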
To illustrate how much information a single Wixarika word can carry, consider nep+ka’ukats+k+, which means “I don’t have a dog”. This word is composed of the morphs ne|p+|ka|’u|ka|ts+k+. Although the word is a verb, its polysynthetic nature makes it a full sentence: ts+k+ is the stem and means “dog”, ne is a first-person possessive, the first ka marks negation, ’u refers to a visual object, and the second ka is the second part of the negation.
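The segmentation above can be represented programmatically. The sketch below, in Python, splits the pipe-delimited form into morphs and pairs each with the gloss given in the text. The variable names and the position-indexed `GLOSSES` table are assumptions made for illustration, not a format used by the corpus; the morph p+ is left unglossed because the description does not explain it.

```python
# Pipe-delimited segmentation as given in the description.
SEGMENTED = "ne|p+|ka|’u|ka|ts+k+"

# Glosses keyed by morph position, since "ka" occurs twice with different roles.
# Only glosses stated in the description are included.
GLOSSES = {
    0: "first-person possessive",
    2: "negation (first part)",
    3: "refers to a visual object",
    4: "negation (second part)",
    5: 'stem, "dog"',
}

morphs = SEGMENTED.split("|")

# Joining the morphs recovers the surface word.
surface = "".join(morphs)
print(surface)

for position, morph in enumerate(morphs):
    gloss = GLOSSES.get(position, "(not glossed in the description)")
    print(f"{morph:>6}  {gloss}")
```

Keying the glosses by position rather than by morph string avoids collapsing the two occurrences of ka into a single entry.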
Corpus
The corpus consists of a parallel collection of sentences drawn from the classic fairy tales of Hans Christian Andersen and the Brothers Grimm. A Wixarika native speaker fluent in Spanish carefully translated sentences from the tales. Although the corpus is small, it contains a large number of token types, owing to the rich morphology of the Wixarika language.
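The remark about token types can be checked by counting distinct word forms against total tokens. The following Python sketch computes both, plus their ratio, for a list of sentences. It assumes simple whitespace tokenization, which is not necessarily how the corpus authors tokenized; the function name `type_token_stats` and the toy Spanish sentences are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def type_token_stats(sentences):
    """Return (token count, type count, type/token ratio) for whitespace-split sentences."""
    tokens = [token for sentence in sentences for token in sentence.split()]
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    ratio = n_types / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, ratio


# Toy input for illustration only (not corpus data).
toy_sentences = ["el perro come", "el gato come"]
n_tokens, n_types, ratio = type_token_stats(toy_sentences)
print(f"tokens={n_tokens} types={n_types} ratio={ratio:.2f}")
```

Running the same function over each side of a parallel corpus gives a quick comparison: a morphologically rich language would be expected to show a higher type/token ratio than its less synthetic counterpart for the same content.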
The Wixarika-Spanish parallel corpus is an effort to encourage machine translation research for this language pair. It can also serve as a seed for further data collection in other indigenous languages. The main aim of creating such datasets is to feed data-driven MT systems.