The Wixarika-Spanish Parallel Corpus

Handle:	https://hdl.handle.net/21.11129/0000-000D-FA88-0 (persistent URL to this page)
URL:	https://opencor.gitlab.io/corpora/mager18wixarika/

Wixarika is an indigenous language spoken in central-west Mexico by approximately fifty thousand people. Digital resources for indigenous languages like Wixarika are generally scarce, since native speakers do not necessarily leave a digital footprint on public forums.

The lack of resources is even more noticeable for NLP-related tasks. The corpus presented here aims to be the seed of a larger future effort to close this gap, especially for data-driven machine translation (MT). Since our collection has only 8,967 parallel phrases, it can be considered a low-resource corpus. This could be limiting for certain research purposes. Nevertheless, it is an interesting case for several reasons:
– Wixarika has inherent linguistic properties that make it interesting to study for the sake of understanding the inner workings of languages.
– Low-resource scenarios offer an opportunity to imagine and create new tools for the transfer or exploitation of knowledge from other languages.
– It requires defining new methodologies for collecting corpora within native speaker communities.

Wixarika belongs to the Coracholan subgroup of the Uto-Aztecan language family. It has a subject-object-verb (SOV) structure, and its morphological typology is polysynthetic. This means that it has a high morpheme-to-word ratio and, consequently, a large overall number of distinct words. This allows a great amount of information to be incorporated at the morphological level. Native speakers use 18 symbols, Σwixarika = {a,e,h,i,+,k,m,n,p,r,t,s,u,w,x,y,’}, of which five denote vowels: {a,e,i,u,+}, with long and short variants. Although most linguists prefer a barred i (ɨ) to denote the fourth vowel, in practice native speakers use a plus symbol (+). This corpus uses the latter in its orthographic transcription of Wixarika.
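The alphabet described above lends itself to a simple orthography check, for example to spot characters that do not belong in the Wixarika side of a sentence pair. The following is a minimal sketch in Python, using only the symbols exactly as listed in the set above; the function names are illustrative and are not part of the corpus distribution.

```python
# Symbols exactly as listed in the description; "+" stands for the vowel
# that linguists usually write with a barred i.
ALPHABET = set("aehi+kmnprtsuwxy’")
VOWELS = set("aeiu+")


def unknown_symbols(word):
    """Return the sorted characters of `word` that are not in the listed alphabet."""
    return sorted(set(word.lower()) - ALPHABET)


def count_vowels(word):
    """Count vowel symbols in `word`, treating "+" as a vowel."""
    return sum(ch in VOWELS for ch in word.lower())


if __name__ == "__main__":
    word = "nep+ka’ukats+k+"
    print(unknown_symbols(word))     # characters outside the listed alphabet
    print(count_vowels(word))        # number of vowel symbols
    print(unknown_symbols("perro"))  # a Spanish word, for contrast
```

A check like this treats "+" as an ordinary letter, which matters for any tokenizer that would otherwise split on punctuation.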

To illustrate the large amount of information contained in a single Wixarika word, let us analyze nep+ka’ukats+k+, which means “I don’t have a dog”. This word is composed of the morphs ne|p+|ka|’u|ka|ts+k+. Although the word is a verb, its polysynthetic nature makes it a full sentence: ts+k+ is the stem and means “dog”, ne is a first person possessive, ka marks negation, ’u refers to a visual object, and the second ka is the second part of the negation.
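The segmentation of this example can be written out as a small data structure. The sketch below, in Python, splits the "|"-delimited form into morphs and pairs each with the gloss given in the text. The morph p+ is not glossed in the description, so it is left unglossed here rather than guessed. The helper name split_morphs is illustrative only.

```python
# Segmented form as given in the description, with "|" marking morph boundaries.
SEGMENTED = "ne|p+|ka|’u|ka|ts+k+"

# Glosses as stated in the text, in morph order. None marks a morph that the
# text does not gloss.
GLOSSES = [
    "first person possessive",
    None,
    "negation",
    "refers to a visual object",
    "second part of the negation",
    'stem, "dog"',
]


def split_morphs(segmented, boundary="|"):
    """Split a boundary-delimited word into its morphs."""
    return [morph.strip() for morph in segmented.split(boundary)]


morphs = split_morphs(SEGMENTED)
surface = "".join(morphs)  # the unsegmented word form

if __name__ == "__main__":
    for morph, gloss in zip(morphs, GLOSSES):
        print(f"{morph:<6} {gloss or '(not glossed in the text)'}")
    print(f"{surface}: {len(morphs)} morphs in one word")
```

Joining the morphs back together recovers the surface word, which is a convenient sanity check for any segmented data.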

Corpus
The corpus consists of a parallel collection of sentences taken from the classic fairy tales of Hans Christian Andersen and the Brothers Grimm. A native Wixarika speaker fluent in Spanish carefully translated sentences from the tales. Although the corpus is small, it contains a large number of token types, owing to the rich morphology of the Wixarika language.
The Wixarika-Spanish parallel corpus is an effort to foster research in machine translation for this language pair. Moreover, it can serve as a seed to promote further data collection for other indigenous languages. The main aim of creating such datasets is to feed data-driven MT systems.
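The remark about token types can be checked with a simple count once the sentences are loaded. The following Python sketch computes the number of tokens, the number of distinct types, and their ratio. It assumes whitespace tokenization and takes the sentences as a plain list of strings, since the file layout of the distribution is not described here. The function name and the toy input are hypothetical.

```python
from collections import Counter


def type_token_stats(sentences):
    """Return (tokens, types, type/token ratio) using whitespace tokenization."""
    counts = Counter(tok for sent in sentences for tok in sent.lower().split())
    n_tokens = sum(counts.values())
    n_types = len(counts)
    ratio = n_types / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, ratio


if __name__ == "__main__":
    # Toy placeholder input, not corpus data.
    toy = ["a b a", "b c"]
    print(type_token_stats(toy))
```

Running the same function on each side of the parallel data would show how many more distinct word forms the Wixarika side has than the Spanish side for the same content.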

Download

Distribution Licence

CC - BY - NC - SA

Restrictions: Academic - Non Commercial Use, Attribution, Share Alike

Download location: hidden

Distribution Access/Medium: Downloadable

User Nature: Academic

Contact Person

Manuel Mager

Mexico

text

Bilingual text corpus

Languages

Spanish; Castilian

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Size

11,562 Sentences

56,037 Tokens

Metadata

Created: 06/09/2020

Last Updated: 11/19/2020

Metadata Creator

Sara Grilo

University of Lisbon, Faculty of Sciences FCUL Sala 6.3.32, Edifício C6, Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa

1749-016 Lisboa

Campo Grande

Portugal

Documentation

Document Type: Article

Manuel Mager, Diónico Carrillo, Ivan Meza, Probabilistic finite-state morphological segmenter for the Wixarika (Huichol) language. In: Journal of Intelligent & Fuzzy Systems (Special Issue), 2018

Document Type: Article

Manuel Mager, Diónico Carrillo, Ivan Meza, The Wixarika-Spanish Parallel Corpus. In: Latin American and Iberian Languages Open Corpora Forum, Canela (Brazil), 2018
