XGLUE benchmark dataset

XGLUE

Handle:	https://hdl.handle.net/21.11129/0000-000E-631D-3 (persistent URL to this page)
URL:	https://microsoft.github.io/XGLUE/

XGLUE is a new benchmark dataset to evaluate the performance of cross-lingual pre-trained models with respect to cross-lingual natural language understanding and generation.

XGLUE is composed of 11 tasks spanning 19 languages. For each task, training data is available only in English. To succeed at XGLUE, a model must therefore have strong zero-shot cross-lingual transfer capability: it has to learn from the English data of a specific task and transfer what it learned to other languages.
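
The zero-shot setup described above can be illustrated with a minimal sketch in Python. It is not the official XGLUE evaluation code. The function names, the majority-label baseline standing in for a pre-trained model, the toy examples and the unweighted average over languages are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]            # (text, label)
Predictor = Callable[[str], str]     # maps a text to a predicted label


def train_majority_baseline(english_train: List[Example]) -> Predictor:
    """Hypothetical stand-in for fine-tuning a cross-lingual model.

    It only looks at English examples and always predicts the most
    frequent training label, whatever the input language is.
    """
    majority_label = Counter(label for _, label in english_train).most_common(1)[0][0]
    return lambda text: majority_label


def accuracy(predict: Predictor, examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("cannot score an empty test set")
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def zero_shot_evaluate(
    train: Callable[[List[Example]], Predictor],
    english_train: List[Example],
    test_sets: Dict[str, List[Example]],
) -> Tuple[Dict[str, float], float]:
    """Train on English only, then score each target language separately."""
    predict = train(english_train)   # no target-language training data is used
    per_language = {lang: accuracy(predict, ex) for lang, ex in test_sets.items()}
    # Assumption: report the unweighted mean over languages as a summary score.
    average = sum(per_language.values()) / len(per_language)
    return per_language, average


# Toy data, invented for this sketch (not taken from XGLUE).
english_train = [("good movie", "pos"), ("great food", "pos"), ("bad service", "neg")]
test_sets = {
    "de": [("guter Film", "pos"), ("schlechter Service", "neg")],
    "fr": [("bon film", "pos"), ("très bon", "pos")],
}

per_language, average = zero_shot_evaluate(train_majority_baseline, english_train, test_sets)
print(per_language, average)
```

In a real experiment, the baseline trainer would be replaced by fine-tuning a cross-lingual pre-trained model on the English training split of one task, and the accuracy function by the metric appropriate to that task.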

XGLUE has two main characteristics. First, it covers cross-lingual NLU and cross-lingual NLG tasks at the same time. Second, besides 5 existing cross-lingual tasks (NER, POS, MLQA, PAWS-X and XNLI), it adds 6 new tasks drawn from Bing scenarios: News Classification (NC), Query-Ad Matching (QADSM), Web Page Ranking (WPR), QA Matching (QAM), Question Generation (QG) and News Title Generation (NTG). This diversity of languages, tasks and task origins provides a comprehensive benchmark for quantifying the quality of a pre-trained model on cross-lingual natural language understanding and generation.

Distribution

Licence

Under Negotiation

Restrictions: Academic - Non Commercial Use

Contact Person

Yaobo Liang

text

Multilingual text corpus

Languages

English
Spanish; Castilian
Greek, Modern (1453-)
Hindi
French
Arabic
German
Bulgarian
Dutch; Flemish
Italian
Portuguese
Polish
Swahili
Russian
Turkish
Thai
Vietnamese
Urdu
Chinese

Linguality

Linguality type: Multilingual

Multi-linguality type: Parallel

Size

Metadata

Created: 07/09/2021

Last Updated: 07/13/2021

Metadata Creator

João Ricardo Silva

http://nlx-server.di.fc.ul.pt/~jsilva/

University of Lisbon, Faculty of Sciences

FCUL