ExtraGLUE-instruct

Handle:	https://hdl.handle.net/21.11129/0000-0011-A087-F (persistent URL to this page)
URL:	https://huggingface.co/datasets/PORTULAN/extraglue-instruct

ExtraGLUE-instruct is a data set that combines task examples, instructions, and prompts that integrate instructions with examples, for both the European variant of Portuguese, spoken in Portugal, and the American variant of Portuguese, spoken in Brazil. For each variant, it contains over 170,000 examples totaling over 68 million tokens.

It is based on eight of the tasks in the Portuguese extraGLUE data set (also available from this repository), which together cover different aspects of language understanding:

Similarity:
- STS-B (Semantic Textual Similarity Benchmark): A data set of sentence pairs annotated with a 0-5 score indicating the semantic similarity between the two sentences.
- MRPC (Microsoft Research Paraphrase Corpus): A data set of sentence pairs, annotated as to whether they are paraphrases of each other.

Inference:
- RTE (Recognizing Textual Entailment): A data set of sentence pairs, annotated as to whether one (the premise) entails the other (the hypothesis).
- WNLI (Winograd Natural Language Inference): A data set of sentence pairs where the first sentence contains a pronoun whose referent must be correctly resolved in order to determine whether the first sentence entails the second sentence.
- CB (CommitmentBank): A data set of excerpt-clause pairs, where the clause has been extracted from the excerpt. Each pair is classified as to whether the excerpt implies, contradicts, or is neutral in relation to the clause.

Question answering:
- BoolQ (Boolean Questions): A data set of text excerpts and questions with yes/no answers.
- MultiRC (Multi-Sentence Reading Comprehension): A data set where each instance consists of a context paragraph, a question about that paragraph, and an answer, labeled as to whether the answer is true or false. For a given context paragraph there may be multiple questions, and for each question there may be multiple answers, some true and some false.
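
The nested MultiRC structure (one paragraph, several questions, several candidate answers per question) can be flattened into one labeled instance per answer. The sketch below is only an illustration of that shape: the class name, field names, and the `flatten_multirc` helper are hypothetical and are not taken from the data set's actual schema.

```python
from dataclasses import dataclass


@dataclass
class MultiRCInstance:
    # Hypothetical field names; the real data set may use different ones.
    paragraph: str
    question: str
    answer: str
    label: bool  # True if the answer is correct for the question


def flatten_multirc(paragraph, questions):
    """Turn one paragraph with nested questions into flat instances.

    `questions` maps each question string to a list of
    (answer, is_true) pairs.
    """
    return [
        MultiRCInstance(paragraph, question, answer, is_true)
        for question, answers in questions.items()
        for answer, is_true in answers
    ]
```

Each resulting instance repeats the paragraph and question, so a paragraph with two questions and three candidate answers each yields six instances.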

Reasoning:
- COPA (Choice of Plausible Alternatives): A data set where each instance contains a premise, two alternative sentences, and an indication of whether the cause or the effect is sought. The task consists of indicating which of the two alternative sentences is the cause or the effect of the premise.
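
As a rough illustration of a prompt that integrates an instruction with an example, the sketch below assembles a COPA-style prompt. The template wording, the parameter names, and the `build_copa_prompt` function are assumptions made for illustration only; they are not the templates or field names used in ExtraGLUE-instruct, whose prompts are in Portuguese.

```python
def build_copa_prompt(instruction, premise, alternative_1, alternative_2, relation):
    """Combine an instruction and one COPA-style example into a prompt.

    `relation` is the cause/effect indication and must be either
    "cause" or "effect". The layout here is a hypothetical template.
    """
    if relation not in ("cause", "effect"):
        raise ValueError(f"relation must be 'cause' or 'effect', got {relation!r}")
    lines = [
        instruction,
        f"Premise: {premise}",
        f"Alternative 1: {alternative_1}",
        f"Alternative 2: {alternative_2}",
        f"Which alternative is the {relation} of the premise?",
    ]
    return "\n".join(lines)
```

Keeping the instruction as a separate argument allows the same example to be paired with different instruction phrasings.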