https://hdl.handle.net/21.11129/0000-000B-D391-2 (persistent URL to this page)
LX-AP was created by translating the Almuhareb-Poesio (AP) benchmark (Almuhareb and Poesio, 2005) into Portuguese. The original data set was built taking three aspects into account: part of speech (POS), frequency and ambiguity.
It contains 402 nouns from 21 WordNet categories, with 13 to 21 nouns from each category. Examples of categories include feeling, game, time, tree, vehicle, chemical element and motivation (more examples are shown in Table 6).
Word frequency was estimated using the British National Corpus. Of the words in the data set, ⅓ have high frequency (1 000 occurrences or more), ⅓ have medium frequency (between 100 and 1 000 occurrences) and ⅓ have low frequency (5 to 100 occurrences).
The degree of ambiguity of each word was determined from the number of senses listed for it in WordNet. A word with four or more senses was considered highly ambiguous, a word with two or three senses was considered to have medium ambiguity, and a word with a single sense was considered unambiguous. Each level of frequency and ambiguity is equally represented in the set.
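The frequency and ambiguity bands described above can be sketched as two small functions. This is only an illustration under stated assumptions, not the code used to build the original data set: the function names and return labels are hypothetical, and the treatment of the boundary value 100 (assigned here to the medium band) and of values outside the stated ranges (returned as None) is an assumption, since the description does not specify them.

```python
from typing import Optional


def frequency_band(bnc_count: int) -> Optional[str]:
    """Map a British National Corpus occurrence count to a frequency band."""
    if bnc_count >= 1000:
        return "high"
    if bnc_count >= 100:
        # Assumption: a count of exactly 100 is treated as medium.
        return "medium"
    if bnc_count >= 5:
        return "low"
    # Assumption: counts below 5 fall outside the described ranges.
    return None


def ambiguity_level(wordnet_senses: int) -> Optional[str]:
    """Map a number of WordNet senses to an ambiguity level."""
    if wordnet_senses >= 4:
        return "high"
    if wordnet_senses >= 2:
        return "medium"
    if wordnet_senses == 1:
        return "none"
    # Assumption: a word with no WordNet sense is not classified.
    return None
```

For instance, a word with 250 occurrences and 3 senses would fall in the medium band on both dimensions under these assumptions.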
We are aware that a word that is frequent in English may be less frequent in Portuguese, and that a word that is ambiguous in English may be less ambiguous in Portuguese. Rather than only translating the original data set, it would be interesting to build a data set that is itself balanced for word frequency and ambiguity in Portuguese. As possible future work, analysing the frequency of the words against a large Portuguese data set and their ambiguity against the Portuguese Wordnet would improve this data set. However, because the lexicographic resources required for those tasks are not yet available, LX-AP consists of translations of the English words, resulting in a test set of the same size as the original.
The translation of this data set from English to Portuguese involved two annotators and a third person acting as adjudicator.
You may also be interested in the other resources for the evaluation of distributional semantic models of Portuguese that are also available from this repository: LX-SimLex-999, LX-Rare Word Similarity Dataset, LX-WordSim-353, LX-ESSLLI 2008, LX-Battig, LX-4WAnalogies and LX-4WAnalogiesBR.