LX-Rare Word Similarity Dataset
Handle: https://hdl.handle.net/21.11129/0000-000B-D38D-8 (persistent URL to this page)
The LX-Rare Word Similarity Dataset was created from the Stanford Rare Word (RW) Similarity dataset (Luong et al., 2013). It contains 2 034 words (1 017 word pairs). All the words were extracted from Wikipedia and from WordNet (Miller, 1995), a lexical database in which concepts are grouped into sets of synonyms.
The list was constructed in two steps: a) first, a list of rare words was selected from Wikipedia; b) then, each rare word was paired with a related word picked from WordNet. Rare words are words that have between 5 000 and 10 000 occurrences in Wikipedia.
The result is a set of word pairs in which one word is rare and the other, which may or may not be rare, is related to the first by some WordNet relation: it can be a hyponym, hypernym, meronym, holonym or attribute of the former.
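The two-step procedure described above can be sketched roughly as follows. This is only an illustration and not the code used to build the dataset: the function names, the toy counts and the toy relation table are all hypothetical, a plain dictionary stands in for WordNet, and treating the 5 000 and 10 000 bounds as inclusive is an assumption.

```python
from typing import Dict, List, Tuple

# Frequency window for "rare" words, as given in the description.
# Whether the bounds are inclusive is an assumption of this sketch.
MIN_COUNT = 5_000
MAX_COUNT = 10_000


def select_rare_words(counts: Dict[str, int]) -> List[str]:
    """Step a): keep words whose corpus count falls inside the window."""
    return sorted(w for w, c in counts.items() if MIN_COUNT <= c <= MAX_COUNT)


def build_pairs(
    counts: Dict[str, int],
    related: Dict[str, List[Tuple[str, str]]],
) -> List[Tuple[str, str, str]]:
    """Step b): pair each rare word with its first listed related word.

    `related` maps a word to (relation, other_word) tuples and is a
    hypothetical stand-in for a WordNet lookup.
    """
    pairs = []
    for word in select_rare_words(counts):
        candidates = related.get(word, [])
        if candidates:
            relation, other = candidates[0]
            pairs.append((word, other, relation))
    return pairs


# Toy data invented for illustration only (not taken from the dataset).
toy_counts = {
    "skiff": 7_200,
    "boat": 950_000,
    "petiole": 5_400,
    "the": 90_000_000,
}
toy_related = {
    "skiff": [("hypernym", "boat")],
    "petiole": [("holonym", "leaf")],
}

pairs = build_pairs(toy_counts, toy_related)
for rare, other, relation in pairs:
    print(f"{rare}\t{other}\t{relation}")
```

In this sketch the second word of a pair is not filtered by frequency, which mirrors the statement that the related word can be rare or not.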
You may also be interested in the other resources for the evaluation of distributional semantic models of Portuguese that are also available from this repository: LX-SimLex-999, LX-WordSim-353, LX-ESSLLI 2008, LX-Battig, LX-AP, LX-4WAnalogies and LX-4WAnalogiesBR.