Code-switched English-Spanish Tweets

This package contains the collection of tweets described in the paper "Collecting Code-Switched Data from Social Media", Gideon Mendels, Victor Soto, Aaron Jaech and Julia Hirschberg, LREC 2018. Please cite this paper if you use this resource.

The tagged_tweets_ids file contains the IDs of the 8,285 tweets for which we crowdsourced language tags. These tweets were collected using Babler (https://github.com/gidim/Babler/blob/master/README.md) and the anchor wordlists described in the paper, which are available at http://www.cs.columbia.edu/~vsoto/files/anchor_wordlists.zip

The tagged_tweets_labels file contains the crowdsourced language tag for each token in the collection of 8,285 tweets. The file has one line per token, and each line contains a tweet ID, a token index and a language tag. The language tag values are the following (see the paper for a more thorough explanation): lang1 = English, lang2 = Spanish, ne = Named Entity, unk = Unknown, fw = Foreign Word, ambiguous, mixed and other.
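As an illustration of the tagged_tweets_labels format, the following minimal Python sketch groups token tags by tweet ID and flags tweets that contain both lang1 and lang2 tokens. It assumes the three fields on each line are separated by whitespace, which the description above does not specify. The function names parse_labels and is_code_switched and the sample lines are hypothetical and are not taken from the actual data.

```python
from collections import defaultdict


def parse_labels(lines):
    """Group (token_index, tag) pairs by tweet ID.

    Assumes each line holds three whitespace-separated fields:
    tweet ID, token index, language tag. The real delimiter may differ.
    """
    tweets = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        tweet_id, token_index, tag = parts
        tweets[tweet_id].append((int(token_index), tag))
    for tokens in tweets.values():
        tokens.sort()  # order tokens by their index within the tweet
    return dict(tweets)


def is_code_switched(tokens):
    """Return True if the tweet has both English and Spanish tokens."""
    tags = {tag for _, tag in tokens}
    return "lang1" in tags and "lang2" in tags


# Made-up sample lines for illustration only.
sample = [
    "1001 0 lang1",
    "1001 1 lang2",
    "1001 2 ne",
    "1002 0 lang1",
    "1002 1 lang1",
]
tweets = parse_labels(sample)
switched = [tid for tid, toks in tweets.items() if is_code_switched(toks)]
```

To apply this to the distributed file, the same function can be given an open file handle, for example `with open("tagged_tweets_labels") as f: tweets = parse_labels(f)`, adjusting the file name or extension to match the downloaded package.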
