Code-switched English-Spanish Tweets
Handle: | https://hdl.handle.net/21.11129/0000-000D-FEA1-F (persistent URL to this page) |
---|---|
URL: | http://lrec2018.lrec-conf.org/sharedlrs2018/92_res_2.zip |
This package contains the collection of tweets described in the LREC 2018 paper "Collecting Code-Switched Data from Social Media" by Gideon Mendels, Victor Soto, Aaron Jaech and Julia Hirschberg. Please cite this paper if you use this resource.

The tagged_tweets_ids file contains the IDs of the 8,285 tweets for which we crowdsourced language tags. These tweets were collected using Babler (https://github.com/gidim/Babler/blob/master/README.md) and the anchor wordlists described in the paper, which are available at http://www.cs.columbia.edu/~vsoto/files/anchor_wordlists.zip

The tagged_tweets_labels file contains the crowdsourced language tags for each token in the collection of 8,285 tweets. The file has one line per token, and each line contains a tweet ID, a token index and a language tag. The language tag values are the following (see the paper for a more thorough explanation): lang1 = English, lang2 = Spanish, ne = Named Entity, unk = Unknown, fw = Foreign Word, ambiguous, mixed and other.
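The per-token label format described above can be read with a short script. The following is a minimal sketch, not part of the package: it assumes the three fields on each line are separated by whitespace and appear in the order tweet ID, token index, language tag (the actual delimiter is not stated here and may differ). The function names `parse_labels` and `is_code_switched` and the sample rows are hypothetical and used only for illustration.

```python
from collections import defaultdict


def parse_labels(lines):
    """Group per-token language tags by tweet ID.

    Assumption: each line holds three whitespace-separated fields
    (tweet ID, token index, language tag). Adjust the split if the
    real file uses a different delimiter.
    """
    by_tweet = defaultdict(dict)
    for line in lines:
        parts = line.split()
        # Skip blank, malformed, or header-like lines.
        if len(parts) != 3 or not parts[1].isdigit():
            continue
        tweet_id, token_index, tag = parts
        by_tweet[tweet_id][int(token_index)] = tag
    # For each tweet, return its tags ordered by token index.
    return {tid: [tags[i] for i in sorted(tags)] for tid, tags in by_tweet.items()}


def is_code_switched(tags):
    """True if a tweet has at least one English (lang1) and one Spanish (lang2) token."""
    return "lang1" in tags and "lang2" in tags


# Made-up sample rows for illustration only; not taken from the dataset.
sample = [
    "1001 0 lang1",
    "1001 1 lang2",
    "1001 2 ne",
    "1002 0 lang2",
    "1002 1 lang2",
]
tweets = parse_labels(sample)
switched = [tid for tid, tags in tweets.items() if is_code_switched(tags)]
```

To run this on the real data, pass an open file handle for tagged_tweets_labels to `parse_labels` in place of `sample`. Counting a tweet as code-switched only when it contains both lang1 and lang2 tokens is a simplification chosen for this sketch; the paper may apply a different criterion.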
- EUIPO - list of goods and services Spanish and English (Processed)
- Spanish-English website parallel corpus (Processed)
- English-Swedish parallel corpus from the web site of the Swedish Migration Board - Migrationsverket (Processed)
- Bilingual documents Bulgarian-English in the field of transport (Processed)