Code-switched English-Spanish Tweets

Handle:	https://hdl.handle.net/21.11129/0000-000D-FEA1-F (persistent URL to this page)
URL:	http://lrec2018.lrec-conf.org/sharedlrs2018/92_res_2.zip

This package contains the collection of tweets described in the LREC 2018 paper "Collecting Code-Switched Data from Social Media" by Gideon Mendels, Victor Soto, Aaron Jaech and Julia Hirschberg. Please cite this paper if you use this resource.

The tagged_tweets_ids file contains the IDs of the 8,285 tweets for which we crowdsourced language tags. These tweets were collected using Babler (https://github.com/gidim/Babler/blob/master/README.md) and the anchor wordlists described in the paper, which are available at http://www.cs.columbia.edu/~vsoto/files/anchor_wordlists.zip

The tagged_tweets_labels file contains the crowdsourced language tags for each token in the collection of 8,285 tweets. The file has one line per token, and each line contains a tweet ID, a token index and a language tag. The language tag values are the following (see the paper for a more thorough explanation): lang1 = English, lang2 = Spanish, ne = Named Entity, unk = Unknown, fw = Foreign Word, ambiguous, mixed and other.
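The per-token layout of the tagged_tweets_labels file can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling. It assumes the three columns (tweet ID, token index, language tag) are separated by whitespace and that the token index is an integer; neither detail is stated in the description, so check the actual file first. The function names and the "contains both lang1 and lang2" test for code-switching are illustrative choices, not definitions from the paper.

```python
from collections import Counter, defaultdict

# Tag values listed in the resource description.
KNOWN_TAGS = {"lang1", "lang2", "ne", "unk", "fw", "ambiguous", "mixed", "other"}


def parse_labels(lines):
    """Group (token_index, tag) pairs by tweet ID.

    Assumption: each line holds three whitespace-separated fields,
    in the order tweet ID, token index, language tag.
    """
    tweets = defaultdict(list)
    for raw in lines:
        parts = raw.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        tweet_id, index, tag = parts
        tweets[tweet_id].append((int(index), tag))
    return dict(tweets)


def tag_counts(tweets):
    """Count how often each language tag occurs across all tokens."""
    return Counter(tag for tokens in tweets.values() for _, tag in tokens)


def is_code_switched(tokens):
    """Illustrative heuristic: a tweet has both English and Spanish tokens."""
    tags = {tag for _, tag in tokens}
    return "lang1" in tags and "lang2" in tags
```

To use it on the real data, pass an open file handle, for example `parse_labels(open(path, encoding="utf-8"))`, where `path` points to the extracted tagged_tweets_labels file (the exact file name and extension inside the archive are not given here). Comparing the observed tags against `KNOWN_TAGS` is a quick way to spot parsing mistakes.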

Distribution

Licence

Apache Licence 2.0

Restrictions: Other, Share Alike

Contact Person

Gideon Mendels

Columbia University

New York

USA

text

Bilingual text corpus

Languages

Spanish (Castilian), English

Linguality

Linguality type: Bilingual

Size

493 Kb

Metadata

Created: 10/08/2020

Last Updated: 11/09/2020

Metadata Creator

Andrea Teixeira Female

University of Lisbon

Campo Grande

1749-016 Lisboa

Lisboa

Portugal

Tel.: 00350 217 500 000

Faculty of Sciences of the University of Lisbon

Documentation

Document Type: In Proceedings

Gideon Mendels, Victor Soto, Aaron Jaech and Julia Hirschberg, Collecting Code-Switched Data from Social Media, 2018

Editor: Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis and Takenobu Tokunaga

Publisher: European Language Resources Association (ELRA)

Book Title: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

ISBN: 979-10-95546-00-9
