Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data
Arra’Di Nur Rizal, Sara Stymne
Abstract
Code-mixed texts are abundant, especially in social media, and pose a problem for NLP tools, which are typically trained on monolingual corpora. In this paper, we explore and evaluate different types of word embeddings for Indonesian–English code-mixed text. We propose the use of code-mixed embeddings, i.e. embeddings trained on code-mixed text. Because large corpora of code-mixed text are required to train embeddings, we describe a method for synthesizing a code-mixed corpus, grounded in literature and a survey. Using sentiment analysis as a case study, we show that code-mixed embeddings trained on synthesized data are at least as good as cross-lingual embeddings and better than monolingual embeddings.
- Anthology ID:
- 2020.calcs-1.4
- Volume:
- Proceedings of the 4th Workshop on Computational Approaches to Code Switching
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Venues:
- CALCS | LREC | WS
- Publisher:
- European Language Resources Association
- Pages:
- 26–35
- URL:
- https://www.aclweb.org/anthology/2020.calcs-1.4
- PDF:
- https://www.aclweb.org/anthology/2020.calcs-1.4.pdf