An Assessment of Language Identification Methods on Tweets and Wikipedia Articles
Pedro Vernetti, Larissa Freitas
Abstract
Language identification is the task of determining the language in which a given text is written. This task is important for Natural Language Processing and Information Retrieval activities. Two popular approaches to language identification are the N-gram and stopword models. In this paper, these two models were tested on different types of documents, such as short, irregular texts (tweets) and long, regular texts (Wikipedia articles).
- Anthology ID:
- 2020.winlp-1.15
- Volume:
- Proceedings of the Fourth Widening Natural Language Processing Workshop
- Month:
- July
- Year:
- 2020
- Address:
- Seattle, USA
- Venues:
- ACL | WS | WiNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 58–60