An Assessment of Language Identification Methods on Tweets and Wikipedia Articles

Pedro Vernetti, Larissa Freitas
Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 58-60
Association for Computational Linguistics, Seattle, USA, July 2020
Conference publication (vernetti-freitas-2020-assessment)

Language identification is the task of determining the language in which a given text is written. This task is important for Natural Language Processing and Information Retrieval activities. Two popular approaches to language identification are the N-gram and stopword models. In this paper, these two models were tested on different types of documents: short, irregular texts (tweets) and long, regular texts (Wikipedia articles).
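
The abstract names the two approaches without describing how they work. As a rough illustration only, and not the authors' implementation, the sketch below shows one common way each idea can be realized: counting stopword hits per language, and comparing character trigram counts against per-language profiles. The stopword samples, the function names, and the overlap score are all assumptions made for this example.

```python
from collections import Counter

# Tiny illustrative stopword samples (assumption: practical systems
# use much larger, curated lists for many more languages).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "a", "that"},
    "pt": {"o", "a", "e", "de", "que", "em", "um", "para", "é"},
    "es": {"el", "la", "y", "de", "que", "en", "un", "para", "es"},
}


def identify_by_stopwords(text):
    """Return the language whose stopwords occur most often, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def char_ngrams(text, n=3):
    """Count character n-grams over whitespace-normalized, padded text."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def identify_by_ngrams(text, profiles, n=3):
    """Return the language whose n-gram profile overlaps most with the text.

    `profiles` maps a language code to a Counter built with char_ngrams
    from training text in that language (hypothetical setup).
    """
    query = char_ngrams(text, n)

    def overlap(profile):
        # Shared n-gram mass: sum of the smaller count for each n-gram.
        return sum(min(count, profile[gram]) for gram, count in query.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))
```

With these sample lists, `identify_by_stopwords("the cat is in the house")` returns `"en"`. The stopword model depends on function words actually appearing in the input, which very short texts may not contain, whereas the n-gram model can still score a text that has no stopwords at all, at the cost of needing training text to build each profile.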