Sebastin Santy


2020

The State and Fate of Linguistic Diversity and Inclusion in the NLP World
Pratik Joshi | Sebastin Santy | Amar Budhiraja | Kalika Bali | Monojit Choudhury
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Language technologies contribute to promoting multilingualism and linguistic diversity around the world. However, only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications. In this paper we look at the relation between the types of languages, resources, and their representation in NLP conferences to understand the trajectory that different languages have followed over time. Our quantitative investigation underlines the disparity between languages, especially in terms of their resources, and calls into question the “language agnostic” status of current models and systems. Through this paper, we attempt to convince the ACL community to prioritise the resolution of the predicaments highlighted here, so that no language is left behind.

Learnings from Technological Interventions in a Low Resource Language: A Case-Study on Gondi
Devansh Mehta | Sebastin Santy | Ramaravind Kommiya Mothilal | Brij Mohan Lal Srivastava | Alok Sharma | Anurag Shukla | Vishnu Prasad | Venkanna U | Amit Sharma | Kalika Bali
Proceedings of The 12th Language Resources and Evaluation Conference

The primary obstacle to developing technologies for low-resource languages is the lack of usable data. In this paper, we report the adaptation and deployment of four technology-driven methods of data collection for Gondi, a low-resource, vulnerable language spoken by around 2.3 million tribal people in south and central India. In the process of data collection, we also help in its revival by expanding access to information in Gondi through the creation of linguistic resources that can be used by the community, such as a dictionary, children’s stories, an app with Gondi content from multiple sources and an Interactive Voice Response (IVR) based mass awareness platform. At the end of these interventions, we collected a little less than 12,000 translated words and/or sentences and identified more than 650 community members whose help can be solicited for future translation efforts. The larger goal of the project is collecting enough data in Gondi to build and deploy viable language technologies like machine translation and speech-to-text systems that can help take the language onto the internet.