Luke Gessler


2020

pdf bib
A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization
Graham Neubig | Shruti Rijhwani | Alexis Palmer | Jordan MacKenzie | Hilaria Cruz | Xinjian Li | Matthew Lee | Aditi Chaudhary | Luke Gessler | Steven Abney | Shirley Anugrah Hayati | Antonios Anastasopoulos | Olga Zamaraeva | Emily Prud’hommeaux | Jennette Child | Sara Child | Rebecca Knowles | Sarah Moeller | Jeffrey Micher | Yiyuan Li | Sydney Zink | Mengzhou Xia | Roshan S Sharma | Patrick Littell
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh, PA, USA to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. The workshop focused on developing technologies to aid language documentation and revitalization in four areas: 1) spoken language (speech transcription, phone to orthography decoding, text-to-speech and text-speech forced alignment), 2) dictionary extraction and management, 3) search tools for corpora, and 4) social media (language learning bots and social media analysis). This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw’ida, Kwak’wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.

pdf bib
Supervised Grapheme-to-Phoneme Conversion of Orthographic Schwas in Hindi and Punjabi
Aryaman Arora | Luke Gessler | Nathan Schneider
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Hindi grapheme-to-phoneme (G2P) conversion is mostly trivial, with one exception: whether a schwa represented in the orthography is pronounced or unpronounced (deleted). Previous work has attempted to predict schwa deletion in a rule-based fashion using prosodic or phonetic analysis. We present the first statistical schwa deletion classifier for Hindi, which relies solely on the orthography as the input and outperforms previous approaches. We trained our model on a newly-compiled pronunciation lexicon extracted from various online dictionaries. Our best Hindi model achieves state of the art performance, and also achieves good performance on a closely related language, Punjabi, without modification.

pdf bib
AMALGUM – A Free, Balanced, Multilayer English Web Corpus
Luke Gessler | Siyao Peng | Yang Liu | Yilun Zhu | Shabnam Behzad | Amir Zeldes
Proceedings of The 12th Language Resources and Evaluation Conference

We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a “better than NLP” benchmark and evaluate the accuracy of the resulting resource.