Massively Multilingual Pronunciation Modeling with WikiPron
Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, Kyle Gorman
Abstract
We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced in scaling this tool to create an automatically-generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.
- Anthology ID:
- 2020.lrec-1.521
- Volume:
- Proceedings of The 12th Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 4223–4228
- URL:
- https://www.aclweb.org/anthology/2020.lrec-1.521
- PDF:
- https://www.aclweb.org/anthology/2020.lrec-1.521.pdf