Data Augmentation for Transformer-based G2P

Zach Ryan, Mans Hulden


Abstract
The Transformer model has been shown to outperform other neural seq2seq models in several character-level tasks. It is unclear, however, whether the Transformer benefits as much as other seq2seq models from data augmentation strategies in the low-resource setting. In this paper we explore strategies for data augmentation in the G2P (grapheme-to-phoneme) task together with the Transformer model. Our results show that a relatively simple alignment-based strategy, identifying consistent input-output subsequences in grapheme-phoneme data and then splicing such pieces together to generate hallucinated training data, works well in the low-resource setting, often delivering substantial performance improvements over a standard Transformer model.
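
The abstract's core idea can be sketched in a few lines of Python: given character-level alignments between grapheme and phoneme sequences, pool the attested aligned chunks and splice them into new, hallucinated training pairs. This is a minimal illustration under stated assumptions, not the authors' exact procedure; the toy ALIGNED_LEXICON, the uniform chunk-sampling scheme, and all helper names are hypothetical, and a real pipeline would first run a monotonic character aligner over the training lexicon to produce the chunked entries.

import random

# Toy aligned lexicon. Each entry is one word, represented as a list of
# (grapheme chunk, phoneme chunk) pairs such as a monotonic character
# aligner might produce. The data here is made up for illustration.
ALIGNED_LEXICON = [
    [("sh", "SH"), ("i", "IH"), ("p", "P")],   # "ship"
    [("ch", "CH"), ("a", "AE"), ("t", "T")],   # "chat"
    [("th", "TH"), ("i", "IH"), ("n", "N")],   # "thin"
]

def chunk_pool(lexicon):
    """Pool every aligned (grapheme, phoneme) chunk attested in the data."""
    return [chunk for entry in lexicon for chunk in entry]

def hallucinate(lexicon, n_new, seed=0):
    """Generate n_new synthetic word/pronunciation pairs: pick an existing
    alignment as a length template, then fill each slot with a randomly
    chosen attested chunk, keeping grapheme and phoneme sides in sync."""
    rng = random.Random(seed)
    pool = chunk_pool(lexicon)
    pairs = []
    for _ in range(n_new):
        template = rng.choice(lexicon)
        spliced = [rng.choice(pool) for _ in template]
        word = "".join(g for g, _ in spliced)
        pron = " ".join(p for _, p in spliced)
        pairs.append((word, pron))
    return pairs

for word, pron in hallucinate(ALIGNED_LEXICON, 5):
    print(word, "->", pron)

The hallucinated strings are generally not real words, but because every grapheme chunk keeps the phoneme mapping it was observed with, each synthetic pair still exposes the model to valid subword correspondences, which is plausibly where the low-resource gains reported in the paper come from.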
Anthology ID:
2020.sigmorphon-1.21
Volume:
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
Month:
July
Year:
2020
Address:
Online
Venues:
ACL | SIGMORPHON | WS
SIG:
SIGMORPHON
Publisher:
Association for Computational Linguistics
Pages:
184–188
URL:
https://www.aclweb.org/anthology/2020.sigmorphon-1.21
PDF:
https://www.aclweb.org/anthology/2020.sigmorphon-1.21.pdf
