An Evaluation of Subword Segmentation Strategies for Neural Machine Translation of Morphologically Rich Languages

Aquia Richburg; Ramy Eskander; Smaranda Muresan; Marine Carpuat

Abstract

Byte-Pair Encoding (BPE) (Sennrich et al., 2016) has become a standard pre-processing step when building neural machine translation systems. However, it is not clear whether this is an optimal strategy in all settings. We conduct a controlled comparison of subword segmentation strategies for translating two low-resource morphologically rich languages (Swahili and Turkish) into English. We show that segmentations based on a unigram language model (Kudo, 2018) yield BLEU scores comparable to those of BPE and better recall when translating rare source words.

Anthology ID: 2020.winlp-1.40
Volume: Proceedings of the Fourth Widening Natural Language Processing Workshop
Month: July
Year: 2020
Address: Seattle, USA
Venues: ACL | WS | WiNLP
Publisher: Association for Computational Linguistics
Pages: 151–155
