Zero-shot North Korean to English Neural Machine Translation by Character Tokenization and Phoneme Decomposition

Authors: Hwichan Kim, Tosho Hirasawa, Mamoru Komachi
Published: July 2020, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 72-78
Publisher: Association for Computational Linguistics (Online)
Type: conference publication
Anthology ID: kim-etal-2020-zero
URL: https://www.aclweb.org/anthology/2020.acl-srw.11

Abstract: The primary limitation of North Korean to English translation is the lack of a parallel corpus; therefore, high translation accuracy cannot be achieved. To address this problem, we propose a zero-shot approach using South Korean data, which are remarkably similar to North Korean data. We train a neural machine translation model after tokenizing a South Korean text at the character level and decomposing characters into phonemes. We demonstrate that our method can effectively learn North Korean to English translation and improve the BLEU scores by +1.01 points in comparison with the baseline.
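
The two preprocessing steps named in the abstract, character-level tokenization and decomposition of characters into phonemes, can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that "phonemes" refers to the jamo (consonant and vowel letters) that make up each Hangul syllable block, and it relies on the standard Unicode arithmetic for precomposed Hangul syllables. The function names and the "▁" word-boundary marker are hypothetical choices made for this illustration.

```python
# Illustrative sketch only; not the preprocessing code used in the paper.
# Constants follow the Unicode layout of precomposed Hangul syllables.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 precomposed syllables

SPACE = "▁"  # hypothetical marker so word boundaries survive tokenization


def char_tokenize(sentence):
    """Split a sentence into single characters, marking spaces explicitly."""
    return [SPACE if ch == " " else ch for ch in sentence]


def decompose(ch):
    """Split one precomposed Hangul syllable into jamo; pass other characters through."""
    index = ord(ch) - S_BASE
    if not 0 <= index < S_COUNT:
        return [ch]
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, tail = divmod(rest, T_COUNT)
    jamo = [chr(L_BASE + lead), chr(V_BASE + vowel)]
    if tail:  # tail index 0 means the syllable has no final consonant
        jamo.append(chr(T_BASE + tail))
    return jamo


def to_phonemes(sentence):
    """Character-tokenize a sentence, then decompose each character into jamo."""
    return [j for ch in char_tokenize(sentence) for j in decompose(ch)]


if __name__ == "__main__":
    print(char_tokenize("한국 말"))
    print(to_phonemes("한국 말"))
```

Working at the jamo level shrinks the vocabulary to a few dozen symbols. One plausible reading of the abstract is that this lets North Korean and South Korean spellings that differ within a syllable block still share most of their tokens.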