Sara Szoc
2020
Being Generous with Sub-Words towards Small NMT Children
Arne Defauw | Tom Vanallemeersch | Koen Van Winckel | Sara Szoc | Joachim Van den Bogaert
Proceedings of The 12th Language Resources and Evaluation Conference
In the context of under-resourced neural machine translation (NMT), transfer learning from an NMT model trained on a high-resource language pair, or from a multilingual NMT (M-NMT) model, has been shown to substantially boost performance. In this paper, we focus on so-called cold start transfer learning from an M-NMT model, which means that the parent model is not trained on any of the child data. Such a set-up enables quick adaptation of M-NMT models to new languages. We investigate the effectiveness of cold start transfer learning from a many-to-many M-NMT model to an under-resourced child. We show that sufficiently large sub-word vocabularies should be used for transfer learning to be effective in such a scenario. When adopting relatively large sub-word vocabularies, we observe increases in performance thanks to transfer learning from a parent M-NMT model, both when translating to and from the under-resourced language. Our proposed approach involving dynamic vocabularies is both practical and effective. We report results on two under-resourced language pairs, i.e. Icelandic-English and Irish-English.
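To make the role of vocabulary overlap in cold start transfer concrete, the sketch below initialises a child model's embedding table from a parent M-NMT model: sub-words shared with the parent vocabulary keep their trained embeddings, while child-only sub-words are randomly initialised. This is a minimal illustration of why larger ("generous") sub-word vocabularies increase overlap and help transfer; it is not the paper's exact dynamic-vocabulary procedure, and all names and the toy data are invented for illustration.

```python
import numpy as np

def transfer_embeddings(parent_vocab, parent_emb, child_vocab, rng=None):
    """Initialise child embeddings from a parent M-NMT model.

    Sub-words shared between parent and child vocabularies keep their
    trained parent embedding; child-only sub-words get small random
    vectors. Illustrative sketch only, not the authors' exact method.
    """
    rng = rng or np.random.default_rng(0)
    dim = parent_emb.shape[1]
    parent_index = {tok: i for i, tok in enumerate(parent_vocab)}
    child_emb = rng.normal(0.0, 0.01, size=(len(child_vocab), dim))
    shared = 0
    for j, tok in enumerate(child_vocab):
        i = parent_index.get(tok)
        if i is not None:
            child_emb[j] = parent_emb[i]  # reuse trained parent row
            shared += 1
    print(f"copied {shared}/{len(child_vocab)} sub-word embeddings from parent")
    return child_emb

# Toy usage (invented vocabularies): a larger sub-word vocabulary tends to
# increase the shared fraction, which is one intuition behind the benefit
# of generous vocabularies for cold start transfer.
parent_vocab = ["the", "▁he", "llo", "▁wor", "ld", "ing"]
parent_emb = np.random.default_rng(1).normal(size=(len(parent_vocab), 4))
child_vocab = ["the", "▁wor", "ld", "▁dia"]
child_emb = transfer_embeddings(parent_vocab, parent_emb, child_vocab)
```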
A Post-Editing Dataset in the Legal Domain: Do we Underestimate Neural Machine Translation Quality?
Julia Ive | Lucia Specia | Sara Szoc | Tom Vanallemeersch | Joachim Van den Bogaert | Eduardo Farah | Christine Maroti | Artur Ventura | Maxim Khalilov
Proceedings of The 12th Language Resources and Evaluation Conference
We introduce a machine translation dataset for three pairs of languages in the legal domain, with post-edited high-quality neural machine translation and independent human references. The data was collected as part of the EU APE-QUEST project and comprises crawled content from EU websites with translation from English into three European languages: Dutch, French and Portuguese. Altogether, the data consists of around 31K tuples, each including a source sentence, the respective machine translation by a neural machine translation system, a post-edited version of that translation by a professional translator, and, where available, the original reference translation crawled from parallel language websites. We describe the data collection process, provide an analysis of the resulting post-edits and benchmark the data using state-of-the-art quality estimation and automatic post-editing models. One interesting by-product of our post-editing analysis is the finding that neural systems built with publicly available general-domain data can provide high-quality translations, even though comparison against human references suggests that this quality is quite low. This makes our dataset a suitable candidate to test evaluation metrics. The data is freely available as an ELRC-SHARE resource.
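A standard way to quantify post-editing effort on such (MT, post-edit) pairs is HTER: the edit distance from the MT output to its human post-edit, normalised by post-edit length. The sketch below computes a simplified word-level HTER in plain Python (standard TER also counts block shifts, which are omitted here); the example sentences are invented and do not come from the dataset.

```python
def levenshtein(a, b):
    """Word-level edit distance (insertions, deletions, substitutions)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def hter(mt, post_edit):
    """Simplified HTER: edits needed to turn the MT output into its
    post-edit, normalised by post-edit length. Lower = less editing."""
    mt_toks, pe_toks = mt.split(), post_edit.split()
    return levenshtein(mt_toks, pe_toks) / max(len(pe_toks), 1)

# Invented legal-domain example: one deletion relative to the post-edit.
mt = "the contract is void if signed under the duress"
pe = "the contract is void if signed under duress"
print(f"HTER = {hter(mt, pe):.3f}")
```

Low average HTER over a collection of tuples is one signal that the MT output needed little correction, which is how a post-editing analysis can surface quality that reference-based metrics miss.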
Co-authors
- Tom Vanallemeersch 2
- Joachim Van den Bogaert 2
- Arne Defauw 1
- Koen Van Winckel 1
- Julia Ive 1
Venues
- LREC 2