Julia Ive
2020
A Post-Editing Dataset in the Legal Domain: Do we Underestimate Neural Machine Translation Quality?
Julia Ive | Lucia Specia | Sara Szoc | Tom Vanallemeersch | Joachim Van den Bogaert | Eduardo Farah | Christine Maroti | Artur Ventura | Maxim Khalilov
Proceedings of The 12th Language Resources and Evaluation Conference
We introduce a machine translation dataset for three language pairs in the legal domain, with post-edited high-quality neural machine translation and independent human references. The data was collected as part of the EU APE-QUEST project and comprises content crawled from EU websites, translated from English into three European languages: Dutch, French and Portuguese. Altogether, the data consists of around 31K tuples, each including a source sentence, the respective translation by a neural machine translation system, a post-edited version of that translation produced by a professional translator, and, where available, the original reference translation crawled from parallel-language websites. We describe the data collection process, provide an analysis of the resulting post-edits and benchmark the data using state-of-the-art quality estimation and automatic post-editing models. An interesting by-product of our post-editing analysis is that neural systems built with publicly available general-domain data can provide high-quality translations, even though comparison to human references suggests that this quality is quite low. This makes our dataset a suitable candidate for testing evaluation metrics. The data is freely available as an ELRC-SHARE resource.
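The abstract mentions an analysis of the resulting post-edits; a standard measure for this kind of analysis is HTER, the edit distance between an MT output and its post-edited version, normalised by the post-edit length. Below is a minimal, illustrative sketch: the function names and the sample tuple are hypothetical and not taken from the released dataset, and plain token-level Levenshtein distance is used as a simplified proxy for TER (which additionally counts block shifts).

```python
# A minimal sketch (not the authors' code): approximate HTER between an MT
# output and its post-edit as token-level Levenshtein distance divided by
# the post-edit length. True TER also counts block shifts; this is a proxy.

def levenshtein(a: list[str], b: list[str]) -> int:
    """Token-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def hter(mt: str, post_edit: str) -> float:
    """Edits needed to turn the MT output into its post-edit, per reference token."""
    hyp, ref = mt.split(), post_edit.split()
    return levenshtein(hyp, ref) / max(len(ref), 1)

# One tuple of the kind the dataset contains (values here are hypothetical):
sample = {
    "source": "The committee shall adopt its rules of procedure.",
    "mt": "Het comite stelt zijn reglement van orde vast .",
    "post_edit": "Het comité stelt zijn reglement van orde vast .",
}
print(f"HTER: {hter(sample['mt'], sample['post_edit']):.3f}")  # one substitution
```

A low average HTER over such tuples would indicate that translators changed little, which is consistent with the paper's observation that the NMT output is of higher quality than reference-based comparison suggests.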
Exploring Transformer Text Generation for Medical Dataset Augmentation
Ali Amin-Nejad | Julia Ive | Sumithra Velupillai
Proceedings of The 12th Language Resources and Evaluation Conference
Natural Language Processing (NLP) can help unlock the vast troves of unstructured data in clinical text and thus improve healthcare research. However, a major barrier to progress in this field is data access: patient confidentiality prohibits the sharing of such data, resulting in small, fragmented and sequestered openly available datasets. Since NLP model development requires large quantities of data, we aim to side-step this roadblock by exploring the use of Natural Language Generation to augment datasets so that they can be used for NLP model development on downstream, clinically relevant tasks. We propose a methodology that guides generation with structured patient information in a sequence-to-sequence manner. We experiment with state-of-the-art Transformer models and demonstrate that training on our augmented dataset outperforms our baselines on a downstream classification task. Finally, we create a user interface and release the scripts for training generation models to stimulate further research in this area.
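The abstract describes conditioning text generation on structured patient information in a sequence-to-sequence manner. The sketch below illustrates that general idea, not the paper's actual pipeline: the pretrained checkpoint (t5-small), the field names, and the linearisation scheme are all assumptions for illustration, since the paper trains its own Transformer models on clinical data that cannot be shared.

```python
# A minimal sketch (not the paper's pipeline): linearise structured patient
# fields into a conditioning string and sample synthetic text from a generic
# pretrained seq2seq model. Model choice and field names are hypothetical.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # placeholder; any seq2seq LM could stand in here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def linearise(record: dict) -> str:
    """Flatten structured patient fields into a single source sequence."""
    return " ; ".join(f"{k}: {v}" for k, v in record.items())

record = {"age": "67", "gender": "F", "diagnosis": "pneumonia"}  # hypothetical
inputs = tokenizer(linearise(record), return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=64,
    do_sample=True,        # sampling yields more varied augmentation data
    top_p=0.95,
    num_return_sequences=3,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```

In an augmentation setting, the sampled sequences would be paired with labels derived from the conditioning record and mixed into the training data for the downstream classifier.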