Umut Sulubacak


2020

OpusTools and Parallel Corpus Diagnostics
Mikko Aulamo | Umut Sulubacak | Sami Virpioja | Jörg Tiedemann
Proceedings of The 12th Language Resources and Evaluation Conference

This paper introduces OpusTools, a package for downloading and processing parallel corpora included in the OPUS corpus collection. The package implements tools for accessing compressed data in their archived release format and makes it possible to easily convert between common formats. OpusTools also includes tools for language identification and data filtering as well as tools for importing data from various sources into the OPUS format. We show the use of these tools in parallel corpus creation and data diagnostics. The latter is especially useful for the identification of potential problems and errors in the extensive data set. Using these tools, we can now monitor the validity of data sets and improve the overall quality and consistency of the data collection.

The University of Helsinki Submission to the IWSLT2020 Offline Speech Translation Task
Raúl Vázquez | Mikko Aulamo | Umut Sulubacak | Jörg Tiedemann
Proceedings of the 17th International Conference on Spoken Language Translation

This paper describes the University of Helsinki Language Technology group’s participation in the IWSLT 2020 offline speech translation task, addressing the translation of English audio into German text. In line with this year’s task objective, we train both cascade and end-to-end systems for spoken language translation. We opt for an end-to-end multitasking architecture with shared internal representations and a cascade approach that follows a standard procedure consisting of ASR, correction, and MT stages. We also describe the experiments that served as a basis for the submitted systems. Our experiments reveal that multitasking training with shared internal representations is not only possible but also allows for knowledge transfer across modalities.