Faheem Kirefu
2020
Parallel Sentence Mining by Constrained Decoding
Pinzhen Chen | Nikolay Bogoychev | Kenneth Heafield | Faheem Kirefu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
We present a novel method to extract parallel sentences from two monolingual corpora, using neural machine translation. Our method relies on translating sentences in one corpus, but constraining the decoding by a prefix tree built on the other corpus. We argue that a neural machine translation system by itself can be a sentence similarity scorer and it efficiently approximates pairwise comparison with a modified beam search. When benchmarked on the BUCC shared task, our method achieves results comparable to other submissions.
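As a rough illustration of the idea in the abstract above, the following sketch builds a prefix tree over a tiny target-side corpus and runs a beam search that may only extend a hypothesis with tokens the tree allows. It is a toy under stated assumptions, not the authors' implementation: `build_trie`, `constrained_beam_search`, and `toy_score` are hypothetical names, and `toy_score` is a fixed lookup table standing in for the log-probabilities a real neural translation model would assign given the source sentence.

```python
from math import log

def build_trie(sentences):
    """Build a nested-dict prefix tree over tokenized sentences."""
    root = {}
    for tokens in sentences:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[None] = True  # marks the end of a complete sentence
    return root

def constrained_beam_search(trie, score_next, beam_size=2):
    """Return (log_prob, tokens) pairs; every result is a full trie path."""
    beam = [(0.0, [], trie)]
    finished = []
    while beam:
        candidates = []
        for logp, prefix, node in beam:
            # Only tokens that continue some corpus sentence are considered.
            for tok, child in node.items():
                if tok is None:
                    finished.append((logp, prefix))
                else:
                    step = score_next(prefix, tok)
                    candidates.append((logp + step, prefix + [tok], child))
        # Keep the highest-scoring partial hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return sorted(finished, key=lambda f: f[0], reverse=True)

def toy_score(prefix, token):
    """Hypothetical stand-in for an NMT model's next-token log-probability."""
    probs = {"the": 0.5, "cat": 0.3, "dog": 0.1, "sat": 0.4, "ran": 0.05}
    return log(probs.get(token, 0.01))

# Assumed toy "other corpus": three tokenized target-side sentences.
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "bird"]]
trie = build_trie(corpus)
results = constrained_beam_search(trie, toy_score, beam_size=2)
best = results[0][1]
```

Because every hypothesis follows a path in the tree, the search can only output sentences that exist in the corpus, and the accumulated score of each surviving hypothesis can be read as a similarity score for that candidate pair. The beam prunes unlikely candidates early, which is what avoids scoring every sentence pair explicitly.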
ParaCrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón | Pinzhen Chen | Barry Haddow | Kenneth Heafield | Hieu Hoang | Miquel Esplà-Gomis | Mikel L. Forcada | Amir Kamran | Faheem Kirefu | Philipp Koehn | Sergio Ortiz Rojas | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Elsa Sarrías | Marek Strelec | Brian Thompson | William Waites | Dion Wiggins | Jaume Zaragoza
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.
Co-authors
- Pinzhen Chen 2
- Kenneth Heafield 2
- Nikolay Bogoychev 1
- Marta Bañón 1
- Barry Haddow 1
Venues
- ACL 2