Paul McNamee
2020
Benchmarking Neural and Statistical Machine Translation on Low-Resource African Languages
Kevin Duh | Paul McNamee | Matt Post | Brian Thompson
Proceedings of The 12th Language Resources and Evaluation Conference
Research in machine translation (MT) is developing at a rapid pace. However, most work in the community has focused on languages where large amounts of digital resources are available. In this study, we benchmark state-of-the-art statistical and neural machine translation systems on two African languages that do not have large amounts of resources: Somali and Swahili. These languages are of social importance and serve as test-beds for developing technologies that perform reasonably well despite the low-resource constraint. Our findings suggest that statistical machine translation (SMT) and neural machine translation (NMT) can perform similarly in low-resource scenarios, but neural systems require more careful tuning to match performance. We also investigate how to exploit additional data, such as bilingual text harvested from the web or user dictionaries; we find that NMT can significantly improve in performance with the use of these additional data. Finally, we survey the landscape of machine translation resources for the languages of Africa and provide some suggestions for promising future research directions.
Tagging Location Phrases in Text
Paul McNamee | James Mayfield | Cash Costello | Caitlyn Bishop | Shelby Anderson
Proceedings of The 12th Language Resources and Evaluation Conference
For over thirty years researchers have studied the problem of automatically detecting named entities in written language. Throughout this time the majority of such work has focused on detection and classification of entities into coarse-grained types such as PERSON, ORGANIZATION, and LOCATION. Less attention has been paid to non-named mentions of entities, including non-named location phrases such as “the medical clinic in Telonge” or “2 km below the Dolin Maniche bridge”. In this work we describe the Location Phrase Detection task, which aims to identify such spans. Our key accomplishments include: developing a sequential tagging approach; crafting annotation guidelines; building annotated datasets for English and Russian news; and conducting experiments in automated detection of location phrases with both statistical and neural taggers. This work is motivated by extracting rich location information to support situational awareness during humanitarian crises such as natural disasters.
Dragonfly: Advances in Non-Speaker Annotation for Low Resource Languages
Cash Costello | Shelby Anderson | Caitlyn Bishop | James Mayfield | Paul McNamee
Proceedings of The 12th Language Resources and Evaluation Conference
Dragonfly is an open-source software tool that supports annotation of text in a low-resource language by non-speakers of that language. Using semantic and contextual information, non-speakers familiar with the Latin script can produce high-quality named entity annotations to support construction of a name tagger. We describe a procedure for annotating low-resource languages using Dragonfly that others can use, developed from our experience annotating data in more than ten languages. We also present performance comparisons between models trained on native-speaker and non-speaker annotations.