Alexey Sorokin
2020
UniMorph 3.0: Universal Morphology
Arya D. McCarthy | Christo Kirov | Matteo Grella | Amrit Nidhi | Patrick Xia | Kyle Gorman | Ekaterina Vylomova | Sabrina J. Mielke | Garrett Nicolai | Miikka Silfverberg | Timofey Arkhangelskiy | Nataly Krizhanovsky | Andrew Krizhanovsky | Elena Klyachko | Alexey Sorokin | John Mansfield | Valts Ernštreits | Yuval Pinter | Cassandra L. Jacobs | Ryan Cotterell | Mans Hulden | David Yarowsky
Proceedings of The 12th Language Resources and Evaluation Conference
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological paradigms for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. We have implemented several improvements to the extraction pipeline which creates most of our data, so that it is both more complete and more correct. We have added 66 new languages, as well as new parts of speech for 12 languages. We have also amended the schema in several ways. Finally, we present three new community tools: two to validate data for resource creators, and one to make morphological data available from the command line. UniMorph is based at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in Baltimore, Maryland. This paper details advances made to the schema, tooling, and dissemination of project resources since the UniMorph 2.0 release described at LREC 2018.
Getting More Data for Low-resource Morphological Inflection: Language Models and Data Augmentation
Alexey Sorokin
Proceedings of The 12th Language Resources and Evaluation Conference
We investigate how to improve the quality of low-resource morphological inflection without annotating more data. We examine two methods: language models and data augmentation. We show that a model whose decoder additionally uses the states of the language model improves quality by 1.5% in combination with both baselines. We also demonstrate that data augmentation improves performance by 9% on average when 1000 artificially generated word forms are added to the dataset.