Mitch Marcus

Also published as: Mitchell Marcus


2020

pdf bib
Modeling Morphological Typology for Unsupervised Learning of Language Morphology
Hongzhi Xu | Jordan Kodner | Mitchell Marcus | Charles Yang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper describes a language-independent model for fully unsupervised morphological analysis that exploits a universal framework leveraging morphological typology. By modeling morphological processes including suffixation, prefixation, infixation, and full and partial reduplication with constrained stem change rules, our system effectively constrains the search space and offers a wide coverage in terms of morphological typology. The system is tested on nine typologically and genetically diverse languages, and shows superior performance over leading systems. We also investigate the effect of an oracle that provides only a handful of bits per language to signal morphological type.

pdf bib
Morphological Segmentation for Low Resource Languages
Justin Mott | Ann Bies | Stephanie Strassel | Jordan Kodner | Caitlin Richter | Hongzhi Xu | Mitchell Marcus
Proceedings of The 12th Language Resources and Evaluation Conference

This paper describes a new morphology resource created by Linguistic Data Consortium and the University of Pennsylvania for the DARPA LORELEI Program. The data consists of approximately 2000 tokens annotated for morphological segmentation in each of 9 low resource languages, along with root information for 7 of the languages. The languages annotated show a broad diversity of typological features. A minimal annotation scheme for segmentation was developed such that it could capture the patterns of a wide range of languages and also be performed reliably by non-linguist annotators. The basic annotation guidelines were designed to be language-independent, but included language-specific morphological paradigms and other specifications. The resulting annotated corpus is designed to support and stimulate the development of unsupervised morphological segmenters and analyzers by providing a gold standard for their evaluation on a more typologically diverse set of languages than has previously been available. By providing root annotation, this corpus is also a step toward supporting research in identifying richer morphological structures than simple morpheme boundaries.