2020
pdf
bib
abs
Emerging Cross-lingual Structure in Pretrained Language Models
Alexis Conneau
|
Shijie Wu
|
Haoran Li
|
Luke Zettlemoyer
|
Veselin Stoyanov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
We study the problem of multilingual masked language modeling, i.e. the training of a single model on concatenated text from multiple languages, and present a detailed study of several factors that influence why these models are so effective for cross-lingual transfer. We show, contrary to what was previously hypothesized, that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains. The only requirement is that there are some shared parameters in the top layers of the multi-lingual encoder. To better understand this result, we also show that representations from monolingual BERT models in different languages can be aligned post-hoc quite effectively, strongly suggesting that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces. For multilingual masked language modeling, these symmetries are automatically discovered and aligned during the joint training process.
pdf
bib
abs
The Paradigm Discovery Problem
Alexander Erdmann
|
Micha Elsner
|
Shijie Wu
|
Ryan Cotterell
|
Nizar Habash
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
This work treats the paradigm discovery problem (PDP), the task of learning an inflectional morphological system from unannotated sentences. We formalize the PDP and develop evaluation metrics for judging systems. Using currently available resources, we construct datasets for the task. We also devise a heuristic benchmark for the PDP and report empirical results on five diverse languages. Our benchmark system first makes use of word embeddings and string similarity to cluster forms by cell and by paradigm. Then, we bootstrap a neural transducer on top of the clustered data to predict words to realize the empty paradigm slots. An error analysis of our system suggests clustering by cell across different inflection classes is the most pressing challenge for future work.
pdf
bib
abs
SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection
Ekaterina Vylomova
|
Jennifer White
|
Elizabeth Salesky
|
Sabrina J. Mielke
|
Shijie Wu
|
Edoardo Maria Ponti
|
Rowan Hall Maudslay
|
Ran Zmigrod
|
Josef Valvoda
|
Svetlana Toldova
|
Francis Tyers
|
Elena Klyachko
|
Ilya Yegorov
|
Natalia Krizhanovsky
|
Paula Czarnowska
|
Irene Nikkarinen
|
Andrew Krizhanovsky
|
Tiago Pimentel
|
Lucas Torroba Hennigen
|
Christo Kirov
|
Garrett Nicolai
|
Adina Williams
|
Antonios Anastasopoulos
|
Hilaria Cruz
|
Eleanor Chodroff
|
Ryan Cotterell
|
Miikka Silfverberg
|
Mans Hulden
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems’ ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrate utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems and achieved over 90% mean accuracy while others were more challenging.
pdf
bib
abs
The SIGMORPHON 2020 Shared Task on Multilingual Grapheme-to-Phoneme Conversion
Kyle Gorman
|
Lucas F.E. Ashby
|
Aaron Goyzueta
|
Arya McCarthy
|
Shijie Wu
|
Daniel You
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
We describe the design and findings of the SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. Participants were asked to submit systems which take in a sequence of graphemes in a given language as input, then output a sequence of phonemes representing the pronunciation of that grapheme sequence. Nine teams submitted a total of 23 systems, at best achieving a 18% relative reduction in word error rate (macro-averaged over languages), versus strong neural sequence-to-sequence baselines. To facilitate error analysis, we publicly release the complete outputs for all systems—a first for the SIGMORPHON workshop.
pdf
bib
abs
Are All Languages Created Equal in Multilingual BERT?
Shijie Wu
|
Mark Dredze
Proceedings of the 5th Workshop on Representation Learning for NLP
Multilingual BERT (mBERT) trained on 104 languages has shown surprisingly good cross-lingual performance on several NLP tasks, even without explicit cross-lingual signals. However, these evaluations have focused on cross-lingual transfer with high-resource languages, covering only a third of the languages covered by mBERT. We explore how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages, measured by within-language performance. We consider three tasks: Named Entity Recognition (99 languages), Part-of-speech Tagging and Dependency Parsing (54 languages each). mBERT does better than or comparable to baselines on high resource languages but does much worse for low resource languages. Furthermore, monolingual BERT models for these languages do even worse. Paired with similar languages, the performance gap between monolingual BERT and mBERT can be narrowed. We find that better models for low resource languages require more efficient pretraining techniques or more data.