Mark Dredze


2020

pdf bib
Sources of Transfer in Multilingual Named Entity Recognition
David Mueller | Nicholas Andrews | Mark Dredze
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Named-entities are inherently multilingual, and annotations in any given language may be limited. This motivates us to consider polyglot named-entity recognition (NER), where one model is trained using annotated data drawn from more than one language. However, a straightforward implementation of this simple idea does not always work in practice: naive training of NER models using annotated data drawn from multiple languages consistently underperforms models trained on monolingual data alone, despite having access to more training data. The starting point of this paper is a simple solution to this problem, in which polyglot models are fine-tuned on monolingual data to consistently and significantly outperform their monolingual counterparts. To explain this phenomena, we explore the sources of multilingual transfer in polyglot NER models and examine the weight structure of polyglot models compared to their monolingual counterparts. We find that polyglot models efficiently share many parameters across languages and that fine-tuning may utilize a large number of those parameters.

pdf bib
Clinical Concept Linking with Contextualized Neural Representations
Elliot Schumacher | Andriy Mulyar | Mark Dredze
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In traditional approaches to entity linking, linking decisions are based on three sources of information – the similarity of the mention string to an entity’s name, the similarity of the context of the document to the entity, and broader information about the knowledge base (KB). In some domains, there is little contextual information present in the KB and thus we rely more heavily on mention string similarity. We consider one example of this, concept linking, which seeks to link mentions of medical concepts to a medical concept ontology. We propose an approach to concept linking that leverages recent work in contextualized neural models, such as ELMo (Peters et al. 2018), which create a token representation that integrates the surrounding context of the mention and concept name. We find a neural ranking approach paired with contextualized embeddings provides gains over a competitive baseline (Leaman et al. 2013). Additionally, we find that a pre-training step using synonyms from the ontology offers a useful initialization for the ranker.

pdf bib
Are All Languages Created Equal in Multilingual BERT?
Shijie Wu | Mark Dredze
Proceedings of the 5th Workshop on Representation Learning for NLP

Multilingual BERT (mBERT) trained on 104 languages has shown surprisingly good cross-lingual performance on several NLP tasks, even without explicit cross-lingual signals. However, these evaluations have focused on cross-lingual transfer with high-resource languages, covering only a third of the languages covered by mBERT. We explore how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages, measured by within-language performance. We consider three tasks: Named Entity Recognition (99 languages), Part-of-speech Tagging and Dependency Parsing (54 languages each). mBERT does better than or comparable to baselines on high resource languages but does much worse for low resource languages. Furthermore, monolingual BERT models for these languages do even worse. Paired with similar languages, the performance gap between monolingual BERT and mBERT can be narrowed. We find that better models for low resource languages require more efficient pretraining techniques or more data.