Pierre Zweigenbaum


2020

pdf bib
Handling Entity Normalization with no Annotated Corpus: Weakly Supervised Methods Based on Distributional Representation and Ontological Information
Arnaud Ferré | Robert Bossy | Mouhamadou Ba | Louise Deléger | Thomas Lavergne | Pierre Zweigenbaum | Claire Nédellec
Proceedings of The 12th Language Resources and Evaluation Conference

Entity normalization (or entity linking) is an important subtask of information extraction that links entity mentions in text to categories or concepts in a reference vocabulary. Machine learning based normalization methods have good adaptability as long as they have enough training data per reference with a sufficient quality. Distributional representations are commonly used because of their capacity to handle different expressions with similar meanings. However, in specific technical and scientific domains, the small amount of training data and the relatively small size of specialized corpora remain major challenges. Recently, the machine learning-based CONTES method has addressed these challenges for reference vocabularies that are ontologies, as is often the case in life sciences and biomedical domains. And yet, its performance is dependent on manually annotated corpus. Furthermore, like other machine learning based methods, parametrization remains tricky. We propose a new approach to address the scarcity of training data that extends the CONTES method by corpus selection, pre-processing and weak supervision strategies, which can yield high-performance results without any manually annotated examples. We also study which hyperparameters are most influential, with sometimes different patterns compared to previous work. The results show that our approach significantly improves accuracy and outperforms previous state-of-the-art algorithms.

pdf bib
Proceedings of the 13th Workshop on Building and Using Comparable Corpora
Reinhard Rapp | Pierre Zweigenbaum | Serge Sharoff
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

pdf bib
Overview of the Fourth BUCC Shared Task: Bilingual Dictionary Induction from Comparable Corpora
Reinhard Rapp | Pierre Zweigenbaum | Serge Sharoff
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

The shared task of the 13th Workshop on Building and Using Comparable Corpora was devoted to the induction of bilingual dictionaries from comparable rather than parallel corpora. In this task, for a number of language pairs involving Chinese, English, French, German, Russian and Spanish, the participants were supposed to determine automatically the target language translations of several thousand source language test words of three frequency ranges. We describe here some background, the task definition, the training and test data sets and the evaluation used for ranking the participating systems. We also summarize the approaches used and present the results of the evaluation. In conclusion, the outcome of the competition are the results of a number of systems which provide surprisingly good solutions to the ambitious problem.