Béatrice Daille

Also published as: Beatrice Daille


2020

Proceedings of the 6th International Workshop on Computational Terminology
Béatrice Daille | Kyo Kageura | Ayla Rigouts Terryn
Proceedings of the 6th International Workshop on Computational Terminology

A study of semantic projection from single word terms to multi-word terms in the environment domain
Yizhe WANG | Beatrice Daille | Nabil Hathout
Proceedings of the 6th International Workshop on Computational Terminology

The semantic projection method is often used in terminology structuring to infer semantic relations between terms. Semantic projection relies upon the assumption of semantic compositionality: the relation that links simple term pairs remains valid in pairs of complex terms built from these simple terms. This paper investigates whether this assumption, commonly adopted in natural language processing, is actually valid. First, we describe the process of constructing a list of semantically linked multi-word terms (MWTs) related to the environmental field through the extraction of semantic variants. Second, we present our analysis of the results from the semantic projection. We find that contexts play an essential role in defining the relations between MWTs.
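
The compositionality assumption described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the seed relations, the example terms, and the function name `project` are hypothetical, and relations are treated as symmetric for simplicity. The sketch only proposes a candidate relation for two MWTs that differ in exactly one word; as the abstract notes, context decides whether the relation actually holds.

```python
# Hypothetical seed relations between single-word terms (illustrative data only).
# Only symmetric relations are used, so lookup order does not matter.
SWT_RELATIONS = {
    ("pollution", "contamination"): "synonym",
    ("increase", "decrease"): "antonym",
}


def project(mwt_a, mwt_b, relations):
    """Return the relation projected onto two MWTs, or None.

    Projection applies only when both terms have the same length and
    differ in exactly one position whose word pair has a known relation.
    """
    words_a, words_b = mwt_a.split(), mwt_b.split()
    if len(words_a) != len(words_b):
        return None
    # Collect the positions where the two terms differ.
    diffs = [(a, b) for a, b in zip(words_a, words_b) if a != b]
    if len(diffs) != 1:
        return None
    a, b = diffs[0]
    # Symmetric lookup: try both orders of the differing word pair.
    return relations.get((a, b)) or relations.get((b, a))


if __name__ == "__main__":
    print(project("air pollution", "air contamination", SWT_RELATIONS))
    print(project("temperature increase", "temperature decrease", SWT_RELATIONS))
```

A directed relation such as hypernymy would need an order-sensitive lookup instead of the symmetric one used here.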

Towards Automatic Thesaurus Construction and Enrichment.
Amir Hazem | Beatrice Daille | Lanza Claudia
Proceedings of the 6th International Workshop on Computational Terminology

Thesaurus construction with minimal human effort often relies on automatic methods to discover terms and their relations. Hence, the quality of a thesaurus heavily depends on the methodologies chosen for: (i) building its content (terminology extraction task) and (ii) designing its structure (semantic similarity task). Existing methods for automatic thesaurus construction are still less accurate than handcrafted thesauri, so it is important to highlight their drawbacks so that new strategies can build more accurate thesaurus models. In this paper, we provide a systematic analysis of existing methods for both tasks and discuss their feasibility based on an Italian Cybersecurity corpus. In particular, we provide a detailed analysis of how the network of semantic relationships of a thesaurus can be built automatically, and investigate ways to enrich the terminological scope of a thesaurus by taking into account the information contained in external domain-oriented semantic sets.

TermEval 2020: TALN-LS2N System for Automatic Term Extraction
Amir Hazem | Mérieme Bouhandi | Florian Boudin | Beatrice Daille
Proceedings of the 6th International Workshop on Computational Terminology

Automatic terminology extraction is a notoriously difficult task that aims to reduce the effort needed to manually identify terms in domain-specific corpora by automatically providing a ranked list of candidate terms. The main approaches to this task fall into four categories: (i) rule-based approaches, (ii) feature-based approaches, (iii) context-based approaches, and (iv) hybrid approaches. For this first TermEval shared task, we explore a feature-based approach and a deep neural network multitask approach, BERT, that we fine-tune for term extraction. We show that BERT models (RoBERTa for English and CamemBERT for French) outperform other systems for French and English.

Books of Hours. The First Liturgical Data Set for Text Segmentation.
Amir Hazem | Beatrice Daille | Christopher Kermorvant | Dominique Stutzmann | Marie-Laurence Bonhomme | Martin Maarand | Mélodie Boillet
Proceedings of The 12th Language Resources and Evaluation Conference

The Book of Hours was the bestseller of the late Middle Ages and Renaissance. It is an invaluable historical treasure, documenting the devotional practices of Christians in the late Middle Ages. Up to now, its textual content has been scarcely studied because of its manuscript nature, its length and its complex content. At first glance, it looks too standardized. However, the study of books of hours raises important challenges: (i) in image analysis, its often lavish ornamentation (illegible painted initials, line-fillers, etc.), abbreviated words, and multilingualism are difficult to address in Handwritten Text Recognition (HTR); (ii) its hierarchical entangled structure offers a new field of investigation for text segmentation; (iii) in digital humanities, its textual content gives opportunities for historical analysis. In this paper, we provide the first corpus of books of hours, which consists of Latin transcriptions of 300 books of hours generated by HTR, which is like Optical Character Recognition (OCR) but for handwritten rather than printed texts. We designed a structural scheme of the book of hours and manually annotated two books of hours according to this scheme. Lastly, we performed a systematic evaluation of the main state-of-the-art text segmentation approaches.