2021
pdf
bib
Daniel at the FinSBD-2 Task: Extracting List and Sentence Boundaries from PDF Documents, a model-driven approach to PDF document analysis
Emmanuel Giguet
|
Gaël Lejeune
Proceedings of the Second Workshop on Financial Technology and Natural Language Processing
2020
pdf
bib
abs
A Dataset for Multi-lingual Epidemiological Event Extraction
Stephen Mutuvi
|
Antoine Doucet
|
Gaël Lejeune
|
Moses Odeo
Proceedings of The 12th Language Resources and Evaluation Conference
This paper proposes a corpus for the development and evaluation of tools and techniques for identifying emerging infectious disease threats in online news text. The corpus can not only be used for information extraction, but also for other natural language processing (NLP) tasks such as text classification. We make use of articles published on the Program for Monitoring Emerging Diseases (ProMED) platform, which provides current information about outbreaks of infectious diseases globally. Among the key pieces of information present in the articles is the uniform resource locator (URL) to the online news sources where the outbreaks were originally reported. We detail the procedure followed to build the dataset, which includes leveraging the source URLs to retrieve the news reports and subsequently pre-processing the retrieved documents. We also report on experimental results of event extraction on the dataset using the Data Analysis for Information Extraction in any Language(DAnIEL) system. DAnIEL is a multilingual news surveillance system that leverages unique attributes associated with news reporting to extract events: repetition and saliency. The system has wide geographical and language coverage, including low-resource languages. In addition, we compare different classification approaches in terms of their ability to differentiate between epidemic-related and unrelated news articles that constitute the corpus.
pdf
bib
abs
Out-of-the-Box and into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools
Adrien Barbaresi
|
Gaël Lejeune
Proceedings of the 12th Web as Corpus Workshop
This article examines extraction methods designed to retain the main text content of web pages and discusses how the extraction could be oriented and evaluated: can and should it be as generic as possible to ensure opportunistic corpus construction? The evaluation grounds on a comparative benchmark of open-source tools used on pages in five different languages (Chinese, English, Greek, Polish and Russian), it features several metrics to obtain more fine-grained differentiations. Our experiments highlight the diversity of web page layouts across languages or publishing countries. These discrepancies are reflected by diverging performances so that the right tool has to be chosen accordingly.
pdf
bib
abs
Dating Ancient texts: an Approach for Noisy French Documents
Anaëlle Baledent
|
Nicolas Hiebel
|
Gaël Lejeune
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages
Automatic dating of ancient documents is a very important area of research for digital humanities applications. Many documents available via digital libraries do not have any dating or dating that is uncertain. Document dating is not only useful by itself but it also helps to choose the appropriate NLP tools (lemmatizer, POS tagger ) for subsequent analysis. This paper provides a dataset with thousands of ancient documents in French and present methods and evaluation metrics for this task. We compare character-level methods with token-level methods on two different datasets of two different time periods and two different text genres. Our results show that character-level models are more robust to noise than classical token-level models. The experiments presented in this article focused on documents written in French but we believe that the ability of character-level models to handle noise properly would help to achieve comparable results on other languages and more ancient languages in particular.