Adrien Barbaresi
2020
Proceedings of the 12th Web as Corpus Workshop
Adrien Barbaresi | Felix Bildhauer | Roland Schäfer | Egon Stemle
Out-of-the-Box and into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools
Adrien Barbaresi | Gaël Lejeune
Proceedings of the 12th Web as Corpus Workshop
This article examines extraction methods designed to retain the main text content of web pages and discusses how the extraction could be oriented and evaluated: can and should it be as generic as possible to ensure opportunistic corpus construction? The evaluation is based on a comparative benchmark of open-source tools applied to pages in five languages (Chinese, English, Greek, Polish and Russian) and uses several metrics to allow more fine-grained distinctions. Our experiments highlight the diversity of web page layouts across languages or publishing countries. These discrepancies are reflected in diverging performance, so the right tool has to be chosen accordingly.
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora
Piotr Bański | Adrien Barbaresi | Simon Clematide | Marc Kupietz | Harald Lüngen | Ines Pisetta
Co-authors
- Felix Bildhauer 1
- Roland Schäfer 1
- Egon Stemle 1
- Gaël Lejeune 1
- Piotr Bański 1