Balázs Indig
2020
The xtsv Framework and the Twelve Virtues of Pipelines
Balázs Indig
|
Bálint Sass
|
Iván Mittelholcz
Proceedings of The 12th Language Resources and Evaluation Conference
We present xtsv, an abstract framework for building NLP pipelines. It covers several kinds of functionalities which can be implemented at an abstract level. We survey these features and argue that all are desired in a modern pipeline. The framework has a simple yet powerful internal communication format which is essentially tsv (tab separated values) with header plus some additional features. We put emphasis on the capabilities of the presented framework, for example its ability to allow new modules to be easily integrated or replaced, or the variety of its usage options. When a module is put into xtsv, all functionalities of the system are immediately available for that module, and the module can be be a part of an xtsv pipeline. The design also allows convenient investigation and manual correction of the data flow from one module to another. We demonstrate the power of our framework with a successful application: a concrete NLP pipeline for Hungarian called e-magyar text processing system (emtsv) which integrates Hungarian NLP tools in xtsv. All the advantages of the pipeline come from the inherent properties of the xtsv framework.
The ELTE.DH Pilot Corpus – Creating a Handcrafted Gigaword Web Corpus with Metadata
Balázs Indig
|
Árpád Knap
|
Zsófia Sárközi-Lindner
|
Mária Timári
|
Gábor Palkó
Proceedings of the 12th Web as Corpus Workshop
In this article, we present the method we used to create a middle-sized corpus using targeted web crawling. Our corpus contains news portal articles along with their metadata, that can be useful for diverse audiences, ranging from digital humanists to NLP users. The method presented in this paper applies rule-based components that allow the curation of the text and the metadata content. The curated data can thereon serve as a reference for various tasks and measurements. We designed our workflow to encourage modification and customisation. Our concept can also be applied to other genres of portals by using the discovered patterns in the architecture of the portals. We found that for a systematic creation or extension of a similar corpus, our method provides superior accuracy and ease of use compared to The Wayback Machine, while requiring minimal manpower and computational resources. Reproducing the corpus is possible if changes are introduced to the text-extraction process. The standard TEI format and Schema.org encoded metadata is used for the output format, but we stress that placing the corpus in a digital repository system is recommended in order to be able to define semantic relations between the segments and to add rich annotation.
Search
Co-authors
- Bálint Sass 1
- Iván Mittelholcz 1
- Árpád Knap 1
- Zsófia Sárközi-Lindner 1
- Mária Timári 1
- show all...