Christoph Draxler

2020

Building a Time-Aligned Cross-Linguistic Reference Corpus from Language Documentation Data (DoReCo)
Ludger Paschen | François Delafontaine | Christoph Draxler | Susanne Fuchs | Matthew Stave | Frank Seifart
Proceedings of The 12th Language Resources and Evaluation Conference

Natural speech data on many languages have been collected by language documentation projects aiming to preserve linguistic and cultural traditions in audiovisual records. These data hold great potential for large-scale cross-linguistic research into phonetics and language processing. Major obstacles to utilizing such data for typological studies include the non-homogeneous nature of file formats and annotation conventions found both across and within archived collections. Moreover, time-aligned audio transcriptions are typically only available at the level of broad (multi-word) phrases but not at the word and segment levels. We report on solutions developed for these issues within the DoReCo (DOcumentation REference COrpus) project. DoReCo aims at providing time-aligned transcriptions for at least 50 collections of under-resourced languages. This paper gives a preliminary overview of the current state of the project and details our workflow, in particular standardization of formats and conventions, the addition of segmental alignments with WebMAUS, and DoReCo’s applicability for subsequent research programs. By making the data accessible to the scientific community, DoReCo is designed to bridge the gap between language documentation and linguistic inquiry.

A CLARIN Transcription Portal for Interview Data
Christoph Draxler | Henk van den Heuvel | Arjan van Hessen | Silvia Calamai | Louise Corti
Proceedings of The 12th Language Resources and Evaluation Conference

In this paper we present a first version of a transcription portal for audio files based on automatic speech recognition (ASR) in various languages. The portal is implemented in the CLARIN resources research network and intended for use by non-technical scholars. We explain the background and interdisciplinary nature of interview data, the perks and quirks of using ASR for transcribing the audio in a research context, the dos and don’ts for optimal use of the portal, and future developments foreseen. The portal is promoted in a range of workshops, but there are a number of challenges that have to be met. These challenges concern privacy issues, ASR quality, and cost, amongst others.

Co-authors

Henk van den Heuvel 1

Arjan van Hessen 1

Silvia Calamai 1

Louise Corti 1

Venues

LREC 2