2020
Improving the Language Model for Low-Resource ASR with Online Text Corpora
Nils Hjortnaes | Timofey Arkhangelskiy | Niko Partanen | Michael Rießler | Francis Tyers
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
In this paper, we expand on previous work on automatic speech recognition in a low-resource scenario typical of data collected by field linguists. We train DeepSpeech models on 35 hours of dialectal Komi speech recordings and correct the output using language models constructed from various sources. Previous experiments showed that transfer learning using DeepSpeech can improve the accuracy of a speech recognizer for Komi, though the error rate remained very high. In this paper we present further experiments with language models created using KenLM from text materials available online. Three language models are constructed: one from a corpus of literary texts, one from social media content, and one combining the two. We then trained the speech recognition model with each language model to explore the impact of the language model's data source. Our results show significant improvements of over 25% in character error rate and nearly 20% in word error rate. This offers important methodological insight into how ASR results can be improved under low-resource conditions: transfer learning can be used to compensate for the lack of training data in the target language, and online texts are a very useful resource when developing language models in this context.
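The character and word error rates reported here are both edit-distance metrics; a minimal sketch in Python of how they are typically computed (function names are my own, not from the paper):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, single-row DP."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds d[i-1][j-1]; d[j] still holds d[i-1][j]
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def cer(reference, hypothesis):
    """Character error rate: edits per reference character."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: edits per reference word."""
    return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())
```

The reported improvements are relative reductions in these rates after rescoring the acoustic model's output with each language model.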
A Finite-State Morphological Analyser for Evenki
Anna Zueva | Anastasia Kuznetsova | Francis Tyers
Proceedings of The 12th Language Resources and Evaluation Conference
It has been widely acknowledged that morphological analysis is an important step in automated text processing for morphologically rich languages. Evenki is a language with rich morphology, so a morphological analyser is highly desirable for processing Evenki texts and developing applications for Evenki. Although two morphological analysers for Evenki have already been developed, they are able to analyse less than half of the available Evenki corpora. The aim of this paper is to create a new morphological analyser for Evenki. It is implemented using the Helsinki Finite-State Transducer toolkit (HFST). The lexc formalism is used to specify the morphotactic rules, which define the valid orderings of morphemes in a word. Morphophonological alternations and orthographic rules are described using the twol formalism. The lexicon is extracted from available machine-readable dictionaries. Since part of the corpora consists of texts in Evenki dialects, a version of the analyser with relaxed rules is developed for processing dialectal features. We evaluate the analyser on available Evenki corpora and estimate precision, recall and F-score. We obtain coverage scores of between 61% and 87% on the available Evenki corpora.
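The lexc formalism mentioned above expresses morphotactics as a network of lexicons, each mapping morphemes to a continuation class that may follow. A minimal Python sketch of the same idea, with hypothetical placeholder morphemes (not actual Evenki data):

```python
# Each lexicon maps morphemes to the continuation class that may follow,
# mirroring how lexc continuation classes chain morphemes into words.
# Stems and suffixes below are illustrative placeholders, not real Evenki.
LEXICONS = {
    "Root":   [("oro", "Number")],            # stem continues with Number
    "Number": [("", "Case"), ("r", "Case")],  # zero (sg) or a plural marker
    "Case":   [("", "End"), ("du", "End")],   # zero (nom) or a case marker
    "End":    [("", None)],                   # end of word
}

def generate(lexicon="Root", prefix=""):
    """Enumerate every surface form licensed by the morphotactics."""
    for morpheme, cont in LEXICONS[lexicon]:
        form = prefix + morpheme
        if cont is None:
            yield form
        else:
            yield from generate(cont, form)
```

HFST compiles such continuation-class definitions into a finite-state transducer, with twol rules then rewriting morpheme boundaries to handle alternations.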
An Unsupervised Method for Weighting Finite-state Morphological Analyzers
Amr Keleg | Francis Tyers | Nick Howell | Tommi Pirinen
Proceedings of The 12th Language Resources and Evaluation Conference
Morphological analysis is a task that has been studied for many years, and a variety of techniques have been used to build models for it. Models based on finite-state transducers have proved particularly suitable for languages with few available resources. In this paper, we develop a method for weighting a morphological analyzer built using finite-state transducers in order to disambiguate its results. The method is based on a word2vec model trained in a completely unsupervised way on raw untagged corpora, which captures the semantic meaning of words. Most previous methods for disambiguating the results of a morphological analyzer rely on tagged corpora that need to be manually built. Additionally, the method developed here uses information about the token irrespective of its context, unlike most other techniques, which rely heavily on a word's context to disambiguate its set of candidate analyses.
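The core idea of scoring candidate analyses with embeddings, without any context, can be sketched as follows. The vectors and analysis labels here are toy values standing in for a trained word2vec model; this is an illustration of the general technique, not the paper's exact weighting scheme:

```python
import math

# Toy 3-dimensional embeddings standing in for a trained word2vec model.
# The token "duck" has two hypothetical candidate analyses, a noun and a verb.
VECTORS = {
    "duck":   [0.9, 0.1, 0.2],
    "duck_N": [0.8, 0.2, 0.1],   # noun reading
    "duck_V": [0.1, 0.9, 0.3],   # verb reading
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_analyses(token, analyses):
    """Order candidate analyses by embedding similarity to the bare token,
    using no contextual information -- only the token itself."""
    return sorted(analyses,
                  key=lambda a: cosine(VECTORS[token], VECTORS[a]),
                  reverse=True)
```

In the full system these similarity-derived scores become weights on the transducer's paths, so the analyzer itself returns a ranked list.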
Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection
Joakim Nivre | Marie-Catherine de Marneffe | Filip Ginter | Jan Hajič | Christopher D. Manning | Sampo Pyysalo | Sebastian Schuster | Francis Tyers | Daniel Zeman
Proceedings of The 12th Language Resources and Evaluation Conference
Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the universal guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.
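The morphological and syntactic layers described above are serialized in the 10-column CoNLL-U format shared by all UD treebanks. A minimal parser, with an illustrative two-token sentence of my own:

```python
# The 10 CoNLL-U columns, in order, as defined by the UD guidelines.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# A toy sentence: "Dogs bark", annotated in CoNLL-U (illustrative example).
SAMPLE = (
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
)

def parse_conllu(text):
    """Parse one sentence block into a list of token dicts,
    skipping blank lines and '#' comment lines."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        tokens.append(dict(zip(FIELDS, line.split("\t"))))
    return tokens
```

The `upos` column carries the universal part-of-speech tag, `feats` the standardized morphological features, and `head`/`deprel` the syntactic layer relating predicates, arguments and modifiers.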
Common Voice: A Massively-Multilingual Speech Corpus
Rosana Ardila | Megan Branson | Kelly Davis | Michael Kohler | Josh Meyer | Michael Henretty | Reuben Morais | Lindsay Saunders | Francis Tyers | Gregor Weber
Proceedings of The 12th Language Resources and Evaluation Conference
The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 there are a total of 38 languages collecting data. Over 50,000 individuals have participated so far, resulting in 2,500 hours of collected audio. To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages. As an example use case for Common Voice, we present speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit. By applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 ± 5.48 for twelve target languages (German, French, Italian, Turkish, Catalan, Slovenian, Welsh, Irish, Breton, Tatar, Chuvash, and Kabyle). For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.
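Transfer learning from an English model to a new language, as described above, is commonly done by keeping the acoustic layers and re-initializing only the output layer to match the target alphabet. A hedged toy sketch of that idea (layer shapes and alphabet sizes below are illustrative, not DeepSpeech's actual architecture):

```python
import random

def init_layer(rows, cols):
    """A toy weight matrix with small random values."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def transfer(source_model, target_alphabet_size, hidden=4):
    """Copy every layer from the source model unchanged, then replace
    the output layer with a fresh one sized for the target alphabet
    (+1 row for the CTC blank symbol)."""
    target = dict(source_model)  # shallow copy: shared acoustic layers
    target["output"] = init_layer(target_alphabet_size + 1, hidden)
    return target

# Toy source model; 26 letters + space + apostrophe is a plausible
# English character set, but the numbers here are illustrative.
english = {"hidden1": init_layer(4, 4), "output": init_layer(29, 4)}
german = transfer(english, 30)  # hypothetical 30-character target alphabet
```

Only the new output layer starts from scratch; the shared layers give the target language a head start, which is what drives the reported CER improvements.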
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages
Tommi A Pirinen | Francis M. Tyers | Michael Rießler
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages
SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection
Ekaterina Vylomova | Jennifer White | Elizabeth Salesky | Sabrina J. Mielke | Shijie Wu | Edoardo Maria Ponti | Rowan Hall Maudslay | Ran Zmigrod | Josef Valvoda | Svetlana Toldova | Francis Tyers | Elena Klyachko | Ilya Yegorov | Natalia Krizhanovsky | Paula Czarnowska | Irene Nikkarinen | Andrew Krizhanovsky | Tiago Pimentel | Lucas Torroba Hennigen | Christo Kirov | Garrett Nicolai | Adina Williams | Antonios Anastasopoulos | Hilaria Cruz | Eleanor Chodroff | Ryan Cotterell | Miikka Silfverberg | Mans Hulden
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems’ ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrated the utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems, with over 90% mean accuracy, while others were more challenging.
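The accuracy figures above are exact-match accuracies, the standard metric for inflection shared tasks: a predicted inflected form scores only if it matches the gold form character for character. A minimal sketch (the example forms are my own, not task data):

```python
def inflection_accuracy(gold, predicted):
    """Exact-match accuracy over paired gold and predicted inflected forms:
    a prediction counts only if it equals the gold form exactly."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Toy English plural-inflection examples (illustrative only).
gold = ["geese", "mice", "oxen"]
pred = ["geese", "mouses", "oxen"]
```

Mean accuracy per language family, as reported for Afro-Asiatic, Niger-Congo and Turkic, is simply this score averaged over each family's languages.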