2020
Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo and Wolaytta
Solomon Teferra Abate | Martha Yifiru Tachbelie | Michael Melese | Hafte Abera | Tewodros Abebe | Wondwossen Mulugeta | Yaregal Assabie | Million Meshesha | Solomon Afnafu | Binyam Ephrem Seyoum
Proceedings of The 12th Language Resources and Evaluation Conference
Automatic Speech Recognition (ASR) is one of the most important technologies to support spoken communication in modern life. However, its development benefits from a large speech corpus. The development of such a corpus is expensive, and most human languages, including the Ethiopian languages, do not have such resources. To address this problem, we have developed four large (about 22 hours) speech corpora for four Ethiopian languages: Amharic, Tigrigna, Oromo and Wolaytta. To assess the usability of the corpora for speech processing, we have developed ASR systems for each language. In this paper, we present the corpora and the baseline ASR systems we have developed. We have achieved word error rates (WERs) of 37.65%, 31.03%, 38.02% and 33.89% for Amharic, Tigrigna, Oromo and Wolaytta, respectively. These results show that the corpora are suitable for further investigation towards the development of ASR systems. Thus, the research community can use the corpora to further improve speech processing systems. From our results, it is clear that the collection of text corpora to train strong language models for all of the languages is still required, especially for Oromo and Wolaytta.
Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo, and Wolaytta
Solomon Teferra Abate | Martha Yifiru Tachbelie | Michael Melese | Hafte Abera | Tewodros Gebreselassie | Wondwossen Mulugeta | Yaregal Assabie | Million Meshesha Beyene | Solomon Atinafu | Binyam Ephrem Seyoum
Proceedings of the Fourth Widening Natural Language Processing Workshop
Automatic Speech Recognition (ASR) is one of the most important technologies to help people live a better life in the 21st century. However, its development requires a large speech corpus for each language. The development of such a corpus is expensive, especially for under-resourced Ethiopian languages. To address this problem, we have developed four medium-sized (longer than 22 hours each) speech corpora for four Ethiopian languages: Amharic, Tigrigna, Oromo, and Wolaytta. To check the usability of the corpora and deliver a baseline, we have developed an ASR system for each language. In this paper, we present the corpora and the baseline ASR systems. The word error rates (WERs) we achieved show that the corpora are usable for further investigation, and we recommend the collection of text corpora to train strong language models, particularly for Oromo and Wolaytta.
Tigrinya Automatic Speech recognition with Morpheme based recognition units
Hafte Abera | Sebsibe Hailemariam
Proceedings of the Fourth Widening Natural Language Processing Workshop
The Tigrinya language is agglutinative and has a large number of inflected and derived word forms. Therefore, a Tigrinya large vocabulary continuous speech recognition system often has a large number of distinct units and a high out-of-vocabulary (OOV) rate if the word is used as the recognition unit of the language model (LM) and lexicon. A morpheme-based approach, in which the morpheme is the recognition unit, has therefore often been used to reduce the high OOV rate. This paper presents an automatic speech recognition experiment conducted to see the effect of OOV words on the performance of a speech recognition system for Tigrinya. We tried to solve the OOV problem by using morphemes as lexicon and language model units. It has been found that morphemes are better lexical and language modeling units than words. An absolute improvement in word recognition accuracy of 3.45 for tokens and 8.36 for types has been obtained as a result of using a morph-based vocabulary.
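The abstract above contrasts OOV rates for word units and morpheme units, counted over both tokens and types. The following is a minimal sketch of how those two rates could be computed; it is not the authors' implementation. The function name `oov_rates` and the toy data are hypothetical, and the English-like segmentation (`+ed`, `+ing`) is an illustrative stand-in for a real Tigrinya morphological segmenter.

```python
def oov_rates(train_units, test_units):
    """Return (token OOV rate, type OOV rate) of test_units.

    A unit is out-of-vocabulary if it never occurs in train_units.
    The token rate counts every occurrence in the test data; the type
    rate counts each distinct test unit once.
    """
    if not test_units:
        raise ValueError("test_units must not be empty")
    vocab = set(train_units)
    oov_token_count = sum(1 for unit in test_units if unit not in vocab)
    test_types = set(test_units)
    oov_type_count = len(test_types - vocab)
    return oov_token_count / len(test_units), oov_type_count / len(test_types)


# Hypothetical toy data: the same text as whole words and as morphemes.
word_train = ["walk", "walked", "talk"]
word_test = ["walked", "talked", "talking", "talked"]

morph_train = ["walk", "walk", "+ed", "talk"]
morph_test = ["walk", "+ed", "talk", "+ed", "talk", "+ing", "talk", "+ed"]

word_token_oov, word_type_oov = oov_rates(word_train, word_test)
morph_token_oov, morph_type_oov = oov_rates(morph_train, morph_test)
```

On this toy data the morpheme units give lower OOV rates than the word units, because unseen word forms such as "talked" decompose into units that were already seen in training. Real figures depend entirely on the corpus and the segmenter used.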