Emre Yilmaz


2020

pdf bib
Semi-supervised Acoustic Modelling for Five-lingual Code-switched ASR using Automatically-segmented Soap Opera Speech
Nick Wilkinson | Astik Biswas | Emre Yilmaz | Febe De Wet | Ewald Van der westhuizen | Thomas Niesler
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

This paper considers the impact of automatic segmentation on the fully-automatic, semi-supervised training of automatic speech recog-nition (ASR) systems for five-lingual code-switched (CS) speech. Four automatic segmentation techniques were evaluated in terms ofthe recognition performance of an ASR system trained on the resulting segments in a semi-supervised manner. For comparative purposesa semi-supervised syste Three of these use a newly proposed convolutional neural network (CNN) model for framewise classification,and include a novel form of HMM smoothing of the CNN outputs. Automatic segmentation was applied in combination with automaticspeaker diarization. The best-performing segmentation technique was also evaluated without speaker diarization. An evaluation basedon 248 unsegmented soap opera episodes indicated that voice activity detection (VAD) based on a CNN followed by Gaussian mixturemodel-hidden Markov model smoothing (CNN-GMM-HMM) yields the best ASR performance. The semi-supervised system trainedwith the best automatic segmentation achieved an overall WER improvement of 1.1% absolute over a semi-supervised system trainedwith manually created segments. Furthermore, we found that recognition rates improved even further when the automatic segmentationwas used in conjunction with speaker diarization.

pdf bib
Semi-supervised Development of ASR Systems for Multilingual Code-switched Speech in Under-resourced Languages
Astik Biswas | Emre Yilmaz | Febe De Wet | Ewald Van der westhuizen | Thomas Niesler
Proceedings of The 12th Language Resources and Evaluation Conference

This paper reports on the semi-supervised development of acoustic and language models for under-resourced, code-switched speech in five South African languages. Two approaches are considered. The first constructs four separate bilingual automatic speech recognisers (ASRs) corresponding to four different language pairs between which speakers switch frequently. The second uses a single, unified, five-lingual ASR system that represents all the languages (English, isiZulu, isiXhosa, Setswana and Sesotho). We evaluate the effectiveness of these two approaches when used to add additional data to our extremely sparse training sets. Results indicate that batch-wise semi-supervised training yields better results than a non-batch-wise approach. Furthermore, while the separate bilingual systems achieved better recognition performance than the unified system, they benefited more from pseudolabels generated by the five-lingual system than from those generated by the bilingual systems.