Semi-supervised learning in automatic speech recognition

Mikołaj Pudo

supervisor: Artur Janicki



Training production-ready machine learning models usually requires large amounts of good-quality data. In many cases gathering the data is easy; preparing annotations, however, is costly and time-consuming, because this task still has to be done by human experts. Furthermore, a model should be trained on the same type of data that it will most commonly process during inference, and the end users of the models are the best source of such data. Manual annotation of user data, however, is not only time-consuming but may also constitute a privacy breach.

The above-mentioned problem appears in the field of Automatic Speech Recognition (ASR). Large amounts of good-quality speech data are available in the public domain, but they lack transcriptions. Since these databases contain thousands of hours of speech, transcribing them manually is very difficult. Semi-supervised learning (SSL) methods attempt to solve this issue. In this work we present selected SSL methods that can be applied to train ASR models. Our experiments show that even a limited amount of unlabeled data can improve the performance of the models. Adaptation with small datasets does not require large amounts of computational power, so performing such adaptations on the user's device becomes feasible. Consequently, this approach can eliminate the risk of a privacy breach, since the user data never need to leave the device.
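To make the idea concrete, the sketch below illustrates pseudo-labeling (self-training), one common SSL scheme: a seed model trained on labeled data predicts labels for unlabeled data, and only confident predictions are added to the training set before retraining. The abstract does not specify which SSL methods were used, so this is a generic illustration; the nearest-centroid "model", the 1-D features, and the confidence formula are stand-ins for a real acoustic model and its transcription hypotheses.

```python
# Hypothetical, minimal sketch of pseudo-labeling (self-training).
# A real ASR system would use an acoustic model and word-level
# transcriptions; here a nearest-centroid classifier on 1-D features
# stands in for both.

def train_centroids(samples):
    """Fit a nearest-centroid 'model': mean feature value per label."""
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Return (label, confidence); confidence decays with distance."""
    label = min(centroids, key=lambda y: abs(x - centroids[y]))
    return label, 1.0 / (1.0 + abs(x - centroids[label]))

def self_train(labeled, unlabeled, threshold=0.5):
    """One self-training round: pseudo-label confident unlabeled
    examples, then retrain on the enlarged training set."""
    model = train_centroids(labeled)
    pseudo = []
    for x in unlabeled:
        y, conf = predict(model, x)
        if conf >= threshold:  # keep only confident pseudo-labels
            pseudo.append((x, y))
    return train_centroids(labeled + pseudo)

labeled = [(0.0, "a"), (1.0, "a"), (10.0, "b")]
unlabeled = [0.5, 9.5, 5.0]  # 5.0 is ambiguous and gets filtered out
model = self_train(labeled, unlabeled)
```

The confidence threshold is the key knob: set too low, wrong pseudo-labels reinforce the model's errors; set too high, almost no unlabeled data is used. The same trade-off appears in full-scale ASR self-training.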