Speech recognition in conditions of impaired acoustic signal transmission

Karolina Pondel-Sycz

supervisor: Piotr Bilski



The research concerns speech recognition under impaired acoustic signal transmission, primarily in telephone conversations, where the signal is subject to distortion and interference. The first step is to assess what kinds of distortion and interference are present in the tested signal. The next step is to select and prepare a suitable database, examine the signal, and propose methods for its repair. Then the appropriate ASR system architecture must be selected.

Currently, the most promising architecture for ASR systems is End-to-End (E2E). In these systems, the input audio signal is converted directly into the output result (a transcription) using deep neural networks. Recognition in an E2E system can be divided into three stages: encoding, which maps the input speech sequence to a feature sequence; aligning the feature sequence with the language; and decoding the final classification results. Because the E2E system is a single complete structure, it is often difficult to determine which part of it performs each of these subtasks. The networks map acoustic signals directly to label sequences without the need for intermediate states. The E2E model uses soft alignment: each audio frame corresponds to all possible states with a certain probability distribution, so no forced explicit alignment is required. In the field of E2E systems, there are three main models: Connectionist Temporal Classification (CTC), Attention-based Encoder-Decoder (AED), and the Recurrent Neural Network Transducer (RNN-T).
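The soft-alignment idea behind CTC can be illustrated with a minimal greedy-decoding sketch (an illustrative toy, not the system studied in this research): each audio frame carries a probability distribution over the label set, and the most probable per-frame path is collapsed by merging consecutive repeats and removing the blank symbol.

```python
# Toy sketch of greedy CTC decoding (hypothetical label set and probabilities).
# Each frame holds a probability distribution over labels (soft alignment);
# the best path is collapsed by merging repeats and dropping the blank.

BLANK = "-"  # the CTC blank symbol

def ctc_greedy_decode(frame_probs, labels):
    """Pick the most probable label per frame, then collapse the path."""
    # Best label per frame (a "hard" path through the soft alignment).
    path = [labels[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    # Merge consecutive repeats, then remove blanks.
    collapsed = []
    prev = None
    for symbol in path:
        if symbol != prev:
            collapsed.append(symbol)
        prev = symbol
    return "".join(s for s in collapsed if s != BLANK)

# Example: 6 frames, label set {blank, 'c', 'a', 't'} (made up for illustration).
labels = [BLANK, "c", "a", "t"]
frame_probs = [
    [0.10, 0.70, 0.10, 0.10],  # 'c'
    [0.10, 0.60, 0.20, 0.10],  # 'c' again -> collapsed with the previous frame
    [0.80, 0.10, 0.05, 0.05],  # blank separates segments
    [0.10, 0.10, 0.70, 0.10],  # 'a'
    [0.10, 0.10, 0.10, 0.70],  # 't'
    [0.90, 0.03, 0.03, 0.04],  # blank
]
print(ctc_greedy_decode(frame_probs, labels))  # prints "cat"
```

This collapse rule is what lets CTC avoid forced explicit alignment: many different frame-level paths (with repeats and blanks in different places) map to the same output label sequence.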