Piotr Czarnecki
supervisor: Piotr Bilski
Real-time speaker separation and localization are crucial for applications such as video call enhancement, automatic subtitle localization, and spatial voice generation/panning. The common approach to speaker localization and separation is to detect candidate faces and then perform visually guided voice separation for each of them. Two methods are used for face detection: running a face detector on static video frames, or processing the audio-visual sequence for active speaker detection. In this work, crucial improvements are proposed for the visually guided speaker separation model to make it run in real time. The described model follows the face-detector approach. It extends real-time models known from speech enhancement with face processing in order to perform speaker separation. The system is lightweight, with 0.6M trainable parameters, and performs speaker separation near-instantaneously, with a delay of a single input audio frame. To my knowledge, it is the first real-time system for visually guided speaker separation. From the application point of view, it is important that the model performs both tasks at the same time: speech separation and active speaker localization.
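To make the described pipeline concrete, below is a minimal PyTorch sketch of a causal, visually guided separator, assuming an STFT front end and a per-face conditioning embedding. All module names, feature sizes, and the single-frame streaming interface are hypothetical illustrations, not the paper's actual code.

```python
# Minimal sketch: face-conditioned, frame-by-frame speaker separation.
# Assumed (hypothetical) design: n_fft=512 STFT, 96x96 grayscale face crops.
import torch
import torch.nn as nn

class VisualGuidedSeparator(nn.Module):
    """Causal speaker separation conditioned on a detected face crop."""

    def __init__(self, n_fft: int = 512, vis_dim: int = 64, hid: int = 128):
        super().__init__()
        n_bins = n_fft // 2 + 1
        # Visual branch: embeds the face crop returned by the face detector.
        self.face_encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, vis_dim),
        )
        # Audio branch: a unidirectional GRU keeps the model causal, so each
        # output frame depends only on the current and past input frames.
        self.rnn = nn.GRU(n_bins + vis_dim, hid, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hid, n_bins), nn.Sigmoid())

    def forward(self, spec_frame, face_crop, state=None):
        # spec_frame: (B, 1, n_bins) magnitude spectrum of one audio frame.
        # face_crop:  (B, 1, 96, 96) crop from the face detector.
        v = self.face_encoder(face_crop).unsqueeze(1)   # (B, 1, vis_dim)
        x = torch.cat([spec_frame, v], dim=-1)
        h, state = self.rnn(x, state)                   # single-frame step
        mask = self.mask_head(h)                        # (B, 1, n_bins)
        return mask * spec_frame, state                 # separated frame
```

In this sketch, the unidirectional recurrence with carried hidden state is what yields the single-audio-frame latency claimed above: the stream is processed one frame at a time, and running the model once per detected face produces a separated stream per speaker, which in turn provides the active speaker localization signal.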