Piotr Czarnecki
Supervisor: Piotr Bilski
Real-time speaker separation and matching are crucial for applications such as video call enhancement, automatic subtitle localization, and spatial voice generation/panning. The presentation describes modifications that make a visually guided speaker separation model run in real time. The model extends real-time models known from speech enhancement by adding face processing, ultimately performing visually guided speaker separation. The common approach to speaker separation and matching is to detect candidate faces and then perform visually guided voice separation for each of them. Two methods are used to obtain the faces: applying a face detector to static video frames, or processing the audio-visual sequence for active speaker detection. The described model follows the face-detector approach. The system is lightweight, with 0.6M trainable parameters. To the author's knowledge, it is the first real-time system for visually guided speaker separation. It performs speaker separation near instantaneously, with a delay of a single input audio frame. From the application point of view, it is important that the model performs both tasks at the same time: speech separation and active speaker detection.
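The abstract does not give architectural details, so the following is only a minimal Python sketch of the described pipeline: candidate faces are embedded once, and a streaming separator then processes one audio frame at a time per face, matching the single-frame delay claimed above. All names (FaceEncoder, StreamingSeparator), layer choices, and dimensions are assumptions for illustration, not the presented model.

```python
import torch
import torch.nn as nn

class FaceEncoder(nn.Module):
    """Hypothetical lightweight face-embedding branch (stand-in for the
    model's face-processing component; architecture is assumed)."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, face_crop):          # face_crop: (B, 3, H, W)
        return self.net(face_crop)         # (B, embed_dim)

class StreamingSeparator(nn.Module):
    """Hypothetical frame-by-frame separator: a GRU over audio features,
    conditioned on the target speaker's face embedding. It emits one
    output frame per input frame, so latency is a single audio frame."""
    def __init__(self, n_freq=257, embed_dim=64, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_freq + embed_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, audio_frame, face_emb, state=None):
        # audio_frame: (B, 1, n_freq) magnitude spectrum of one STFT frame
        x = torch.cat([audio_frame, face_emb.unsqueeze(1)], dim=-1)
        h, state = self.rnn(x, state)
        # Predict a mask for this frame and carry the RNN state forward.
        return self.mask(h) * audio_frame, state

# Per-frame pipeline: detect candidate faces (detector not shown), embed
# each face once, then stream audio frames through a separator per face.
encoder, separator = FaceEncoder(), StreamingSeparator()
faces = [torch.randn(1, 3, 96, 96) for _ in range(2)]    # dummy face crops
embeddings = [encoder(f) for f in faces]
states = [None] * len(embeddings)
for _ in range(10):                                      # 10 audio frames
    frame = torch.randn(1, 1, 257)                       # dummy STFT frame
    for i, emb in enumerate(embeddings):
        separated_frame, states[i] = separator(frame, emb, states[i])
```

Note that the recurrent state is the only context carried between frames, which is what allows each output frame to be produced as soon as its input frame arrives; per-face masking energy could also serve as a simple active-speaker score, consistent with the model performing both tasks at once.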