Looking to listen: Audio separation and localization with visual clues

Piotr Czarnecki

supervisor: Przemysław Biecek



The presentation at IDSS covers the current progress of research on building a model able to localize and separate sounds in unconstrained videos. Unconstrained videos are those available on services like YouTube, mostly created by amateurs rather than professionally produced video content. Such content usually contains raw audio (typically mono, with unwanted noise such as wind or street sounds), or raw audio replaced by voice recorded with lapel microphones and mixed into a single channel when there are multiple speakers. The ultimate goal is to separate the sounds of all sound-producing objects visible on screen and, additionally, to separate all sounds coming from outside the visual scene. The separated sounds of on-screen objects should be properly localized on the screen.

One goal of the research is to localize and separate voices (human sounds) in order to enhance the audio by adding spatial information correlated with the video. For example, if there are two speakers, each on a different side of the screen, the voice of the person on the right should be audible in the right channel and the voice of the person on the left in the left channel. This feature improves the listening experience, especially when listening on headphones or watching content on large screens.
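As a rough illustration of this spatialization step, the sketch below pans an already separated mono voice track into a stereo field according to the speaker's horizontal position on screen. The function name, the 0-to-1 position convention, and the constant-power panning law are assumptions made for illustration, not details of the presented model.

```python
import numpy as np

def pan_voice(mono_voice: np.ndarray, x_norm: float) -> np.ndarray:
    """Place a separated mono voice track in the stereo field.

    mono_voice : 1-D array of audio samples for one separated speaker.
    x_norm     : horizontal position of the speaker on screen,
                 0.0 = left edge, 1.0 = right edge (assumed convention).
    Returns a (num_samples, 2) stereo array.
    """
    # Constant-power panning: map screen position to an angle in [0, pi/2]
    theta = x_norm * np.pi / 2
    left_gain = np.cos(theta)
    right_gain = np.sin(theta)
    return np.stack([mono_voice * left_gain, mono_voice * right_gain], axis=1)

# Example: two separated speakers, one near the left edge, one near the right
fs = 16_000
t = np.arange(fs) / fs
speaker_a = 0.1 * np.sin(2 * np.pi * 220 * t)  # placeholder for a separated voice
speaker_b = 0.1 * np.sin(2 * np.pi * 330 * t)  # placeholder for a separated voice
stereo_mix = pan_voice(speaker_a, 0.1) + pan_voice(speaker_b, 0.9)
```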

Current research focuses on improving the quality and reducing the processing complexity of the model for voice processing. The presentation reviews existing approaches to audio-visual processing for voice source separation and localization. There are two main approaches to model training: one based on audio-visual synchronization (or simpler correspondence), the other based on voice separation; a typical shared architecture is sketched below. Both approaches usually rely on a similar model architecture, which is also planned to be briefly presented. As a summary of the presentation, sample results from the author's own model are planned to be shown.
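To make the shared architecture concrete, the following sketch outlines a typical audio-visual separation model of this kind: an audio encoder over the mixture spectrogram, a visual encoder over per-frame visual features, a fusion stage, and a time-frequency mask head. All layer sizes, module names, and the assumption that visual features are temporally aligned with spectrogram frames are illustrative choices, not details of the presented model.

```python
import torch
import torch.nn as nn

class AudioVisualSeparator(nn.Module):
    """Minimal sketch of the backbone shared by both training approaches:
    an audio stream, a visual stream, fusion, and a per-speaker mask head."""

    def __init__(self, freq_bins=257, visual_dim=512, hidden=256):
        super().__init__()
        # Audio stream: encode the magnitude spectrogram of the mixture
        self.audio_enc = nn.Sequential(
            nn.Linear(freq_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Visual stream: encode per-frame visual features (e.g. face embeddings)
        self.visual_enc = nn.Sequential(
            nn.Linear(visual_dim, hidden), nn.ReLU(),
        )
        # Fusion and temporal modelling of the joint representation
        self.fusion = nn.LSTM(2 * hidden, hidden, batch_first=True)
        # Mask head: one time-frequency mask for the target speaker
        self.mask_head = nn.Sequential(nn.Linear(hidden, freq_bins), nn.Sigmoid())

    def forward(self, mix_spec, visual_feats):
        # mix_spec: (batch, time, freq_bins); visual_feats: (batch, time, visual_dim)
        a = self.audio_enc(mix_spec)
        v = self.visual_enc(visual_feats)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        mask = self.mask_head(fused)
        # Separation-style training: apply the predicted mask to the mixture
        return mask * mix_spec

model = AudioVisualSeparator()
mix = torch.randn(2, 100, 257).abs()   # dummy mixture magnitude spectrogram
faces = torch.randn(2, 100, 512)       # dummy pre-extracted visual features
est = model(mix, faces)                # estimated target-speaker spectrogram
```

Under the synchronization/correspondence approach, a backbone of this kind would instead be trained with a classification-style objective on matched versus mismatched audio-visual pairs, while the mask head is specific to separation-style training.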