4D Audio-Visual Learning: A Visual Perspective of Sound Propagation and Production
Humans perceive the world through multiple modalities, including vision, sound, touch, and smell. Among them, vision and sound are two of the most important, and they naturally co-occur. Recent work has explored this natural correspondence between sight and sound, but it is mainly object-centric, i.e., it focuses on the semantic relations between objects and the sounds they make. While this work is exciting, the correspondence with the surrounding 3D space is often overlooked. For example, we hear the same sound differently in different environments, or even at different locations in the same environment. In this talk, I present 4D audio-visual learning, which learns the correspondence between sight and sound in spaces, providing a visual perspective of sound propagation and sound production. More specifically, I focus on four topics in this direction: simulating sounds in spaces, navigating with sounds in spaces, synthesizing sounds in spaces, and learning action sounds in spaces. Throughout these topics, I use vision as the main bridge between audio and scene understanding, and I show promising results in building fundamental simulation platforms, enabling multimodal embodied navigation, providing faithful multimodal synthesis in 3D environments, and learning how actions sound from in-the-wild egocentric videos. I show results on real videos and in real-world environments, as well as in simulation. In the last part of my talk, I will discuss open research directions for 4D audio-visual learning.
NOTE: this event is part of the DL4MIR workshop series (ccrma-mir.github.io); guest speaker talks are open to the broader CCRMA community.