Dataset Distillation for Audio-Visual Tasks
Dataset distillation aims to create a condensed dataset that retains the essential information of the full training set. While recent dataset distillation techniques have shown remarkable performance on image datasets, their potential in other domains remains largely unexplored. We extend this concept to the audio-visual domain and introduce audio-visual dataset distillation: the task of creating smaller yet representative synthetic datasets that preserve the cross-modal semantic associations between the audio and visual modalities. To tackle this task, we extend the Distribution Matching approach with additional cross-modal alignment losses. Comprehensive experiments on recognition and cross-modal retrieval tasks demonstrate that our distilled data is representative and maintains effective audio-visual alignment.
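For readers unfamiliar with the idea, the following is a minimal sketch of what a distribution-matching objective with a cross-modal term could look like. It is not the speaker's implementation: the random ReLU embedding, the function names (`embed`, `mean_match`, `av_distillation_loss`), the weight `lam`, and in particular the form of the cross-modal term (matching the mean element-wise product of paired audio and visual embeddings) are all assumptions chosen for illustration.

```python
import numpy as np

def embed(x, w):
    """Random-feature embedding: linear projection followed by ReLU."""
    return np.maximum(x @ w, 0.0)

def mean_match(real, syn):
    """Squared distance between the mean embeddings of two sets."""
    diff = real.mean(axis=0) - syn.mean(axis=0)
    return float(diff @ diff)

def av_distillation_loss(real_a, real_v, syn_a, syn_v, w_a, w_v, lam=1.0):
    """Per-modality distribution matching plus an assumed cross-modal term."""
    ra, sa = embed(real_a, w_a), embed(syn_a, w_a)  # audio embeddings
    rv, sv = embed(real_v, w_v), embed(syn_v, w_v)  # visual embeddings
    # Match the mean embedding of real and synthetic data in each modality.
    unimodal = mean_match(ra, sa) + mean_match(rv, sv)
    # Hypothetical alignment term: match the mean joint (element-wise
    # product) embedding of paired audio-visual samples.
    cross = mean_match(ra * rv, sa * sv)
    return unimodal + lam * cross

# Toy data: 100 real pairs, 10 synthetic pairs (all shapes are illustrative).
rng = np.random.default_rng(0)
real_a, real_v = rng.normal(size=(100, 16)), rng.normal(size=(100, 32))
syn_a, syn_v = rng.normal(size=(10, 16)), rng.normal(size=(10, 32))
w_a, w_v = rng.normal(size=(16, 8)), rng.normal(size=(32, 8))

loss = av_distillation_loss(real_a, real_v, syn_a, syn_v, w_a, w_v)
print(f"loss: {loss:.4f}")
```

In an actual distillation loop the synthetic samples would be treated as learnable parameters and updated by gradient descent on a loss of this kind, typically with a freshly sampled embedding at each step; that optimization is omitted here.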
NOTE: this event is part of the 2024 DL4MIR workshop series (ccrma-mir.github.io); guest speaker talks are open to the broader CCRMA community.