Artem Sokolov, Swapnil Bhosale, Xiatian Zhu
This repository is the official implementation of "3D Audio-Visual Segmentation". In this paper, we introduce a novel research problem, 3D Audio-Visual Segmentation, extending existing audio-visual segmentation (AVS) to the 3D output space. To facilitate this research, we create the first simulation-based benchmark, 3DAVS-S34-O7, providing photorealistic 3D scene environments with grounded spatial audio under single-instance and multi-instance settings, across 34 scenes and 7 object categories. We then propose a new approach, EchoSegnet, which synergistically integrates the ready-to-use knowledge of pretrained 2D audio-visual foundation models with 3D visual scene representations through spatial audio-aware mask alignment and refinement.
- Data & Code coming soon!
If you find our project useful, please cite it using the following BibTeX entry:
@inproceedings{sokolov20243daudiovisualsegmentation,
  title     = {3D Audio-Visual Segmentation},
  author    = {Sokolov, Artem and Bhosale, Swapnil and Zhu, Xiatian},
  booktitle = {Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation},
  year      = {2024}
}
For feedback or questions, please contact Artem Sokolov.