Jie Yin, Andrew Luo, Yilun Du, Anoop Cherian, Tim K. Marks, Jonathan Le Roux, Chuang Gan
📝 [Paper] ➡️ [Dataset] ⭐️ [Presentation Video]
Note
This website provides an overview of our work "Disentangled Acoustic Fields For Multimodal Physical Scene Understanding," accepted to IROS 2024. For further details about this work, please see the Preprint.
We study the problem of multimodal physical scene understanding, where an embodied agent must find fallen objects by inferring object properties as well as the direction and distance of an impact sound source. Previous works adopt feed-forward neural networks to directly regress these variables from sound, leading to poor generalization and domain-adaptation issues. In this paper, we show that learning a disentangled model of acoustic formation, referred to as a disentangled acoustic field (DAF), to capture the sound generation and propagation process enables the embodied agent to construct a spatial uncertainty map over where the objects may have fallen. We demonstrate that our analysis-by-synthesis framework can jointly infer sound properties by explicitly decomposing and factorizing the latent space of the disentangled model. We further show that the spatial uncertainty map can significantly improve the success rate of localizing fallen objects by proposing multiple plausible exploration locations.
In the DAF pipeline, the encoder and decoder are trained jointly and work together in an analysis-by-synthesis loop to infer object properties. For navigation with DAF, the planner uses the uncertainty map generated by DAF to prioritize candidate locations, helping the agent efficiently locate the fallen object.
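To make the analysis-by-synthesis idea above concrete, here is a minimal sketch (not the authors' released code) of how an encoder could map an impact sound to explicitly factorized latent variables (object category, distance, direction) while a decoder re-synthesizes the sound from those factors, with both trained jointly. All module names, dimensions, and loss weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DAFEncoder(nn.Module):
    """Map a sound spectrogram to separate latent factors (illustrative)."""
    def __init__(self, n_mels=64, n_frames=128, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * n_frames, 256), nn.ReLU(),
        )
        # Separate heads keep the latent factors explicitly factorized.
        self.category_head = nn.Linear(256, n_classes)   # object type (logits)
        self.distance_head = nn.Linear(256, 1)           # scalar distance
        self.direction_head = nn.Linear(256, 2)          # direction as (cos, sin)

    def forward(self, spec):
        h = self.backbone(spec)
        return self.category_head(h), self.distance_head(h), self.direction_head(h)

class DAFDecoder(nn.Module):
    """Re-synthesize a spectrogram from the disentangled factors (illustrative)."""
    def __init__(self, n_mels=64, n_frames=128, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_classes + 1 + 2, 256), nn.ReLU(),
            nn.Linear(256, n_mels * n_frames),
        )
        self.n_mels, self.n_frames = n_mels, n_frames

    def forward(self, category_logits, distance, direction):
        z = torch.cat([category_logits.softmax(-1), distance, direction], dim=-1)
        return self.net(z).view(-1, self.n_mels, self.n_frames)

encoder, decoder = DAFEncoder(), DAFDecoder()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

# One joint training step on a dummy batch (random tensors stand in for real sounds).
spec = torch.randn(8, 64, 128)                 # batch of impact-sound spectrograms
gt_category = torch.randint(0, 10, (8,))       # ground-truth object categories
gt_distance = torch.rand(8, 1) * 5.0           # ground-truth source distances

cat_logits, dist, direc = encoder(spec)
recon = decoder(cat_logits, dist, direc)
loss = (nn.functional.mse_loss(recon, spec)                     # analysis-by-synthesis term
        + nn.functional.cross_entropy(cat_logits, gt_category)  # supervision on factors
        + nn.functional.mse_loss(dist, gt_distance))
opt.zero_grad(); loss.backward(); opt.step()
```

Keeping separate heads for each factor is one simple way to encourage the decomposition of the latent space described above; the paper should be consulted for the actual architecture and training objectives.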
We conducted several experiments to evaluate DAF’s performance in inferring object properties. Results on the ObjectFolder2, Find Fallen Challenge, and Real-Impact datasets show significant improvements over baseline methods in predicting object properties.
Next, we explored DAF’s potential in navigation and planning. Our ablation studies demonstrate that incorporating the uncertainty map significantly improves the agent’s efficiency in searching for objects, reducing both path length and the number of actions taken. Moreover, our method generalizes well across different room types and layouts.
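As a rough illustration of how an uncertainty map can drive exploration, the sketch below (a NumPy toy example, not the paper's actual planner) ranks grid cells of a 2-D spatial uncertainty map and returns the most likely fallen-object locations as waypoints for the agent to visit first. The grid size, cell resolution, and top-k choice are assumptions made for the example.

```python
import numpy as np

def propose_waypoints(uncertainty_map: np.ndarray, cell_size: float = 0.5, k: int = 3):
    """Return the k highest-likelihood grid cells as (x, y) positions in meters."""
    flat_idx = np.argsort(uncertainty_map, axis=None)[::-1][:k]   # indices of top-k cells
    rows, cols = np.unravel_index(flat_idx, uncertainty_map.shape)
    return [(c * cell_size, r * cell_size) for r, c in zip(rows, cols)]

# Example: a 10 x 10 map over a 5 m x 5 m room with two plausible impact regions.
grid = np.zeros((10, 10))
grid[2, 7] = 0.9   # strong peak
grid[6, 1] = 0.6   # secondary peak
print(propose_waypoints(grid))   # the agent visits the most likely locations first
```

Visiting multiple plausible locations in order of likelihood is one way such a map can shorten search paths, consistent with the ablation results described above.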
- This paper proposes an egocentric disentangled acoustic field framework that generalizes and reasons across various scenes by decomposing sound properties in the latent space.
- The method improves object localization success rates and generates multimodal uncertainty maps, demonstrating its potential for sound localization and understanding in complex environments.
If you use this work in your research, please cite:
@article{yin2024disentangled,
  title={Disentangled Acoustic Fields For Multimodal Physical Scene Understanding},
  author={Yin, Jie and Luo, Andrew and Du, Yilun and Cherian, Anoop and Marks, Tim K and Roux, Jonathan Le and Gan, Chuang},
  journal={arXiv preprint arXiv:2407.11333},
  year={2024}
}
We thank MERL for funding support and the MIT-IBM AI Lab for computational resources.