diff --git a/papers/imagebind-unified-embeddings.md b/papers/imagebind-unified-embeddings.md new file mode 100644 index 00000000..97a27aa8 --- /dev/null +++ b/papers/imagebind-unified-embeddings.md @@ -0,0 +1,35 @@ +# ImageBind: One Embedding Space To Bind Them All + +## Overview +- **Authors:** Rohit Girdhar, Alaaeldin El-Nouby, et al. (Meta AI) +- **Year:** 2023 +- **Links:** [Paper](https://arxiv.org/abs/2305.05665) | [GitHub](https://github.com/facebookresearch/ImageBind) | [Project Page](https://imagebind.metademolab.com/) + +## Key Contributions +ImageBind presents a groundbreaking approach to unified multimodal embeddings by: +- Creating a single embedding space for six modalities (images, text, audio, depth, thermal, IMU) +- Achieving cross-modal binding using only image-paired data +- Enabling zero-shot transfer across modalities without explicit paired training +- Demonstrating emergent capabilities in cross-modal retrieval and arithmetic + +## Technical Implementation +```python +import torch +from imagebind import data +from imagebind.models import imagebind_model + +# Initialize model +model = imagebind_model.imagebind_huge(pretrained=True) +model.eval() +model.to("cuda") + +# Prepare multimodal inputs +inputs = { + "image": data.load_and_transform_vision_data(["image.jpg"]), + "text": data.load_and_transform_text(["A dog playing"]), + "audio": data.load_and_transform_audio_data(["audio.wav"]) +} + +# Generate embeddings +with torch.no_grad(): + embeddings = model(inputs)