SARChat-Bench-2M is the first large-scale multimodal dialogue dataset focused on Synthetic Aperture Radar (SAR) imagery. It contains approximately 2 million high-quality SAR image-text pairs and supports multiple tasks, including scene classification, image captioning, visual question answering, and object localization. We conducted comprehensive evaluations of 16 state-of-the-art vision-language models (including Qwen2VL, InternVL2.5, and LLaVA), establishing the first multi-task benchmark in the SAR domain.
📑 Read more about SARChat in our paper.
Figure 1: Overview of SARChat's architecture (left) and comprehensive evaluation results showing model capabilities across different tasks (right)
Figure 2: Data processing workflow of SARChat
- 🌟 2M+ high-quality SAR image-text pairs
- 🔍 Covers diverse scenes, including marine, terrestrial, and urban areas
- 📊 6 task-specific benchmarks with fine-grained annotations
- 🤖 Evaluated on 16 SOTA vision-language models
- 🛠️ Ready-to-use format with shape, count, and location labels (see the illustrative record sketch below)
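To make the annotation format concrete, here is a minimal sketch of what a single SARChat-style record might contain. The field names, box format, and values are illustrative assumptions, not the official schema; consult the dataset page for the exact fields.

```python
# Hypothetical SARChat-style record (field names and values are illustrative only)
sample = {
    "image": "ship_000123.png",             # SAR image chip
    "task": "Spatial Grounding",            # one of the six task types listed below
    "question": "Where is the ship located in the image?",
    "answer": "One ship is visible in the upper-left region.",
    "labels": {
        "category": "ship",                 # object class
        "count": 1,                         # number of instances
        "bbox": [12, 34, 88, 120],          # assumed (x1, y1, x2, y2) pixel coordinates
    },
}
```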
Figure 3: Distribution of tasks in training (left) and test (right) sets
| Task | Train Set | Test Set |
|---|---|---|
| Classification | 81,788 | 10,024 |
| Fine-Grained Description | 46,141 | 6,032 |
| Instance Counting | 95,493 | 11,704 |
| Spatial Grounding | 94,456 | 11,608 |
| Cross-Modal Identification | 1,423,548 | 175,565 |
| Referring | 95,486 | 11,703 |
Figure 4: Category distribution in training (left) and test (right) sets
| Metric | Value |
|---|---|
| Total Words | 43,978,559 |
| Total Sentences | 4,222,143 |
| Average Caption Length (words) | 10.66 |
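For reference, statistics like those above can be reproduced with simple text processing; the sketch below uses naive whitespace and period splitting, so the official figures may differ slightly depending on the tokenization used.

```python
# Minimal sketch: corpus statistics from a list of captions (naive splitting, for illustration)
captions = [
    "A ship is moored near the harbor.",
    "Two aircraft are parked on the tarmac.",
]

total_words = sum(len(c.split()) for c in captions)            # word count via whitespace split
total_sentences = sum(max(c.count("."), 1) for c in captions)  # crude sentence count
avg_caption_length = total_words / len(captions)               # average words per caption

print(total_words, total_sentences, round(avg_caption_length, 2))
```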
🤗 Visit our Hugging Face dataset page for more details and examples.
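A typical way to start exploring the data is through the Hugging Face `datasets` library. The dataset ID below is a placeholder; substitute the actual repository name from the page linked above, and note that the split and column names are assumptions.

```python
from datasets import load_dataset

# Placeholder dataset ID -- replace with the actual repo from the Hugging Face page above.
ds = load_dataset("<org>/SARChat-Bench-2M", split="train")

print(ds)     # column names and number of rows
print(ds[0])  # inspect the first image-text pair
```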
Figure 6: Example results from SARChat-InternVL2.5-8B model on various SAR vision-language tasks
The figure above demonstrates the capabilities of our SARChat-InternVL2.5-8B model across different tasks. The model performs well on complex SAR imagery, providing detailed descriptions, accurate instance counts, and precise spatial reasoning. These results highlight its ability to bridge the gap between SAR imagery and natural language understanding.
We have trained and evaluated several models using the SARChat dataset:
| Organization | Model | Size | Link |
|---|---|---|---|
| InternVL | SARChat-InternVL2.5 | 1B | Link |
| InternVL | SARChat-InternVL2.5 | 2B | Link |
| InternVL | SARChat-InternVL2.5 | 4B | Link |
| InternVL | SARChat-InternVL2.5 | 8B | Link |
| QwenVL | SARChat-Qwen2VL | 2B | Link |
| QwenVL | SARChat-Qwen2VL | 7B | Link |
| DeepSeek | SARChat-DeepSeekVL | 1.3B | Link |
| DeepSeek | SARChat-DeepSeekVL | 7B | Link |
| mPLUG-Owl | SARChat-Owl3 | 1B | Link |
| mPLUG-Owl | SARChat-Owl3 | 2B | Link |
| mPLUG-Owl | SARChat-Owl3 | 7B | Link |
| Microsoft | SARChat-Phi3V | 4.3B | Link |
| Zhipu AI | SARChat-GLM-Edge | 2B | Link |
| Zhipu AI | SARChat-GLM-Edge | 5B | Link |
| LLaVA-Team | SARChat-LLaVA-1.5 | 7B | Link |
| 01.AI | SARChat-Yi-VL | 6B | Link |
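As a quick-start sketch, the Qwen2-VL-based checkpoints can in principle be run with the standard Hugging Face `transformers` API for Qwen2-VL, as shown below. The checkpoint ID, image path, and prompt are placeholders (use the actual links from the table above); this is an illustrative sketch under those assumptions, not the official inference script.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Placeholder checkpoint ID -- replace with the actual SARChat-Qwen2VL link from the table above.
model_id = "<org>/SARChat-Qwen2VL-2B"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Build a single-turn conversation with one SAR image and a question.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe the ships visible in this SAR image."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

image = Image.open("example_sar_chip.png")  # placeholder image path
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(answer)
```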
If you use this dataset or our models in your research, please cite our paper.
@inproceedings{Ma2025SARChatBench2MAM,
  title={SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation},
  author={Zhiming Ma and Xiayang Xiao and Sihao Dong and Peidong Wang and HaiPeng Wang and Qingyun Pan},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:276287423}
}
For any questions or feedback, please contact:
- 📧 Email: [email protected]
- 💬 GitHub Issues: Feel free to open an issue in this repository
If you find SARChat useful, please consider giving it a star ⭐