Please install ProBench as follows:
```bash
git clone https://github.com/Yan98/ProBench_eval
cd ProBench_eval
pip install -e .
```
We encourage users to customize models in `gen_answer_vllm.py`. Currently, we provide examples for Pixtral-12B-2409 and Qwen/Qwen2-VL-7B-Instruct.
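As a rough sketch of what a custom vLLM-backed model looks like, the snippet below loads a checkpoint and answers one multimodal query. It is not the actual `gen_answer_vllm.py` interface; the message layout and image URL are illustrative placeholders.

```python
# Minimal vLLM generation sketch -- adapt to gen_answer_vllm.py's
# model handling rather than using this verbatim.
from vllm import LLM, SamplingParams

# Any vLLM-supported multimodal checkpoint should work here.
llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")
params = SamplingParams(temperature=0.0, max_tokens=1024)

# vLLM's chat interface accepts OpenAI-style multimodal messages;
# the image URL below is a placeholder for a ProBench query image.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the key finding in this figure."},
        {"type": "image_url", "image_url": {"url": "https://example.com/query.png"}},
    ],
}]
outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)
```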
- Generating MLLM outputs:

```bash
python3 gen_answer_vllm.py --model Pixtral-12B-2409 --save-name Pixtral
```
- Running judgements. Configure GPT-4o as the evaluation judge:

```bash
export base_url=YOUR_BASE_URL
export api_key=YOUR_API_KEY
python3 gen_judgement.py --model Pixtral-12B-2409 --model-answer-file output/Pixtral.jsonl --judge_model gpt-4o-2024-08-06 --num_workers 64
```
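Under the hood, judging amounts to calls against an OpenAI-compatible endpoint. The sketch below shows roughly what one such request looks like using the environment variables exported above; the system prompt and message contents are placeholders, not ProBench's actual judging template.

```python
import os
from openai import OpenAI

# Reuses the base_url/api_key exported above; the judge prompt here
# is a placeholder, not the template gen_judgement.py actually uses.
client = OpenAI(base_url=os.environ["base_url"], api_key=os.environ["api_key"])

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are an impartial judge of model answers."},
        {"role": "user", "content": "Question: ...\nAnswer: ..."},
    ],
)
print(response.choices[0].message.content)
```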
- Displaying results:

```bash
for track in singleround multi-round multi-linguistic
do
    python3 show_result.py --model Pixtral-12B-2409 --model-answer-file output/Pixtral.jsonl --judgement-file output/Pixtral --track $track
done
```
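If you prefer to post-process the raw judgements yourself instead of going through `show_result.py`, they are plain JSONL. The sketch below computes a mean score; the file path and the `score` field name are assumptions to check against your own output.

```python
import json
from statistics import mean

# Assumed layout: one JSON object per line with a numeric "score"
# field; inspect your judgement file for the actual schema and path.
with open("output/Pixtral_singleround.jsonl") as f:
    records = [json.loads(line) for line in f]

scores = [r["score"] for r in records if "score" in r]
print(f"{len(scores)} judgements, mean score {mean(scores):.3f}")
```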
Additional settings allow evaluation based on:
- challenge level
- question type
- image type
- and more (see the grouping sketch after this list).
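As one way to slice results along such dimensions, the snippet below groups judgement scores by a metadata field. Both field names (`challenge_level`, `score`) are hypothetical; substitute the keys actually present in your records.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical keys "challenge_level" and "score"; replace them with
# the metadata fields actually present in your judgement records.
by_level = defaultdict(list)
with open("output/Pixtral_singleround.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if "score" in record:
            by_level[record.get("challenge_level", "unknown")].append(record["score"])

for level, scores in sorted(by_level.items()):
    print(f"{level}: n={len(scores)}, mean={mean(scores):.3f}")
```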
We encourage users to explore and customize their evaluations.
Please contact [email protected] for any queries.
This repository is built on top of the arena-hard-auto repository.
This dataset is released under the CC-BY-NC-SA 4.0 license. Please use it for non-commercial purposes ONLY.
If you use ProBench, please cite:

```bibtex
@misc{yang2025probenchjudgingmultimodalfoundation,
      title={ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks},
      author={Yan Yang and Dongxu Li and Haoning Wu and Bei Chen and Liu Liu and Liyuan Pan and Junnan Li},
      year={2025},
      eprint={2503.06885},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.06885},
}
```