Please install ProBench as follows:
```bash
git clone https://github.com/Yan98/ProBench_eval
cd ProBench_eval
pip install -e .
```
We encourage users to customize models in `gen_answer_vllm.py`. Currently, we provide examples for Pixtral-12B-2409 and Qwen/Qwen2-VL-7B-Instruct.
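As a rough sketch of what a custom vLLM-backed model looks like, the snippet below loads a checkpoint and answers one multimodal query. It is not the actual `gen_answer_vllm.py` interface; the message layout and image URL are illustrative placeholders.

```python
# Minimal vLLM generation sketch -- adapt to gen_answer_vllm.py's
# model handling rather than using this verbatim.
from vllm import LLM, SamplingParams

# Any vLLM-supported multimodal checkpoint should work here.
llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")
params = SamplingParams(temperature=0.0, max_tokens=1024)

# vLLM's chat interface accepts OpenAI-style multimodal messages;
# the image URL below is a placeholder for a ProBench query image.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the key finding in this figure."},
        {"type": "image_url", "image_url": {"url": "https://example.com/query.png"}},
    ],
}]
outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)
```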
- Generating MLLM outputs:

```bash
python3 gen_answer_vllm.py --model Pixtral-12B-2409 --save-name Pixtral
```
- Running judgements. Configure GPT-4o as the evaluation judge:

```bash
export base_url=YOUR_BASE_URL
export api_key=YOUR_API_KEY
python3 gen_judgement.py --model Pixtral-12B-2409 --model-answer-file output/Pixtral.jsonl --judge_model gpt-4o-2024-08-06 --num_workers 64
```
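Under the hood, judging amounts to calls against an OpenAI-compatible endpoint. The sketch below shows roughly what one such request looks like using the environment variables exported above; the system prompt and message contents are placeholders, not ProBench's actual judging template.

```python
import os
from openai import OpenAI

# Reuses the base_url/api_key exported above; the judge prompt here
# is a placeholder, not the template gen_judgement.py actually uses.
client = OpenAI(base_url=os.environ["base_url"], api_key=os.environ["api_key"])

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are an impartial judge of model answers."},
        {"role": "user", "content": "Question: ...\nAnswer: ..."},
    ],
)
print(response.choices[0].message.content)
```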
- Displaying results:

```bash
for track in singleround multi-round multi-linguistic
do
    python3 show_result.py --model Pixtral-12B-2409 --model-answer-file output/Pixtral.jsonl --judgement-file output/Pixtral --track $track
done
```
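If you prefer to post-process the raw judgements yourself instead of going through `show_result.py`, they are plain JSONL. The sketch below computes a mean score; the file path and the `score` field name are assumptions to check against your own output.

```python
import json
from statistics import mean

# Assumed layout: one JSON object per line with a numeric "score"
# field; inspect your judgement file for the actual schema and path.
with open("output/Pixtral_singleround.jsonl") as f:
    records = [json.loads(line) for line in f]

scores = [r["score"] for r in records if "score" in r]
print(f"{len(scores)} judgements, mean score {mean(scores):.3f}")
```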
Additional settings allow evaluation based on:
- challenge level
- question type
- image type
- and more (see the grouping sketch after this list).
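As one way to slice results along such dimensions, the snippet below groups judgement scores by a metadata field. Both field names (`challenge_level`, `score`) are hypothetical; substitute the keys actually present in your records.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical keys "challenge_level" and "score"; replace them with
# the metadata fields actually present in your judgement records.
by_level = defaultdict(list)
with open("output/Pixtral_singleround.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if "score" in record:
            by_level[record.get("challenge_level", "unknown")].append(record["score"])

for level, scores in sorted(by_level.items()):
    print(f"{level}: n={len(scores)}, mean={mean(scores):.3f}")
```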
We encourage users to explore and customize their evaluations.
Please contact [email protected] for any queries.
This repository is built on top of the arena-hard-auto repository.
This dataset is released under the CC-BY-NC-SA 4.0 license. Please use it for non-commercial purposes ONLY.
If you use ProBench, please cite:

```bibtex
@misc{yang2025probenchjudgingmultimodalfoundation,
      title={ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks},
      author={Yan Yang and Dongxu Li and Haoning Wu and Bei Chen and Liu Liu and Liyuan Pan and Junnan Li},
      year={2025},
      eprint={2503.06885},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.06885},
}
```