ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks


Introduction

Solving expert-level multimodal tasks is a key milestone towards general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to improve, evaluating such advanced multimodal intelligence becomes necessary yet challenging. To this end, we introduce ProBench, a benchmark of open-ended multimodal queries that require intensive expert-level knowledge to solve. ProBench spans 10 task fields and 56 sub-fields, covers 17 languages, and supports multi-turn conversations of up to 13 turns.

Example

ProBench focuses on open-ended expert tasks. Sample evaluations:

[Figure: sample ProBench evaluations]

Install

Please install ProBench as follows:

git clone https://github.com/Yan98/ProBench_eval
cd ProBench_eval
pip install -e .

[Custom Use] Evaluating on ProBench

We encourage users to add their own models in gen_answer_vllm.py. Currently, we provide examples for Pixtral-12B-2409 and Qwen/Qwen2-VL-7B-Instruct.
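For orientation, the sketch below shows the general shape of multimodal generation through vLLM's chat interface, which is what a custom model entry ultimately has to drive. It is a minimal sketch only: the model ID, image URL, prompt, and sampling settings are placeholders, and the exact function signature expected by gen_answer_vllm.py may differ.

# Minimal sketch of multimodal generation with vLLM's chat API.
# NOTE: illustrative only; gen_answer_vllm.py may expect a different hook.
from vllm import LLM, SamplingParams

# Placeholder model ID; Pixtral additionally needs the mistral tokenizer mode.
llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")
params = SamplingParams(temperature=0.0, max_tokens=1024)

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the circuit shown in the image."},
        # Placeholder image URL.
        {"type": "image_url", "image_url": {"url": "https://example.com/circuit.png"}},
    ],
}]

# llm.chat applies the model's chat template before generation.
outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)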

  1. Generating MLLM outputs
python3 gen_answer_vllm.py --model Pixtral-12B-2409 --save-name Pixtral 
  2. Running judgements, with GPT-4o configured as the evaluation judge (a sketch of a single judge call follows this list):
export base_url=YOUR_BASE_URL
export api_key=YOUR_API_KEY
python3 gen_judgement.py --model Pixtral-12B-2409 --model-answer-file output/Pixtral.jsonl --judge_model gpt-4o-2024-08-06 --num_workers 64 
  3. Displaying results
for track in singleround multi-round multi-linguistic
do
    python3 show_result.py --model Pixtral-12B-2409 --model-answer-file output/Pixtral.jsonl --judgement-file output/Pixtral --track $track
done
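The base_url and api_key environment variables point gen_judgement.py at an OpenAI-compatible endpoint, so a single judge request presumably boils down to something like the sketch below. The system and user prompts here are illustrative stand-ins, not the script's actual judging template.

# Sketch of one judge request against an OpenAI-compatible endpoint.
# The prompts are illustrative; gen_judgement.py uses its own template.
import os

from openai import OpenAI

client = OpenAI(base_url=os.environ["base_url"], api_key=os.environ["api_key"])
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are an impartial judge of model answers."},
        {"role": "user", "content": "Question: ...\nModel answer: ...\nRate the answer."},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)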

Additional settings allow evaluation by:

  • challenge level
  • question type
  • image type
  • more...

We encourage users to explore and customize their evaluations.
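For instance, since answers and judgements are stored as JSONL, a custom breakdown can be computed with a few lines of standard-library Python. The field name "challenge_level" below is a hypothetical placeholder; inspect the records in your own output files for the real keys.

# Sketch: group judgement records by a metadata field for a custom breakdown.
# "challenge_level" is a HYPOTHETICAL key; check your output/*.jsonl records.
import json
from collections import defaultdict

groups = defaultdict(list)
with open("output/Pixtral.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        groups[record.get("challenge_level", "unknown")].append(record)

for level, records in sorted(groups.items()):
    print(f"{level}: {len(records)} records")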

Contact

Please contact [email protected] for any queries.

Acknowledgement

This repository is built on top of the arena-hard-auto repository.

License

This dataset is released under the CC BY-NC-SA 4.0 license. Please use it for non-commercial purposes ONLY.

Citation

@misc{yang2025probenchjudgingmultimodalfoundation,
      title={ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks}, 
      author={Yan Yang and Dongxu Li and Haoning Wu and Bei Chen and Liu Liu and Liyuan Pan and Junnan Li},
      year={2025},
      eprint={2503.06885},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.06885}, 
}
