- [2024-12] Our arXiv paper has been released: arXiv Paper
- [2024-12] Our Bench-CoE repository has been established; currently, only inference code is included: GitHub Repository
Bench-CoE introduces a novel framework for expert collaboration through benchmark-driven approaches. This work pioneers subject-level expert collaboration, moving beyond traditional query-level methods to achieve more efficient and generalizable model cooperation.
Our framework introduces a simple yet effective approach for expert collaboration:
- Subject Router (a routing sketch follows the lists below):
  - BERT-based subject classification
  - Efficient subject-level task distribution
  - Low computational overhead
- Expert Models:
  - Pre-trained LLMs as subject experts
  - No additional training required
  - Direct deployment of existing models
- Simple Aggregation:
  - Straightforward answer selection
  - Lightweight combination strategy
  - Efficient inference process
- Simplicity: Minimal architectural modifications to existing models
- Efficiency: Direct utilization of pre-trained models without fine-tuning
- Scalability: Easy integration of new expert models
- Practicality: Low computational and resource requirements
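To make the pipeline concrete, here is a minimal routing sketch. It assumes the released router checkpoints (e.g. `anonymous/subject_bert_mmlu_pro`) are standard Hugging Face sequence-classification models whose labels are subject names; the subject-to-expert mapping and expert model ids below are purely illustrative, not the configuration used in the paper.

```python
# Minimal sketch of subject-level routing; not the exact repo implementation.
from transformers import pipeline

# Assumption: the router is a standard sequence-classification checkpoint
# whose labels are subject names.
router = pipeline("text-classification", model="anonymous/subject_bert_mmlu_pro")

# Hypothetical subject -> expert mapping; the expert ids are placeholders.
SUBJECT_TO_EXPERT = {
    "math": "mistralai/Mathstral-7B-v0.1",
    "default": "google/gemma-2-9b-it",
}

def route(question: str) -> str:
    """Return the expert model that should answer this question."""
    subject = router(question)[0]["label"]
    return SUBJECT_TO_EXPERT.get(subject, SUBJECT_TO_EXPERT["default"])

# The selected expert then generates the answer; with one expert per query,
# "simple aggregation" reduces to returning that expert's output directly.
print(route("Compute the derivative of x**2 with respect to x."))
```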
Our experimental validation follows a three-stage progressive approach, systematically demonstrating the framework's capabilities:
Performance Results

Our framework demonstrates significant improvements on two key benchmarks:
- Best base Model (Gemma-2-9b-it): 52.04%
- Bench-CoE (Subject-Level): 52.24% (+0.2%)
- Bench-CoE (Query-Level): 64.28% (+12.24%)
- Key achievement: Substantial improvement through query-level routing
- Best base Model (InternVL2-8B): 47.67%
- Bench-CoE (Subject-Level): 51.78% (+4.11%)
- Key achievement: Effective subject-level knowledge organization
Key Findings:
- Query-level routing shows strong performance on in-distribution tasks
- Subject-level approach demonstrates promising potential
- Significant improvements over state-of-the-art base models
Domain-Specific Performance Analysis
Our framework demonstrates robust performance on in-distribution tasks, validating the effectiveness of both query-level and subject-level approaches:
- Best base Model (Gemma-2-9b-it): 66.14%
- Bench-CoE (Query-Level): 67.01% (+0.87%)
- Key achievement: Consistent improvement in commonsense reasoning
- Best base Model (InternVL2-8B): 47.67%
- Bench-CoE (Subject-Level): 50.78% (+3.11%)
- Key achievement: Strong performance in multimodal understanding
Key Findings:
- Both approaches demonstrate consistent improvements over strong baselines
- Bench-CoE (Subject-Level) demonstrates stronger generalization ability and robustness as the training and test distributions begin to diverge
Cross-Dataset Generalization Analysis
Our framework demonstrates strong generalization capabilities across different domains and datasets:
- Training: MMLU-Pro dataset for expert construction
- Testing: Big-Bench-Hard (BBH) for out-of-distribution evaluation
- Results:
- Best base Model (Mathstral-7B-v0.1): 66.35%
- Bench-CoE (Subject-Level): 69.91% (+3.56%)
- Bench-CoE (Query-Level): 67.07% (+0.72%)
- Key achievement: Superior subject-level generalization to complex reasoning tasks
- Training: MMMU dataset for expert construction
- Testing: MMStar for cross-domain evaluation
- Results:
- Best base Model (InternVL2-8B): 59.22%
- Bench-CoE (Subject-Level): 60.09% (+0.87%)
- Bench-CoE (Query-Level): 56.00% (-3.22%)
- Key achievement: Robust subject-level transfer in multimodal scenarios
Key Findings:
- Subject-level approach shows superior generalization ability
- Effective knowledge transfer across different task distributions
- Robust performance in both language and multimodal domains
- Demonstrates the scalability of benchmark-driven expert construction
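As a rough illustration of benchmark-driven expert construction, the sketch below fine-tunes `bert-base-uncased` as a subject classifier on MMLU-Pro. The hub id `TIGER-Lab/MMLU-Pro` and the `question`/`category` field names are assumptions about the public MMLU-Pro release, and the hyperparameters are placeholders rather than the settings used in the paper.

```python
# Hedged sketch: fine-tune bert-base-uncased as a subject router on MMLU-Pro.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumption: the public MMLU-Pro release with "question" and "category" columns.
raw = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
subjects = sorted(set(raw["category"]))
label2id = {s: i for i, s in enumerate(subjects)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(example):
    enc = tokenizer(example["question"], truncation=True, max_length=256)
    enc["labels"] = label2id[example["category"]]
    return enc

dataset = raw.map(preprocess, remove_columns=raw.column_names)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(subjects),
    id2label={i: s for s, i in label2id.items()},
    label2id=label2id,
)

# Placeholder hyperparameters; batches are padded automatically because a
# tokenizer is supplied to the Trainer.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="subject_bert_mmlu_pro", num_train_epochs=3),
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```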
- `environment.yml`: Contains a list of Python dependencies and their versions, essential for setting up the development environment.
The scripts in the `/coe_evaluation` directory are designed to evaluate various aspects of the trained models:
- `eval_bbh_vllm_query.py`: Evaluates query-level Bench-CoE on the Big-Bench-Hard dataset
- `eval_bbh_vllm_subject.py`: Evaluates subject-level Bench-CoE on the Big-Bench-Hard dataset
- `eval_hellaswag_vllm_query.py`: Evaluates query-level Bench-CoE on the HellaSwag dataset
- `eval_mmlu_pro_vllm_query.py`: Evaluates query-level Bench-CoE on the MMLU-Pro dataset
- `eval_mmlu_pro_vllm_subject.py`: Evaluates subject-level Bench-CoE on the MMLU-Pro dataset
- `eval_winogrand_vllm_query.py`: Evaluates query-level Bench-CoE on the Winogrande dataset
- lmms-eval: Framework for standardized multimodal evaluation
- Custom evaluation: Scripts for specific multimodal tasks (MMMU, SQA)
- Setting Up the Conda Environment
# Create and activate environment
conda env create -f environment.yml
conda activate bench-coe
- Downloading Pre-trained BERT Router Models
# Clone router model repositories
git clone https://huggingface.co/anonymous/subject_bert_mmlu_pro
git clone https://huggingface.co/anonymous/query_bert_mmlu_pro
git clone https://huggingface.co/anonymous/query_bert_hellaswag
git clone https://huggingface.co/anonymous/query_bert_winogrande
- Downloading and Setting Up Large Models
- Download required sub-models to your local environment
- Modify model paths in configuration files accordingly
- Downloading Datasets
- Obtain relevant datasets from Hugging Face
- Ensure dataset compatibility with evaluation requirements
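For example, most of these datasets can be fetched directly from the Hugging Face Hub; the hub ids below are commonly used public copies (an assumption, not paths taken from the scripts), so adjust them to whatever the evaluation code expects.

```python
# Example of fetching evaluation data from the Hugging Face Hub.
# The hub ids are common public copies; adjust to what the eval scripts expect.
from datasets import load_dataset

mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
hellaswag = load_dataset("Rowan/hellaswag", split="validation")

print(mmlu_pro[0])    # inspect the fields the evaluation scripts will read
print(hellaswag[0])
```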
- Running Evaluations
# BBH Evaluation
python eval_bbh_vllm_query.py # Query-level evaluation
python eval_bbh_vllm_subject.py # Subject-level evaluation
# MMLU-Pro Evaluation
python eval_mmlu_pro_vllm_query.py
python eval_mmlu_pro_vllm_subject.py
# Additional Tasks
python eval_hellaswag_vllm_query.py
python eval_winogrand_vllm_query.py
- Environment Setup
  - Install lmms-eval and required dependencies
  - Set up the selected model environment
- Model Setup
  - Download the pre-trained router model from anonymous/subject_bert_mmmu
  - Place `coemodel.py` in `lmms_eval/models`
  - Modify the `AVAILABLE_MODELS` list in `__init__.py` (see the snippet after the launch command below)
- Running Evaluation
CUDA_VISIBLE_DEVICES=0 python3 -m accelerate.commands.launch \
--num_processes=8 -m lmms_eval \
--model coemodel \
--model_args pretrained="None" \
--tasks name --batch_size 1 \
--log_samples --log_samples_suffix coemodel \
--output_path ./logs/
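The `AVAILABLE_MODELS` edit mentioned in the Model Setup step above might look roughly like the following, assuming lmms-eval's dict-style registry and a hypothetical `CoEModel` class name inside `coemodel.py`:

```python
# lmms_eval/models/__init__.py (sketch of the edit; key and class names are assumptions)
AVAILABLE_MODELS = {
    # ... existing entries, e.g. "llava": "Llava", ...
    "coemodel": "CoEModel",  # maps --model coemodel to the class defined in coemodel.py
}
```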
- Environment Setup
  - Install TinyLLaVA_Factory and Bunny
# Create and activate environment
conda create -n bench-coe python=3.10 -y
conda activate bench-coe
cd /path/to/your/Bench-CoE/multimodal_evaluation/mm_eval/TinyLLaVA_Factory
pip install -e .
pip install xformers==0.0.20
- Prepare required datasets
- Model Setup
  - Download the pre-trained router model
  - Configure model paths and parameters
- Running Evaluation
CUDA_VISIBLE_DEVICES=0 bash scripts/eval_coe_mmmu.sh
- Verify all dependencies are correctly installed
- Check GPU memory requirements
- Ensure model paths are properly configured
- If you need to add a model, modify the code according to the loading method of the corresponding model.
- Our router model is built on `bert-base-uncased`. We extend our gratitude to the Hugging Face community for providing open access to this foundational model, which has significantly propelled our research and development efforts.
- Our multimodal experiments are built upon the lmms-eval project. Great work!
@misc{wang2024benchcoeframeworkcollaborationexperts,
title={Bench-CoE: a Framework for Collaboration of Experts from Benchmark},
author={Yuanshuai Wang and Xingjian Zhang and Jinkun Zhao and Siwei Wen and Peilin Feng and Shuhao Liao and Lei Huang and Wenjun Wu},
year={2024},
eprint={2412.04167},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2412.04167},
}