- [2024-12] Our arXiv paper has been released: arXiv Paper
- [2024-12] Our Bench-CoE repository has been established; currently, only inference code is included: GitHub Repository
Bench-CoE introduces a novel framework for expert collaboration through benchmark-driven approaches. This work pioneers subject-level expert collaboration, moving beyond traditional query-level methods to achieve more efficient and generalizable model cooperation.
Our framework introduces a simple yet effective approach for expert collaboration:
- Subject Router (a routing sketch follows the lists below):
  - BERT-based subject classification
  - Efficient subject-level task distribution
  - Low computational overhead
- Expert Models:
  - Pre-trained LLMs as subject experts
  - No additional training required
  - Direct deployment of existing models
- Simple Aggregation:
  - Straightforward answer selection
  - Lightweight combination strategy
  - Efficient inference process
- Simplicity: Minimal architectural modifications to existing models
- Efficiency: Direct utilization of pre-trained models without fine-tuning
- Scalability: Easy integration of new expert models
- Practicality: Low computational and resource requirements
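To make the pipeline concrete, here is a minimal routing sketch. It assumes the released router checkpoints (e.g. `anonymous/subject_bert_mmlu_pro`) are standard Hugging Face sequence-classification models whose labels are subject names; the subject-to-expert mapping and expert model ids below are purely illustrative, not the configuration used in the paper.

```python
# Minimal sketch of subject-level routing; not the exact repo implementation.
from transformers import pipeline

# Assumption: the router is a standard sequence-classification checkpoint
# whose labels are subject names.
router = pipeline("text-classification", model="anonymous/subject_bert_mmlu_pro")

# Hypothetical subject -> expert mapping; the expert ids are placeholders.
SUBJECT_TO_EXPERT = {
    "math": "mistralai/Mathstral-7B-v0.1",
    "default": "google/gemma-2-9b-it",
}

def route(question: str) -> str:
    """Return the expert model that should answer this question."""
    subject = router(question)[0]["label"]
    return SUBJECT_TO_EXPERT.get(subject, SUBJECT_TO_EXPERT["default"])

# The selected expert then generates the answer; with one expert per query,
# "simple aggregation" reduces to returning that expert's output directly.
print(route("Compute the derivative of x**2 with respect to x."))
```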
Our experimental validation follows a three-stage progressive approach, systematically demonstrating the framework's capabilities:
Performance Results

Our framework demonstrates significant improvements on two key benchmarks:
- Best base Model (Gemma-2-9b-it): 52.04%
- Bench-CoE (Subject-Level): 52.24% (+0.2%)
- Bench-CoE (Query-Level): 64.28% (+12.24%)
- Key achievement: Substantial improvement through query-level routing
- Best base Model (InternVL2-8B): 47.67%
- Bench-CoE (Subject-Level): 51.78% (+4.11%)
- Key achievement: Effective subject-level knowledge organization
Key Findings:
- Query-level routing shows strong performance on in-distribution tasks
- Subject-level approach demonstrates promising potential
- Significant improvements over state-of-the-art base models
Domain-Specific Performance Analysis
Our framework demonstrates robust performance on in-distribution tasks, validating the effectiveness of both query-level and subject-level approaches:
- Best base Model (Gemma-2-9b-it): 66.14%
- Bench-CoE (Query-Level): 67.01% (+0.87%)
- Key achievement: Consistent improvement in commonsense reasoning
- Best base Model (InternVL2-8B): 47.67%
- Bench-CoE (Subject-Level): 50.78% (+3.11%)
- Key achievement: Strong performance in multimodal understanding
Key Findings:
- Both approaches demonstrate consistent improvements over strong baselines
- Bench-CoE (Subject-Level) demonstrates stronger generalization ability and robustness as the training and test distributions begin to diverge
Cross-Dataset Generalization Analysis
Our framework demonstrates strong generalization capabilities across different domains and datasets:
- Training: MMLU-Pro dataset for expert construction
- Testing: Big-Bench-Hard (BBH) for out-of-distribution evaluation
- Results:
- Best base Model (Mathstral-7B-v0.1): 66.35%
- Bench-CoE (Subject-Level): 69.91% (+3.56%)
- Bench-CoE (Query-Level): 67.07% (+0.72%)
- Key achievement: Superior subject-level generalization to complex reasoning tasks
- Training: MMMU dataset for expert construction
- Testing: MMStar for cross-domain evaluation
- Results:
- Best base Model (InternVL2-8B): 59.22%
- Bench-CoE (Subject-Level): 60.09% (+0.87%)
- Bench-CoE (Query-Level): 56.00% (-3.22%)
- Key achievement: Robust subject-level transfer in multimodal scenarios
Key Findings:
- Subject-level approach shows superior generalization ability
- Effective knowledge transfer across different task distributions
- Robust performance in both language and multimodal domains
- Demonstrates the scalability of benchmark-driven expert construction
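As a rough illustration of benchmark-driven expert construction, the sketch below fine-tunes `bert-base-uncased` as a subject classifier on MMLU-Pro. The hub id `TIGER-Lab/MMLU-Pro` and the `question`/`category` field names are assumptions about the public MMLU-Pro release, and the hyperparameters are placeholders rather than the settings used in the paper.

```python
# Hedged sketch: fine-tune bert-base-uncased as a subject router on MMLU-Pro.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumption: the public MMLU-Pro release with "question" and "category" columns.
raw = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
subjects = sorted(set(raw["category"]))
label2id = {s: i for i, s in enumerate(subjects)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(example):
    enc = tokenizer(example["question"], truncation=True, max_length=256)
    enc["labels"] = label2id[example["category"]]
    return enc

dataset = raw.map(preprocess, remove_columns=raw.column_names)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(subjects),
    id2label={i: s for s, i in label2id.items()},
    label2id=label2id,
)

# Placeholder hyperparameters; batches are padded automatically because a
# tokenizer is supplied to the Trainer.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="subject_bert_mmlu_pro", num_train_epochs=3),
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```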
- `environment.yml`: Contains a list of Python dependencies and their versions, essential for setting up the development environment.
The scripts in the `/coe_evaluation` directory are designed to evaluate various aspects of the trained models:
- `eval_bbh_vllm_query.py`: Evaluates query-level Bench-CoE on the Big-Bench-Hard dataset
- `eval_bbh_vllm_subject.py`: Evaluates subject-level Bench-CoE on the Big-Bench-Hard dataset
- `eval_hellaswag_vllm_query.py`: Evaluates query-level Bench-CoE on the HellaSwag dataset
- `eval_mmlu_pro_vllm_query.py`: Evaluates query-level Bench-CoE on the MMLU-Pro dataset
- `eval_mmlu_pro_vllm_subject.py`: Evaluates subject-level Bench-CoE on the MMLU-Pro dataset
- `eval_winogrand_vllm_query.py`: Evaluates query-level Bench-CoE on the Winogrande dataset
- lmms-eval: Framework for standardized multimodal evaluation
- Custom evaluation: Scripts for specific multimodal tasks (MMMU, SQA)
- Setting Up the Conda Environment
# Create and activate environment
conda env create -f environment.yml
conda activate bench-coe
- Downloading Pre-trained BERT Router Models
# Clone router model repositories
git clone https://huggingface.co/anonymous/subject_bert_mmlu_pro
git clone https://huggingface.co/anonymous/query_bert_mmlu_pro
git clone https://huggingface.co/anonymous/query_bert_hellaswag
git clone https://huggingface.co/anonymous/query_bert_winogrande
- Downloading and Setting Up Large Models
- Download required sub-models to your local environment
- Modify model paths in configuration files accordingly
- Downloading Datasets
- Obtain relevant datasets from Hugging Face
- Ensure dataset compatibility with evaluation requirements
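For example, most of these datasets can be fetched directly from the Hugging Face Hub; the hub ids below are commonly used public copies (an assumption, not paths taken from the scripts), so adjust them to whatever the evaluation code expects.

```python
# Example of fetching evaluation data from the Hugging Face Hub.
# The hub ids are common public copies; adjust to what the eval scripts expect.
from datasets import load_dataset

mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
hellaswag = load_dataset("Rowan/hellaswag", split="validation")

print(mmlu_pro[0])    # inspect the fields the evaluation scripts will read
print(hellaswag[0])
```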
- Running Evaluations
# BBH Evaluation
python eval_bbh_vllm_query.py # Query-level evaluation
python eval_bbh_vllm_subject.py # Subject-level evaluation
# MMLU-Pro Evaluation
python eval_mmlu_pro_vllm_query.py
python eval_mmlu_pro_vllm_subject.py
# Additional Tasks
python eval_hellaswag_vllm_query.py
python eval_winogrand_vllm_query.py
- Environment Setup
  - Install lmms-eval and required dependencies
  - Set up the selected model environment
- Model Setup
  - Download the pre-trained router model from anonymous/subject_bert_mmmu
  - Place `coemodel.py` in `lmms_eval/models`
  - Modify the `AVAILABLE_MODELS` list in `__init__.py` (see the snippet after the launch command below)
- Running Evaluation
CUDA_VISIBLE_DEVICES=0 python3 -m accelerate.commands.launch \
--num_processes=8 -m lmms_eval \
--model coemodel \
--model_args pretrained="None" \
--tasks name --batch_size 1 \
--log_samples --log_samples_suffix coemodel \
--output_path ./logs/
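The `AVAILABLE_MODELS` edit mentioned in the Model Setup step above might look roughly like the following, assuming lmms-eval's dict-style registry and a hypothetical `CoEModel` class name inside `coemodel.py`:

```python
# lmms_eval/models/__init__.py (sketch of the edit; key and class names are assumptions)
AVAILABLE_MODELS = {
    # ... existing entries, e.g. "llava": "Llava", ...
    "coemodel": "CoEModel",  # maps --model coemodel to the class defined in coemodel.py
}
```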
- Environment Setup
  - Install TinyLLaVA_Factory and Bunny
# Create and activate environment
conda create -n bench-coe python=3.10 -y
conda activate bench-coe
cd /path/to/your/Bench-CoE/multimodal_evaluation/mm_eval/TinyLLaVA_Factory
pip install -e .
pip install xformers==0.0.20
- Prepare required datasets
- Model Setup
  - Download the pre-trained router model
  - Configure model paths and parameters
- Running Evaluation
CUDA_VISIBLE_DEVICES=0 bash scripts/eval_coe_mmmu.sh
- Verify all dependencies are correctly installed
- Check GPU memory requirements
- Ensure model paths are properly configured
- If you need to add a model, modify the code according to the loading method of the corresponding model.
- Our router model is built on `bert-base-uncased`. We extend our gratitude to the Hugging Face community for providing open access to this foundational model, which has significantly propelled our research and development efforts.
- Our multimodal experiments are built upon the lmms-eval project. Great work!
@misc{wang2024benchcoeframeworkcollaborationexperts,
title={Bench-CoE: a Framework for Collaboration of Experts from Benchmark},
author={Yuanshuai Wang and Xingjian Zhang and Jinkun Zhao and Siwei Wen and Peilin Feng and Shuhao Liao and Lei Huang and Wenjun Wu},
year={2024},
eprint={2412.04167},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2412.04167},
}