CLEVA is a Chinese Language Models EVAluation Platform developed by CUHK LaVi Lab. CLEVA would like to thank Shanghai AI Lab for the great collaboration in the process. The main features of CLEVA include:
- A comprehensive Chinese Benchmark, featuring 31 tasks (11 application assessments + 20 ability evaluation tasks), with a total of 370K Chinese test samples (33.98% of which are newly collected, mitigating data contamination issues).
- A standardized Prompt-Based Evaluation Methodology, incorporating unified pre-processing for all data and using a consistent set of Chinese prompt templates for evaluation.
- A trustworthy Leaderboard, as CLEVA uses a significant amount of new data to minimize data contamination and regularly organizes evaluations.
The leaderboard is evaluated and maintained by CLEVA using new test data. Past leaderboard data (processed test samples, annotated prompt templates, etc.) are made available to users for local evaluation runs.
- [2023.11.02] Thanks to the Stanford CRFM HELM team for their support! CLEVA has now been integrated into the latest release of HELM. You can use CLEVA to evaluate your own models locally via HELM.
- [2023.09.30] CLEVA has been accepted to EMNLP 2023 System Demonstrations!
- [2023.08.09] Our paper for CLEVA is out!
CLEVA has been integrated into HELM, and we thank the Stanford CRFM HELM team for their support. Users can employ CLEVA's datasets, prompt templates, perturbations, and Chinese automatic metrics for local evaluations via HELM.
Note
If you want to evaluate your models on CLEVA online, please contact us via [email protected] for authentication and check out 📘Documentation for API development.
Users can refer to the installation guide of HELM for setting up the Python environment and dependencies (`Python>=3.8`).
Installation Using Anaconda
Here is an example for installation using Anaconda:
Create the environment first:
```shell
# Create the virtual environment (only needs to run once)
conda create -n cleva python=3.8 pip
# Activate the virtual environment
conda activate cleva
```
Then install the dependencies:
```shell
pip install crfm-helm
```
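As an optional sanity check once the environment is active, the sketch below tests whether a version string satisfies the `Python>=3.8` requirement mentioned above. The helper name `meets_min_python` is hypothetical and is not part of HELM or CLEVA.

```shell
# Hypothetical helper (not part of HELM or CLEVA): succeeds when a
# "major.minor[.patch]" version string is at least 3.8.
meets_min_python() {
  major=${1%%.*}      # text before the first dot
  rest=${1#*.}        # text after the first dot
  minor=${rest%%.*}   # text before the next dot, if any
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 8 ]; }
}

# Check the interpreter on PATH, if there is one.
if command -v python >/dev/null 2>&1; then
  found=$(python -c 'import sys; print("%d.%d" % sys.version_info[:2])')
  if meets_min_python "$found"; then
    echo "Python $found satisfies Python>=3.8"
  else
    echo "Python $found is older than 3.8"
  fi
fi
```

Inside the `cleva` environment created above, the check is expected to pass because the environment pins `python=3.8`.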
Example command to evaluate `gpt-3.5-turbo-0613` on CLEVA's Chinese-to-English translation task using HELM:
```shell
helm-run \
  -r "cleva:model=openai/gpt-3.5-turbo-0613,task=translation,subtask=zh2en,prompt_id=0,version=v1,data_augmentation=cleva" \
  --num-train-trials <num_trials> \
  --max-eval-instances <max_eval_instances> \
  --suite <suite_id>
```
Explanation of parameters in `-r` (run configuration):
- `task` represents one of the 31 tasks included in CLEVA;
- `subtask` specifies the subcategory under each CLEVA task;
- `prompt_id` is the index of CLEVA's annotated prompt templates (starting from 0);
- `version` is the version number of the CLEVA dataset (currently only the `v1` dataset used in the paper is provided);
- `data_augmentation` specifies the data augmentation strategy, where values like `cleva_robustness`, `cleva_fairness`, and `cleva` are unique to CLEVA for evaluating Chinese language robustness, fairness, and both, respectively.
For other parameters, please refer to HELM's tutorial.
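To show how these parameters combine into a single `-r` value, here is a minimal sketch of a helper that assembles the string. The function name `cleva_run_entry` is hypothetical. Omitting the `subtask` key when no subtask is given is an assumption based on the `task=fact_checking` scenario in the comparison table, which lists no subtask.

```shell
# Hypothetical helper: prints a CLEVA run entry for helm-run's -r option.
# Arguments: model, task, subtask (may be empty), prompt_id, data_augmentation.
cleva_run_entry() {
  model=$1; task=$2; subtask=$3; prompt_id=$4; augmentation=$5
  entry="cleva:model=${model},task=${task}"
  # Assumption: tasks without subcategories simply leave the key out.
  if [ -n "$subtask" ]; then
    entry="${entry},subtask=${subtask}"
  fi
  # Only the v1 dataset is currently provided, so the version is fixed.
  entry="${entry},prompt_id=${prompt_id},version=v1,data_augmentation=${augmentation}"
  printf '%s\n' "$entry"
}

# Rebuild the entry from the translation example.
cleva_run_entry openai/gpt-3.5-turbo-0613 translation zh2en 0 cleva
```

The printed string can then be passed to `helm-run -r`, which makes it easier to loop over several `subtask` or `prompt_id` values without editing a long quoted string by hand.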
The full list of available `task`, `subtask`, and `prompt_id` values of CLEVA (`version=v1`) can be found in HELM's .conf file. Users can run the entire CLEVA evaluation suite using the following command (the running time for reproducing CLEVA results can be found in the paper):
```shell
helm-run \
  -c src/helm/benchmark/presentation/run_specs_cleva_v1.conf \
  --num-train-trials <num_trials> \
  --max-eval-instances <max_eval_instances> \
  --suite <suite_id>
```
Generally, setting `--max-eval-instances` to over 5000 ensures all CLEVA task data are used for evaluation.
Comparison between the results obtained using HELM for evaluating `gpt-3.5-turbo-0613` on selected CLEVA tasks (`version=v1`) and those from the CLEVA platform:
| Scenario | Metric | Reproduced in HELM | Evaluated by CLEVA |
|---|---|---|---|
| task=summarization,subtask=dialogue_summarization | ROUGE-2 | 0.3045 | 0.3065 |
| task=translation,subtask=en2zh | SacreBLEU | 60.48 | 59.23 |
| task=fact_checking | Exact Match | 0.4595 | 0.4528 |
| task=bias,subtask=dialogue_region_bias | Micro F1 | 0.5656 | 0.5589 |
Note
The differences mainly stem from two factors: different random seeds select different in-context demonstrations, and the ChatGPT versions used by CLEVA and HELM are not completely aligned.
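To make the size of these gaps explicit, the following sketch recomputes the signed difference (HELM score minus CLEVA score) for each row. The scores are copied from the table; the function name `score_gaps` and the row labels are illustrative only.

```shell
# Illustrative helper: prints "label: signed difference" for each table row.
# Input columns (separated by "|"): label, HELM score, CLEVA score.
score_gaps() {
  awk -F'|' '{ printf "%s: %+.4f\n", $1, $2 - $3 }' <<'EOF'
dialogue_summarization (ROUGE-2)|0.3045|0.3065
translation en2zh (SacreBLEU)|60.48|59.23
fact_checking (Exact Match)|0.4595|0.4528
dialogue_region_bias (Micro F1)|0.5656|0.5589
EOF
}

score_gaps
```

Note that SacreBLEU is reported on a 0-100 scale while the other three metrics lie between 0 and 1, so the raw differences are not directly comparable across rows.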
If you want to use CLEVA data for evaluation with your own code, you can download the data by running:
```shell
bash download_data.sh
```
After a successful run, a folder named after the data version will be generated in the current directory, containing the data of each CLEVA task. You can specify the data version by passing arguments to `download_data.sh`. It is `v1` by default.
CLEVA is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
You should have received a copy of the license along with this work. If not, see https://creativecommons.org/licenses/by-nc-nd/4.0/.
Please cite our paper if you use CLEVA in your work:
```bibtex
@misc{li2023cleva,
      title={CLEVA: Chinese Language Models EVAluation Platform},
      author={Yanyang Li and Jianqiao Zhao and Duo Zheng and Zi-Yuan Hu and Zhi Chen and Xiaohui Su and Yongfeng Huang and Shijia Huang and Dahua Lin and Michael R. Lyu and Liwei Wang},
      year={2023},
      eprint={2308.04813},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```