Skip to content

Commit

Permalink
add vlmevalkit doc
Browse files Browse the repository at this point in the history
  • Loading branch information
Yunnglin committed Nov 25, 2024
1 parent 65fdd6f commit 470318b
Showing 1 changed file with 313 additions and 0 deletions.
313 changes: 313 additions & 0 deletions LLM-tutorial/VLMEvalKit多模态模型评估.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,313 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 使用VLMEvalKit进行多模态模型评估\n",
"\n",
"VLMEvalKit (python 包名为 vlmeval) 是一款专为大型视觉语言模型 (Large Vision-Language Models, LVLMs) 评测而设计的开源工具包。该工具支持在各种基准测试上对大型视觉语言模型进行一键评估,无需进行繁重的数据准备工作,让评估过程更加简便。\n",
"\n",
"以下展示两种方式进行模型评估:\n",
"1. 使用EvalScope封装的VLMEvalKit评测接口\n",
"2. 直接使用VLMEvalKit评测接口"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. 使用EvalScope封装的VLMEvalKit评测接口\n",
"\n",
"[EvalScope](https://github.com/modelscope/evalscope) 是魔搭社区官方推出的模型评估与性能基准测试框架,内置多个常用测试基准和评估指标,如MMLU、CMMLU、C-Eval、GSM8K、ARC、HellaSwag、TruthfulQA、MATH和HumanEval等;支持多种类型的模型评测,包括LLM、多模态LLM、embedding模型和reranker模型。EvalScope还适用于多种评测场景,如端到端RAG评测、竞技场模式和模型推理性能压测等。此外,通过ms-swift训练框架的无缝集成,可一键发起评测,实现了模型训练到评测的全链路支持。\n",
"\n",
"使用指南:[EvalScope使用指南](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/vlmevalkit_backend.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 环境准备"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"!pip install evalscope[vlmeval] -U\n",
"!pip install ms-swift -U"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 部署模型"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"!CUDA_VISIBLE_DEVICES=0 swift deploy --model_type internvl2-8b --port 8000"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-11-25 19:48:03,083 - evalscope - INFO - ** Args: Task config is provided with dictionary type. **\n",
"2024-11-25 19:48:03,084 - evalscope - INFO - Check VLM Evaluation Kit: Installed\n",
"2024-11-25 19:48:03,085 - evalscope - INFO - *** Run task with config: Arguments(data=['SEEDBench_IMG', 'ChartQA_TEST'], model=['internvl2-8b'], fps=-1, nframe=8, pack=False, use_subtitle=False, work_dir='outputs', mode='all', nproc=16, retry=None, judge='exact_matching', verbose=False, ignore=False, reuse=False, limit=30, config=None, OPENAI_API_KEY='EMPTY', OPENAI_API_BASE=None, LOCAL_LLM=None) \n",
"\n",
"[2024-11-25 19:48:03,085] WARNING - RUN - run.py: run_task - 145: --reuse is not set, will not reuse previous (before one day) temporary files\n",
"2024-11-25 19:48:03,085 - RUN - WARNING - --reuse is not set, will not reuse previous (before one day) temporary files\n",
"[2024-11-25 19:48:07,410] INFO - ChatAPI - gpt.py: __init__ - 135: Using API Base: http://localhost:8000/v1/chat/completions; API Key: EMPTY\n",
"2024-11-25 19:48:07,410 - ChatAPI - INFO - Using API Base: http://localhost:8000/v1/chat/completions; API Key: EMPTY\n",
" 0%| | 0/10 [00:00<?, ?it/s][2024-11-25 19:48:08,634] INFO - ChatAPI - base.py: generate - 248: C. The man's hair is longer than his beard\n",
"2024-11-25 19:48:08,634 - ChatAPI - INFO - C. The man's hair is longer than his beard\n",
"[2024-11-25 19:48:09,379] INFO - ChatAPI - base.py: generate - 248: B. Muddy and murky\n",
"2024-11-25 19:48:09,379 - ChatAPI - INFO - B. Muddy and murky\n",
"[2024-11-25 19:48:15,361] INFO - ChatAPI - base.py: generate - 248: B. In the foreground, left side of the image\n",
"2024-11-25 19:48:15,361 - ChatAPI - INFO - B. In the foreground, left side of the image\n",
"[2024-11-25 19:48:15,362] INFO - ChatAPI - base.py: generate - 248: D. Window\n",
"2024-11-25 19:48:15,362 - ChatAPI - INFO - D. Window\n",
"[2024-11-25 19:48:15,362] INFO - ChatAPI - base.py: generate - 248: C. The clock tower\n",
"2024-11-25 19:48:15,362 - ChatAPI - INFO - C. The clock tower\n",
"[2024-11-25 19:48:15,363] INFO - ChatAPI - base.py: generate - 248: D. Standing\n",
"[2024-11-25 19:48:15,364] INFO - ChatAPI - base.py: generate - 248: D. Bench\n",
"2024-11-25 19:48:15,363 - ChatAPI - INFO - D. Standing\n",
"[2024-11-25 19:48:15,364] INFO - ChatAPI - base.py: generate - 248: A. A couple in the center\n",
"[2024-11-25 19:48:15,366] INFO - ChatAPI - base.py: generate - 248: C. A river\n",
"2024-11-25 19:48:15,364 - ChatAPI - INFO - D. Bench\n",
"[2024-11-25 19:48:15,367] INFO - ChatAPI - base.py: generate - 248: C. No\n",
" 10%|█ | 1/10 [00:07<01:11, 7.95s/it]2024-11-25 19:48:15,364 - ChatAPI - INFO - A. A couple in the center\n",
"2024-11-25 19:48:15,366 - ChatAPI - INFO - C. A river\n",
"2024-11-25 19:48:15,367 - ChatAPI - INFO - C. No\n",
"100%|██████████| 10/10 [00:07<00:00, 1.26it/s]\n",
"[2024-11-25 19:48:15,392] INFO - RUN - run.py: run_task - 340: {'nproc': 16, 'verbose': False, 'retry': 3, 'model': 'exact_matching'}\n",
"2024-11-25 19:48:15,392 - RUN - INFO - {'nproc': 16, 'verbose': False, 'retry': 3, 'model': 'exact_matching'}\n",
"100%|██████████| 10/10 [00:00<00:00, 8507.72it/s]\n",
"[2024-11-25 19:48:15,448] INFO - RUN - run.py: run_task - 395: The evaluation of model internvl2-8b x dataset SEEDBench_IMG has finished! \n",
"2024-11-25 19:48:15,448 - RUN - INFO - The evaluation of model internvl2-8b x dataset SEEDBench_IMG has finished! \n",
"[2024-11-25 19:48:15,449] INFO - RUN - run.py: run_task - 396: Evaluation Results:\n",
"2024-11-25 19:48:15,449 - RUN - INFO - Evaluation Results:\n",
"[2024-11-25 19:48:15,450] INFO - RUN - run.py: run_task - 402: \n",
"-------------------- ------------------\n",
"split none\n",
"Overall 0.6333333333333333\n",
"Instance Attributes 0.8571428571428571\n",
"Instance Identity 0.3333333333333333\n",
"Instance Interaction 1.0\n",
"Instance Location 0.0\n",
"Instances Counting 0.5\n",
"Scene Understanding 0.75\n",
"Visual Reasoning 1.0\n",
"-------------------- ------------------\n",
"2024-11-25 19:48:15,450 - RUN - INFO - \n",
"-------------------- ------------------\n",
"split none\n",
"Overall 0.6333333333333333\n",
"Instance Attributes 0.8571428571428571\n",
"Instance Identity 0.3333333333333333\n",
"Instance Interaction 1.0\n",
"Instance Location 0.0\n",
"Instances Counting 0.5\n",
"Scene Understanding 0.75\n",
"Visual Reasoning 1.0\n",
"-------------------- ------------------\n",
" 0%| | 0/10 [00:00<?, ?it/s][2024-11-25 19:48:16,745] INFO - ChatAPI - base.py: generate - 248: 0.56\n",
"2024-11-25 19:48:16,745 - ChatAPI - INFO - 0.56\n",
"[2024-11-25 19:48:17,034] INFO - ChatAPI - base.py: generate - 248: 80\n",
"2024-11-25 19:48:17,034 - ChatAPI - INFO - 80\n",
"[2024-11-25 19:48:18,640] INFO - ChatAPI - base.py: generate - 248: No\n",
"2024-11-25 19:48:18,640 - ChatAPI - INFO - No\n",
"[2024-11-25 19:48:18,640] INFO - ChatAPI - base.py: generate - 248: Child Labor (Boys, World, 2000-2012) (ILO)\n",
"2024-11-25 19:48:18,640 - ChatAPI - INFO - Child Labor (Boys, World, 2000-2012) (ILO)\n",
" 10%|█ | 1/10 [00:02<00:19, 2.12s/it][2024-11-25 19:48:19,011] INFO - ChatAPI - base.py: generate - 248: 60\n",
"2024-11-25 19:48:19,011 - ChatAPI - INFO - 60\n",
"[2024-11-25 19:48:21,908] INFO - ChatAPI - base.py: generate - 248: 77\n",
"2024-11-25 19:48:21,908 - ChatAPI - INFO - 77\n",
"[2024-11-25 19:48:21,909] INFO - ChatAPI - base.py: generate - 248: 29\n",
"2024-11-25 19:48:21,909 - ChatAPI - INFO - 29\n",
"[2024-11-25 19:48:21,910] INFO - ChatAPI - base.py: generate - 248: Yes\n",
"2024-11-25 19:48:21,910 - ChatAPI - INFO - Yes\n",
"[2024-11-25 19:48:21,911] INFO - ChatAPI - base.py: generate - 248: 2004\n",
"2024-11-25 19:48:21,911 - ChatAPI - INFO - 2004\n",
"[2024-11-25 19:48:21,912] INFO - ChatAPI - base.py: generate - 248: 0.6875\n",
"2024-11-25 19:48:21,912 - ChatAPI - INFO - 0.6875\n",
"100%|██████████| 10/10 [00:05<00:00, 1.86it/s]\n",
"[2024-11-25 19:48:21,931] INFO - RUN - run.py: run_task - 340: {'nproc': 16, 'verbose': False, 'retry': 3, 'model': 'exact_matching'}\n",
"2024-11-25 19:48:21,931 - RUN - INFO - {'nproc': 16, 'verbose': False, 'retry': 3, 'model': 'exact_matching'}\n",
"[2024-11-25 19:48:22,217] INFO - RUN - run.py: run_task - 395: The evaluation of model internvl2-8b x dataset ChartQA_TEST has finished! \n",
"2024-11-25 19:48:22,217 - RUN - INFO - The evaluation of model internvl2-8b x dataset ChartQA_TEST has finished! \n",
"[2024-11-25 19:48:22,219] INFO - RUN - run.py: run_task - 396: Evaluation Results:\n",
"2024-11-25 19:48:22,219 - RUN - INFO - Evaluation Results:\n",
"[2024-11-25 19:48:22,220] INFO - RUN - run.py: run_task - 402: \n",
"---------- -------\n",
"test_human 53.3333\n",
"Overall 53.3333\n",
"---------- -------\n",
"2024-11-25 19:48:22,220 - RUN - INFO - \n",
"---------- -------\n",
"test_human 53.3333\n",
"Overall 53.3333\n",
"---------- -------\n",
"2024-11-25 19:48:22,228 - evalscope - INFO - **Loading task cfg for summarizer: {'eval_backend': 'VLMEvalKit', 'eval_config': {'data': ['SEEDBench_IMG', 'ChartQA_TEST'], 'limit': 30, 'mode': 'all', 'model': [{'api_base': 'http://localhost:8000/v1/chat/completions', 'key': 'EMPTY', 'name': 'CustomAPIModel', 'temperature': 0.0, 'type': 'internvl2-8b'}], 'reuse': False, 'work_dir': 'outputs', 'judge': 'exact_matching'}}\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
">> Start to get the report with summarizer ...\n",
"\n",
">> The report list: [{'internvl2-8b_SEEDBench_IMG_acc': {'split': 'none', 'Overall': '0.6333333333333333', 'Instance Attributes': '0.8571428571428571', 'Instance Identity': '0.3333333333333333', 'Instance Interaction': '1.0', 'Instance Location': '0.0', 'Instances Counting': '0.5', 'Scene Understanding': '0.75', 'Visual Reasoning': '1.0'}}, {'internvl2-8b_ChartQA_TEST_acc': {'test_human': '53.333333333333336', 'Overall': '53.333333333333336'}}]\n"
]
}
],
"source": [
"task_cfg_dict = {\n",
" 'eval_backend': 'VLMEvalKit',\n",
" 'eval_config': {\n",
" 'data': ['SEEDBench_IMG', 'ChartQA_TEST'],\n",
" 'limit': 30,\n",
" 'mode': 'all',\n",
" 'model': [{\n",
" 'api_base': 'http://localhost:8000/v1/chat/completions',\n",
" 'key': 'EMPTY',\n",
" 'name': 'CustomAPIModel',\n",
" 'temperature': 0.0,\n",
" 'type': 'internvl2-8b'\n",
" }],\n",
" 'reuse': False,\n",
" 'work_dir': 'outputs',\n",
" 'judge': 'exact_matching'\n",
" }\n",
"}\n",
"\n",
"from evalscope.run import run_task\n",
"from evalscope.summarizer import Summarizer\n",
"\n",
"\n",
"def run_eval():\n",
" # 选项 1: python 字典\n",
" task_cfg = task_cfg_dict\n",
"\n",
" # 选项 2: yaml 配置文件\n",
" # task_cfg = 'eval_openai_api.yaml'\n",
"\n",
" run_task(task_cfg=task_cfg)\n",
"\n",
" print('>> Start to get the report with summarizer ...')\n",
" report_list = Summarizer.get_report_from_cfg(task_cfg)\n",
" print(f'\\n>> The report list: {report_list}')\n",
"\n",
"\n",
"run_eval()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. 直接使用VLMEvalKit\n",
"\n",
"直接使用VLMEvalKit需设置`VLMEVALKIT_USE_MODELSCOPE=1`来开启从modelscope下载数据集的能力,目前支持如下视频评测数据集:\n",
"- MVBench_MP4\n",
"- MLVU_OpenEnded\n",
"- MLVU_MCQ\n",
"- LongVideoBench\n",
"- TempCompass_MCQ\n",
"- TempCompass_Captioning\n",
"- TempCompass_YorN\n",
"- Video-MME\n",
"- MVBench\n",
"- MMBench-Video"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 环境准备"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"\n",
"git clone https://github.com/open-compass/VLMEvalKit.git\n",
"cd VLMEvalKit\n",
"pip install -e ."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"VLM 配置:所有 VLMs 都在 `vlmeval/config.py` 中配置。对于某些 VLMs(如 MiniGPT-4、LLaVA-v1-7B),需要额外的配置(在配置文件中配置代码 / 模型权重根目录)。在评估时,你应该使用 `vlmeval/config.py` 中 supported_VLM 指定的模型名称来选择 VLM。确保在开始评估之前,你可以成功使用 VLM 进行推理,使用以下命令 `vlmutil check {MODEL_NAME}`。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"# 执行如下bash命令开始评测:\n",
"!python run.py --data TempCompass --model InternVL2-8B"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

0 comments on commit 470318b

Please sign in to comment.