{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluating Multimodal Models with VLMEvalKit\n",
"\n",
"VLMEvalKit (Python package name: vlmeval) is an open-source toolkit designed for evaluating Large Vision-Language Models (LVLMs). It supports one-command evaluation of LVLMs on a wide range of benchmarks, with no heavy data-preparation work required, which makes the evaluation process much simpler.\n",
"\n",
"This notebook demonstrates two ways to run an evaluation:\n",
"1. Through the VLMEvalKit backend wrapped by EvalScope\n",
"2. Through the VLMEvalKit interface directly"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Using the VLMEvalKit Backend Wrapped by EvalScope\n",
"\n",
"[EvalScope](https://github.com/modelscope/evalscope) is the official model evaluation and performance benchmarking framework from the ModelScope community. It ships with many common benchmarks and metrics, such as MMLU, CMMLU, C-Eval, GSM8K, ARC, HellaSwag, TruthfulQA, MATH, and HumanEval, and supports evaluating multiple model types, including LLMs, multimodal LLMs, embedding models, and reranker models. EvalScope also covers a range of evaluation scenarios, such as end-to-end RAG evaluation, arena mode, and model inference performance stress testing. Through its seamless integration with the ms-swift training framework, an evaluation can be launched with a single command, providing full-pipeline support from training to evaluation.\n",
"\n",
"User guide: [EvalScope user guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/vlmevalkit_backend.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Environment Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"!pip install evalscope[vlmeval] -U\n",
"!pip install ms-swift -U"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deploy the Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"!CUDA_VISIBLE_DEVICES=0 swift deploy --model_type internvl2-8b --port 8000"
]
},
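{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before launching the evaluation, it helps to confirm that the deployed service answers requests. The following is a minimal sanity check, assuming the server exposes an OpenAI-compatible endpoint at `http://localhost:8000/v1/chat/completions` (the same `api_base` used in the evaluation config below); the model name `internvl2-8b` is an assumption that should match the deployed model type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"# Plain-text smoke test against the OpenAI-compatible endpoint.\n",
"# 'internvl2-8b' is an assumption; it should match the deployed model type.\n",
"resp = requests.post(\n",
"    'http://localhost:8000/v1/chat/completions',\n",
"    headers={'Authorization': 'Bearer EMPTY'},\n",
"    json={\n",
"        'model': 'internvl2-8b',\n",
"        'messages': [{'role': 'user', 'content': 'Describe what you can do in one sentence.'}],\n",
"    },\n",
"    timeout=60,\n",
")\n",
"print(resp.json()['choices'][0]['message']['content'])"
]
},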
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-11-25 19:48:03,083 - evalscope - INFO - ** Args: Task config is provided with dictionary type. **\n",
"2024-11-25 19:48:03,084 - evalscope - INFO - Check VLM Evaluation Kit: Installed\n",
"2024-11-25 19:48:03,085 - evalscope - INFO - *** Run task with config: Arguments(data=['SEEDBench_IMG', 'ChartQA_TEST'], model=['internvl2-8b'], fps=-1, nframe=8, pack=False, use_subtitle=False, work_dir='outputs', mode='all', nproc=16, retry=None, judge='exact_matching', verbose=False, ignore=False, reuse=False, limit=30, config=None, OPENAI_API_KEY='EMPTY', OPENAI_API_BASE=None, LOCAL_LLM=None) \n",
"2024-11-25 19:48:03,085 - RUN - WARNING - --reuse is not set, will not reuse previous (before one day) temporary files\n",
"2024-11-25 19:48:07,410 - ChatAPI - INFO - Using API Base: http://localhost:8000/v1/chat/completions; API Key: EMPTY\n",
"2024-11-25 19:48:08,634 - ChatAPI - INFO - C. The man's hair is longer than his beard\n",
"2024-11-25 19:48:09,379 - ChatAPI - INFO - B. Muddy and murky\n",
"2024-11-25 19:48:15,361 - ChatAPI - INFO - B. In the foreground, left side of the image\n",
"2024-11-25 19:48:15,362 - ChatAPI - INFO - D. Window\n",
"2024-11-25 19:48:15,362 - ChatAPI - INFO - C. The clock tower\n",
"2024-11-25 19:48:15,363 - ChatAPI - INFO - D. Standing\n",
"2024-11-25 19:48:15,364 - ChatAPI - INFO - D. Bench\n",
"2024-11-25 19:48:15,364 - ChatAPI - INFO - A. A couple in the center\n",
"2024-11-25 19:48:15,366 - ChatAPI - INFO - C. A river\n",
"2024-11-25 19:48:15,367 - ChatAPI - INFO - C. No\n",
"100%|██████████| 10/10 [00:07<00:00, 1.26it/s]\n",
"2024-11-25 19:48:15,392 - RUN - INFO - {'nproc': 16, 'verbose': False, 'retry': 3, 'model': 'exact_matching'}\n",
"100%|██████████| 10/10 [00:00<00:00, 8507.72it/s]\n",
"2024-11-25 19:48:15,448 - RUN - INFO - The evaluation of model internvl2-8b x dataset SEEDBench_IMG has finished! \n",
"2024-11-25 19:48:15,449 - RUN - INFO - Evaluation Results:\n",
"-------------------- ------------------\n",
"split                none\n",
"Overall              0.6333333333333333\n",
"Instance Attributes  0.8571428571428571\n",
"Instance Identity    0.3333333333333333\n",
"Instance Interaction 1.0\n",
"Instance Location    0.0\n",
"Instances Counting   0.5\n",
"Scene Understanding  0.75\n",
"Visual Reasoning     1.0\n",
"-------------------- ------------------\n",
"2024-11-25 19:48:16,745 - ChatAPI - INFO - 0.56\n",
"2024-11-25 19:48:17,034 - ChatAPI - INFO - 80\n",
"2024-11-25 19:48:18,640 - ChatAPI - INFO - No\n",
"2024-11-25 19:48:18,640 - ChatAPI - INFO - Child Labor (Boys, World, 2000-2012) (ILO)\n",
"2024-11-25 19:48:19,011 - ChatAPI - INFO - 60\n",
"2024-11-25 19:48:21,908 - ChatAPI - INFO - 77\n",
"2024-11-25 19:48:21,909 - ChatAPI - INFO - 29\n",
"2024-11-25 19:48:21,910 - ChatAPI - INFO - Yes\n",
"2024-11-25 19:48:21,911 - ChatAPI - INFO - 2004\n",
"2024-11-25 19:48:21,912 - ChatAPI - INFO - 0.6875\n",
"100%|██████████| 10/10 [00:05<00:00, 1.86it/s]\n",
"2024-11-25 19:48:21,931 - RUN - INFO - {'nproc': 16, 'verbose': False, 'retry': 3, 'model': 'exact_matching'}\n",
"2024-11-25 19:48:22,217 - RUN - INFO - The evaluation of model internvl2-8b x dataset ChartQA_TEST has finished! \n",
"2024-11-25 19:48:22,219 - RUN - INFO - Evaluation Results:\n",
"---------- -------\n",
"test_human 53.3333\n",
"Overall    53.3333\n",
"---------- -------\n",
"2024-11-25 19:48:22,228 - evalscope - INFO - **Loading task cfg for summarizer: {'eval_backend': 'VLMEvalKit', 'eval_config': {'data': ['SEEDBench_IMG', 'ChartQA_TEST'], 'limit': 30, 'mode': 'all', 'model': [{'api_base': 'http://localhost:8000/v1/chat/completions', 'key': 'EMPTY', 'name': 'CustomAPIModel', 'temperature': 0.0, 'type': 'internvl2-8b'}], 'reuse': False, 'work_dir': 'outputs', 'judge': 'exact_matching'}}\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
">> Start to get the report with summarizer ...\n",
"\n",
">> The report list: [{'internvl2-8b_SEEDBench_IMG_acc': {'split': 'none', 'Overall': '0.6333333333333333', 'Instance Attributes': '0.8571428571428571', 'Instance Identity': '0.3333333333333333', 'Instance Interaction': '1.0', 'Instance Location': '0.0', 'Instances Counting': '0.5', 'Scene Understanding': '0.75', 'Visual Reasoning': '1.0'}}, {'internvl2-8b_ChartQA_TEST_acc': {'test_human': '53.333333333333336', 'Overall': '53.333333333333336'}}]\n"
]
}
],
"source": [ | ||
"task_cfg_dict = {\n", | ||
" 'eval_backend': 'VLMEvalKit',\n", | ||
" 'eval_config': {\n", | ||
" 'data': ['SEEDBench_IMG', 'ChartQA_TEST'],\n", | ||
" 'limit': 30,\n", | ||
" 'mode': 'all',\n", | ||
" 'model': [{\n", | ||
" 'api_base': 'http://localhost:8000/v1/chat/completions',\n", | ||
" 'key': 'EMPTY',\n", | ||
" 'name': 'CustomAPIModel',\n", | ||
" 'temperature': 0.0,\n", | ||
" 'type': 'internvl2-8b'\n", | ||
" }],\n", | ||
" 'reuse': False,\n", | ||
" 'work_dir': 'outputs',\n", | ||
" 'judge': 'exact_matching'\n", | ||
" }\n", | ||
"}\n", | ||
"\n", | ||
"from evalscope.run import run_task\n", | ||
"from evalscope.summarizer import Summarizer\n", | ||
"\n", | ||
"\n", | ||
"def run_eval():\n", | ||
" # 选项 1: python 字典\n", | ||
" task_cfg = task_cfg_dict\n", | ||
"\n", | ||
" # 选项 2: yaml 配置文件\n", | ||
" # task_cfg = 'eval_openai_api.yaml'\n", | ||
"\n", | ||
" run_task(task_cfg=task_cfg)\n", | ||
"\n", | ||
" print('>> Start to get the report with summarizer ...')\n", | ||
" report_list = Summarizer.get_report_from_cfg(task_cfg)\n", | ||
" print(f'\\n>> The report list: {report_list}')\n", | ||
"\n", | ||
"\n", | ||
"run_eval()" | ||
] | ||
}, | ||
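{
"cell_type": "markdown",
"metadata": {},
"source": [
"For option 2 above, the same settings can live in a YAML file. The sketch below writes a hypothetical `eval_openai_api.yaml` that mirrors `task_cfg_dict` one-to-one; the filename is only an assumption taken from the commented-out line in the cell above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile eval_openai_api.yaml\n",
"# Hypothetical YAML equivalent of task_cfg_dict above.\n",
"eval_backend: VLMEvalKit\n",
"eval_config:\n",
"  data:\n",
"    - SEEDBench_IMG\n",
"    - ChartQA_TEST\n",
"  limit: 30\n",
"  mode: all\n",
"  model:\n",
"    - api_base: http://localhost:8000/v1/chat/completions\n",
"      key: EMPTY\n",
"      name: CustomAPIModel\n",
"      temperature: 0.0\n",
"      type: internvl2-8b\n",
"  reuse: false\n",
"  work_dir: outputs\n",
"  judge: exact_matching"
]
},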
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Using VLMEvalKit Directly\n",
"\n",
"When using VLMEvalKit directly, set `VLMEVALKIT_USE_MODELSCOPE=1` to enable downloading datasets from ModelScope (see the sketch after this list). The following video evaluation datasets are currently supported:\n",
"- MVBench_MP4\n",
"- MLVU_OpenEnded\n",
"- MLVU_MCQ\n",
"- LongVideoBench\n",
"- TempCompass_MCQ\n",
"- TempCompass_Captioning\n",
"- TempCompass_YorN\n",
"- Video-MME\n",
"- MVBench\n",
"- MMBench-Video"
]
},
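{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal way to set the flag from inside the notebook, so that every VLMEvalKit command launched afterwards inherits it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Enable dataset downloads from ModelScope for subsequent VLMEvalKit runs.\n",
"os.environ['VLMEVALKIT_USE_MODELSCOPE'] = '1'"
]
},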
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Environment Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"!git clone https://github.com/open-compass/VLMEvalKit.git\n",
"%cd VLMEvalKit\n",
"!pip install -e ."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"VLM configuration: all VLMs are configured in `vlmeval/config.py`. Some VLMs (e.g., MiniGPT-4, LLaVA-v1-7B) require additional setup, such as pointing the config file at the code and model-weight root directories. When evaluating, select a VLM using one of the model names listed in `supported_VLM` in `vlmeval/config.py`. Before starting an evaluation, make sure you can successfully run inference with the VLM, using the command `vlmutil check {MODEL_NAME}`."
]
},
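{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, to verify the model used below before launching the full run (`InternVL2-8B` matches the name passed to `run.py` in the next cell):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity-check that inference works for the chosen VLM before evaluating.\n",
"!vlmutil check InternVL2-8B"
]
},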
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "shellscript"
}
},
"outputs": [],
"source": [
"# Run the following command to start the evaluation\n",
"# (VLMEVALKIT_USE_MODELSCOPE=1 enables dataset download from ModelScope, as noted above):\n",
"!VLMEVALKIT_USE_MODELSCOPE=1 python run.py --data TempCompass --model InternVL2-8B"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 2
}