Showing 58 changed files with 727 additions and 89 deletions.
@@ -0,0 +1,62 @@
---
author: shb
icon: book
category:
- Datasets
date: 2023-07-08
tag:
- Language Models
shortTitle: M3KE Dataset Overview
---

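# Introducing the M3KE Evaluation Dataset

M3KE is a multi-level, multi-subject knowledge evaluation dataset for large language models. It measures the knowledge that Chinese large language models have acquired, tested in zero-shot and few-shot settings.

<!-- more -->

::: tip

Project: https://github.com/tjunlp-lab/M3KE

Contributors/institutions: Tianjin University and Huawei Noah's Ark Lab

:::
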
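## 1 Dataset contents
M3KE collects 20,477 questions from standardized exams written for humans (each with 4 candidate answers), covering 71 tasks. The questions come from primary school, junior high school, senior high school, university, and graduate entrance exams, and span subjects such as humanities, history, politics, law, education, psychology, science, engineering and technology, and art.

![Figure 1.1 Task distribution in the M3KE dataset](/assets/images/eval/M3KE_1.png "Figure 1.1 Task distribution in the M3KE dataset" =430x400)
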
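## 2 Dataset strengths

(1) Aligned with the Chinese education system, covering multiple education stages

The researchers follow the educational path of Chinese students through its main stages (primary school, junior high school, senior high school, and university) in order to evaluate how Chinese large models perform at each stage. Because each stage requires different knowledge (in Chinese language, for example, the knowledge and exam points tested in primary school differ markedly from those in junior high school), M3KE includes the same subject at several education stages. To broaden the coverage of subject knowledge, the researchers selected questions from China's unified entrance exams, including real questions from the primary-to-junior-high entrance exam, the senior high school entrance exam (zhongkao), the college entrance exam (gaokao), the graduate entrance exam, and the Chinese civil service exam.

(2) Coverage of multiple subject areas

To improve subject coverage, the researchers built the dataset around three major groups: arts and humanities, social sciences, and natural sciences. These include literature, natural sciences, history, politics, law, education, psychology, science, engineering and technology, and art. To further enrich the dataset, the researchers added tasks such as traditional Chinese medicine, religion, and the computer rank examination.

![Figure 2.1 Distribution of task domains and difficulty in the M3KE dataset](/assets/images/eval/M3KE_2.png "Figure 2.1 Distribution of task domains and difficulty in the M3KE dataset")

![Figure 2.2 Comparison of M3KE with other evaluation datasets](/assets/images/eval/M3KE_3.png "Figure 2.2 Comparison of M3KE with other evaluation datasets")
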
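## 3 Evaluation results
<!-- ### 3.1 Zero-shot/Few-shot evaluation -->
In the zero-shot setting, the model is asked to answer the question directly. In the few-shot setting, the model is first given several examples from the same task to guide it through in-context learning. All M3KE questions are scored by accuracy.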
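
The two settings and the accuracy metric can be illustrated with a minimal sketch. The `Item` layout, the prompt wording, and the helper names below are assumptions made for illustration; they are not taken from the M3KE repository, whose actual prompt template may differ.

```python
from dataclasses import dataclass
from typing import Sequence

LETTERS = "ABCD"  # M3KE questions have 4 candidate answers


@dataclass
class Item:
    """Hypothetical record layout for one multiple-choice question."""
    question: str
    options: Sequence[str]  # exactly 4 option strings
    answer: str             # gold label, one of "A".."D"


def format_item(item: Item, with_answer: bool) -> str:
    """Render a question; demonstrations include the gold answer."""
    lines = [item.question]
    for letter, option in zip(LETTERS, item.options):
        lines.append(f"{letter}. {option}")
    lines.append(f"Answer: {item.answer}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target: Item, demos: Sequence[Item] = ()) -> str:
    """Zero-shot when `demos` is empty, few-shot otherwise."""
    parts = [format_item(d, with_answer=True) for d in demos]
    parts.append(format_item(target, with_answer=False))
    return "\n\n".join(parts)


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions whose predicted letter equals the gold letter."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty set of questions")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```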
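
(1) Zero-shot/few-shot results by subject category

![Evaluation results](/assets/images/eval/M3KE_4.png "Figure 3.1 Average zero-shot and few-shot accuracy of each model across four subject categories")

(2) Zero-shot/few-shot results by education stage

![Evaluation results](/assets/images/eval/M3KE_5.png "Figure 3.2 Average zero-shot and few-shot accuracy of each model across five education levels")
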
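## 4 Analysis of the results

(1) In the zero-shot evaluation (Tables 4 and 6), every pre-trained language model with fewer than 10B parameters and no fine-tuning scores below the random baseline (25%), and the few-shot setting (Tables 5 and 7) helps improve model performance. GLM-130B, however, does better in the zero-shot evaluation than in the few-shot one. A possible reason is that GLM-130B already used some instruction data during pre-training, which gives it relatively strong zero-shot ability.

(2) Most fine-tuned Chinese large models only reach the random baseline (25%), even on primary-school-level tests (Tables 6 and 7). This suggests that knowledge from the lower education stages is still a weak spot for current Chinese large models.

(3) In the zero-shot evaluation, BELLE-7B-2M achieves the best result among the Chinese large models, but it still trails GPT-3.5-turbo by 14.8%. The amount of supervised fine-tuning instruction data also matters: BELLE-7B-2M, fine-tuned on 2 million instructions, outperforms BELLE-7B-0.2M, fine-tuned on 200,000 instructions (Table 4).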
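
The 25% random baseline used in this analysis follows from each question having four candidate answers. The sketch below computes that chance level and checks it with a simulated uniform guesser; the function names and the synthetic answer key are assumptions made for illustration, not M3KE data or code.

```python
import random
from typing import Sequence


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over `num_options` choices."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


def simulate_random_guesser(answers: Sequence[str],
                            options: str = "ABCD",
                            seed: int = 0) -> float:
    """Accuracy of a guesser that picks one option uniformly per question."""
    rng = random.Random(seed)  # fixed seed keeps the simulation repeatable
    hits = sum(rng.choice(options) == gold for gold in answers)
    return hits / len(answers)


# Synthetic answer key with as many entries as M3KE has questions (20,477).
synthetic_answers = ["ABCD"[i % 4] for i in range(20477)]
simulated = simulate_random_guesser(synthetic_answers)
```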
@@ -0,0 +1,148 @@
---
author: 最后的开神-wkyc
icon: blog
date: 2023-11-07
category:
- rag
tag:
- retrieval
- rag
# sticky: 10
---

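# A Roundup of QA-Style Text Retrieval Models and Datasets
All of the test sets share the same format: a small set of queries and a large corpus. Each query is used to search the corpus, and each query has one corpus entry as its correct retrieval result.
<!-- more -->
## 1 Test sets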
(1) CmedqaRetrieval

Link: https://huggingface.co/datasets/C-MTEB/CmedqaRetrieval

1) corpus: 100k

```json
{
  "id": "48d8fad6a4196ed953efb5e71313d3e8",
  "text": "睾丸炎,这个情况吃了左氧和诺氟沙星,炎可宁片,病情有所好转,建议继续服用药物到症状消失后三天为止。这个情况在治疗时是不能吃辛辣刺激性的食物。"
}
```

2) query: 4k

```json
{
  "id": "e6a90f19acc12055fb4c76a56d8a9f7e",
  "text": "睾丸炎引起的不孕不育王医生:我是六年前因腮腺炎引起睾丸炎因为当时没有治疗好。现在睾丸还会痛,去年做过睾丸穿刺检查睾丸不产生精子。请问可以采用什么方式进行治疗?慢慢的能够恢复正常吗?希望你能给予答复,谢谢。"
}
```

(2) CovidRetrieval

Link: https://huggingface.co/datasets/C-MTEB/CovidRetrieval

1) corpus: 100k

```json
{
  "id": "73b19aa7dc6a0611ffa1a43a2633c692",
  "text": "天津再备3座“小汤山”新华社天津2月3日电(记者栗雅婷)天津市新型冠状病毒感染的肺炎疫情防控工作指挥部3日决定,除海河医院外,天津市海滨人民医院、天津医科大学总医院空港医院、天津市津南医院(新址)也将作为收治新型冠状病毒感染的肺炎患者定点医院,4家医院累计床位将达到2130张。截至2月3日10时,天津市共确诊新型冠状病毒感染的肺炎病例56例,而素有“天津小汤山”之称的海河医院共有床位600张。本着有备无患的原则,天津市防控指挥部决定再备3座“小汤山”。其中,天津医科大学总医院空港医院预留床位500张,天津市海滨人民医院预留床位500张,天津市津南医院(新址)预留床位530张。"
}
```

2) query: 946

```json
{
  "id": "20960d5509bba258d150e6042b0f4589",
  "text": "江西省复工“三同时”政策具体指什么?"
}
```

(3) DuRetrieval

Link: https://huggingface.co/datasets/C-MTEB/DuRetrieval

1) corpus: 100k

```json
{
  "id": "2c4fe63d3378ac39907b6b2648eb40c5",
  "text": "一年国家法定节假日为11天。根据公布的国家法定节假日调整方案,调整的主要内容包括:元旦放假1天不变;春节放假3天,放假时间为农历正月初一、初二、初三;“五一”国际劳动节1天不变;“十一”国庆节放假3天;清明节、端午节、中秋节增设为国家法定节假日,各放假1天(农历节日如遇闰月,以第一个月为休假日)。3、允许周末上移下错,与法定节假日形成连休。"
}
```

2) query: 2k

```json
{
  "id": "edb58f525bd14724d6f490722fa8a657",
  "text": "国家法定节假日共多少天"
}
```

(4) EcomRetrieval

Link: https://huggingface.co/datasets/C-MTEB/EcomRetrieval

1) corpus: 101k

```json
{
  "id": "1",
  "text": "红棉优级小粒老黄冰糖1.2kg大罐炖煮煲汤红烧肉酵素柠檬花茶雪梨"
}
```

2) query: 1k

```json
{
  "id": "200000",
  "text": "大落地窗"
}
```

(5) MedicalRetrieval

Link: https://huggingface.co/datasets/C-MTEB/MedicalRetrieval

1) corpus: 101k

```json
{
  "id": "30000001",
  "text": "您好:脂肪瘤属良性肿瘤但术后容易复发,患者可以采用中草药消除,而且安全,不会对身体产生任何的伤害及毒副作用,治愈的希望也是比较大的。"
}
```

2) query: 1k

```json
{
  "id": "2",
  "text": "大人手搜婴儿眼睛红了有什么影响?"
}
```

(6) MMarcoRetrieval

Link: https://huggingface.co/datasets/C-MTEB/MMarcoRetrieval

1) corpus: 107k

```json
{
  "id": "3863",
  "text": "1 成年人体内平均约有 10 品脱血液。捐赠期间大约会提供 1 品脱。 2 健康捐献者可以每 56 天捐献一次红细胞,或每 112 天捐献双倍红细胞。一个健康的捐献者可能会相隔 7 天捐献血小板,但每年最多捐献 24 次。"
}
```

2) query: 6.98k

```json
{
  "id": "1215",
  "text": "加拿大三级政府及其职责"
}
```

(7) T2Retrieval

Link: https://huggingface.co/datasets/C-MTEB/T2Retrieval

1) corpus: 119k

```json
{
  "id": "159474",
  "text": "<br><img><br>【重新获取取件码】<br>1、首先来到丰巢快递柜前,点击屏幕上的【取快递】;<br><img><br>2、然后选择取件码取件;<br><img><br>3、在输入取件码的右下方,有一个【忘记取件码】,点击;<br><img><br>4、然后输入快递使用的手机号码,点击“获取验证码”,验证码输入后,点击【下一步】;<br><img><br>5、可以看到当前柜机中存放的快递信息,点击右上角的【取件】,将快递取出即可。<br><img>"
}
```

2) query: 22.8k

```json
{
  "id": "0",
  "text": "蜂巢取快递验证码摁错怎么办"
}
```

(8) VideoRetrieval

Link: https://huggingface.co/datasets/C-MTEB/VideoRetrieval/viewer/default/queries

1) corpus: 101k

```json
{
  "id": "54312",
  "text": "小慧广场舞 欢乐的海洋 欢快喜庆的藏族舞32步 附教学"
}
```

2) query: 1k

```json
{
  "id": "440",
  "text": "女学校的男生"
}
```
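
The query/corpus format above lends itself to a recall@k check: retrieve the top k corpus entries for each query and count how often the correct entry is among them. The sketch below is a toy illustration under several assumptions: the English sample records, the `qrels` mapping from query id to the correct corpus id, and the character-bigram Jaccard scorer are all made up here. A real evaluation would replace `score` with similarity between embeddings from the model under test.

```python
from typing import Dict, List, Set

# Toy records in the same {"id", "text"} shape as the samples above.
corpus_records = [
    {"id": "d1", "text": "national statutory holidays total eleven days"},
    {"id": "d2", "text": "rock sugar for stewing soup"},
    {"id": "d3", "text": "how to retrieve a parcel pickup code"},
]
query_records = [
    {"id": "q1", "text": "how many statutory holidays"},
    {"id": "q2", "text": "parcel pickup code forgotten"},
]
# Assumed relevance labels: one correct corpus id per query id.
qrels = {"q1": "d1", "q2": "d3"}


def char_bigrams(text: str) -> Set[str]:
    """Set of overlapping two-character substrings of `text`."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def score(query: str, document: str) -> float:
    """Jaccard overlap of character bigrams; a stand-in for embedding similarity."""
    q, d = char_bigrams(query), char_bigrams(document)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


def retrieve(query: str, corpus: List[Dict[str, str]], k: int) -> List[str]:
    """Ids of the k corpus entries with the highest score for `query`."""
    ranked = sorted(corpus, key=lambda rec: score(query, rec["text"]), reverse=True)
    return [rec["id"] for rec in ranked[:k]]


def recall_at_k(queries: List[Dict[str, str]],
                corpus: List[Dict[str, str]],
                labels: Dict[str, str],
                k: int) -> float:
    """Share of queries whose correct corpus id appears in the top k results."""
    hits = sum(labels[q["id"]] in retrieve(q["text"], corpus, k) for q in queries)
    return hits / len(queries)
```

Because the source describes one correct corpus entry per query, recall@k here reduces to a per-query hit rate.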
## 2 Leaderboard of embedding models for QA retrieval
Link: https://huggingface.co/spaces/mteb/leaderboard

![Illustration](/assets/images/rag/information_retrieve.png "Figure 2.1 Leaderboard of QA information retrieval models")
@@ -1,63 +1,63 @@
---
author: shb
icon: palette
category:
- Evaluation Methods
date: 2023-07-08
tag:
- Language Models
- Evaluation
shortTitle: "M3KE: Chinese LLM Evaluation"
---

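# M3KE: A Comprehensive Evaluation of the Chinese Capabilities of Large Models

M3KE is a multi-level, multi-subject knowledge evaluation dataset for large language models. It measures the knowledge that Chinese large language models have acquired, tested in zero-shot and few-shot settings.

<!-- more -->

::: tip

Project: https://github.com/tjunlp-lab/M3KE

Contributors/institutions: Tianjin University and Huawei Noah's Ark Lab

:::
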
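## 1 Evaluation data
M3KE collects 20,477 questions from standardized exams written for humans (each with 4 candidate answers), covering 71 tasks. The questions come from primary school, junior high school, senior high school, university, and graduate entrance exams, and span subjects such as humanities, history, politics, law, education, psychology, science, engineering and technology, and art.

![Figure 1.1 Task distribution in the M3KE dataset](/assets/images/eval/M3KE_1.png "Figure 1.1 Task distribution in the M3KE dataset" =430x400)
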
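## 2 Strengths of the evaluation

(1) Aligned with the Chinese education system, covering multiple education stages

The researchers follow the educational path of Chinese students through its main stages (primary school, junior high school, senior high school, and university) in order to evaluate how Chinese large models perform at each stage. Because each stage requires different knowledge (in Chinese language, for example, the knowledge and exam points tested in primary school differ markedly from those in junior high school), M3KE includes the same subject at several education stages. To broaden the coverage of subject knowledge, the researchers selected questions from China's unified entrance exams, including real questions from the primary-to-junior-high entrance exam, the senior high school entrance exam (zhongkao), the college entrance exam (gaokao), the graduate entrance exam, and the Chinese civil service exam.

(2) Coverage of multiple subject areas

To improve subject coverage, the researchers built the dataset around three major groups: arts and humanities, social sciences, and natural sciences. These include literature, natural sciences, history, politics, law, education, psychology, science, engineering and technology, and art. To further enrich the dataset, the researchers added tasks such as traditional Chinese medicine, religion, and the computer rank examination.

![Figure 2.1 Distribution of task domains and difficulty in the M3KE dataset](/assets/images/eval/M3KE_2.png "Figure 2.1 Distribution of task domains and difficulty in the M3KE dataset")

![Figure 2.2 Comparison of M3KE with other evaluation datasets](/assets/images/eval/M3KE_3.png "Figure 2.2 Comparison of M3KE with other evaluation datasets")
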
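## 3 Evaluation results
<!-- ### 3.1 Zero-shot/Few-shot evaluation -->
In the zero-shot setting, the model is asked to answer the question directly. In the few-shot setting, the model is first given several examples from the same task to guide it through in-context learning. All M3KE questions are scored by accuracy.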
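
(1) Zero-shot/few-shot results by subject category

![Evaluation results](/assets/images/eval/M3KE_4.png "Figure 3.1 Average zero-shot and few-shot accuracy of each model across four subject categories")

(2) Zero-shot/few-shot results by education stage

![Evaluation results](/assets/images/eval/M3KE_5.png "Figure 3.2 Average zero-shot and few-shot accuracy of each model across five education levels")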
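
Breakdowns like the two above amount to grouping scored questions by a label (subject category or education stage) and computing accuracy within each group. The following is a minimal sketch; the record fields (`stage`, `prediction`, `answer`) are hypothetical, and it averages over questions, whereas the figures may average over tasks instead.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def accuracy_by_group(records: Iterable[Mapping[str, str]],
                      group_key: str) -> Dict[str, float]:
    """Per-group accuracy, averaged over the questions in each group."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        # A question counts as correct when the predicted letter matches the gold one.
        correct[group] += int(record["prediction"] == record["answer"])
    return {group: correct[group] / totals[group] for group in totals}


# Made-up scored records, for illustration only.
sample_records = [
    {"stage": "primary school", "prediction": "A", "answer": "A"},
    {"stage": "primary school", "prediction": "B", "answer": "C"},
    {"stage": "university", "prediction": "D", "answer": "D"},
]
per_stage = accuracy_by_group(sample_records, "stage")
```

Passing a different `group_key`, such as a subject-category field, gives the other breakdown from the same records.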
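
## 4 Analysis of the results

(1) In the zero-shot evaluation (Tables 4 and 6), every pre-trained language model with fewer than 10B parameters and no fine-tuning scores below the random baseline (25%), and the few-shot setting (Tables 5 and 7) helps improve model performance. GLM-130B, however, does better in the zero-shot evaluation than in the few-shot one. A possible reason is that GLM-130B already used some instruction data during pre-training, which gives it relatively strong zero-shot ability.

(2) Most fine-tuned Chinese large models only reach the random baseline (25%), even on primary-school-level tests (Tables 6 and 7). This suggests that knowledge from the lower education stages is still a weak spot for current Chinese large models.

(3) In the zero-shot evaluation, BELLE-7B-2M achieves the best result among the Chinese large models, but it still trails GPT-3.5-turbo by 14.8%. The amount of supervised fine-tuning instruction data also matters: BELLE-7B-2M, fine-tuned on 2 million instructions, outperforms BELLE-7B-0.2M, fine-tuned on 200,000 instructions (Table 4).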