<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Open-LLM-Leaderboard</title>
<link rel="stylesheet" href="index.css">
<script src="scripts.js"></script>
<link rel="stylesheet" href="tabels.css">
</head>
<body>
<header>
<h1>Open-LLM-Leaderboard:</h1>
<h1 class="h1-2">From Multi-choice to Openstyle Questions for LLMs Evaluation, Benchmark, and Arena</h1>
<span style="color:#924684; font-size: 13pt; font-family: Roboto, Helvetica, Arial, Heveltica Neue, sans-serif">
<span class="author-block">
<a style="text-decoration: none" target="_blank" href="https://github.com/aidarmyrzakhan">Aidar Myrzakhan</a><sup>*</sup>,
</span>
<span class="author-block">
<a style="text-decoration: none" target="_blank" href="https://www.linkedin.com/in/sondos-mahmoud-bsharat-212303203/"> Sondos Mahmoud Bsharat</a><sup>*</sup>,
</span>
<span class="author-block">
<a style="text-decoration: none" target="_blank" href="https://zhiqiangshen.com/">Zhiqiang Shen</a><sup>*</sup>
</span>
<br>
<span class="author-block"><p class="contribution"><sup>*</sup>joint first author & equal contribution</p></span>
<img src="images/vilab.PNG" width="19" height="15" class="center">
<span class="author-block">
<a style="text-decoration: none;color:#924684 " target="_blank" href="https://github.com/VILA-Lab"><b>VILA Lab</b></a>
</span>,
<img src="images/mbz.PNG" width="20" height="15" class="center">
<span class="author-block"><a style="text-decoration: none;color:#924684 " target="_blank" href="https://mbzuai.ac.ae/"><b>Mohamed bin Zayed University of AI (MBZUAI)</b></a></span>
</span>
<div class="second">
<nav>
<ul class="first">
<li><a href="https://arxiv.org/pdf/2406.07545" class="nav-link"><img src="images/arxiv-icon-removebg-preview.png" alt="Paper Icon">Paper</a></li>
<li><a href="https://github.com/VILA-Lab/Open-LLM-Leaderboard" class="nav-link"><img src="images/github-logo_icon-icons.com_73546.png">Github</a></li>
<li><a href="https://huggingface.co/spaces/Open-Style/OSQ-Leaderboard" class="nav-link"><img src="images/hf-logo.png" alt="Paper Icon">Hugging Face</a></li>
</ul>
</nav>
</div>
<div class="main">
<nav class="main-nav">
<ul class="second">
<li><a href="index.html" class="nav-link2" >Home</a></li>
<li><a href="leaderboard.html" class="nav-link2">Open-LLM-Leaderboard</a></li>
<li><a href="Benchmark.html" class="nav-link2 current">OSQ-Benchmark</a></li>
</ul>
</nav>
</div>
</header>
<div class="key-findings Introduction">
<h3 class="widget-title">
<span class="section">
<a style="text-decoration: none" target="_blank" href="#home1">
<b>
<em>
Benchmark
</em>
</b>
</a>
</span>
<img src="images/data_distribution.png" width="60%" height="80%" class="m">
</h3>
<p>
<em>
<b style="color: #581e4d;">The Open-style Question Benchmark (OSQ-bench)</b> is at the forefront of refining how large language
models (LLMs) are evaluated. Moving away from traditional multiple-choice questions (MCQs),
OSQ-bench introduces open-style questions to eliminate common biases and enhance the assessment's
accuracy. This section details the benchmark's design, showcasing the extensive range and
quality of questions it includes and explaining the substantial benefits this format offers.
Discover how OSQ-bench is setting new standards in measuring LLMs' true comprehension and
reasoning abilities, making it a cornerstone for future advancements in AI evaluation.
</em>
</p>
<br><br><br><br><br><br>
</div> <!-- closes the "key-findings Introduction" section -->
<!-- Repeat the structure for other findings sections -->
<div class="key-findings">
<h2>Statistics and Distributions</h2>
<ul>
<li><b>Total Questions Evaluated:</b> 42,075 questions from 9 different datasets (per-dataset statistics in Table 2 below; a small reproduction sketch follows the table).</li>
<li><b>Questions Suitable for Open-style:</b> Over 24,000 questions.</li>
<li><b>Domains Covered:</b> A wide range of fields, including humanities, social sciences, STEM, and literature comprehension, giving the benchmark comprehensive coverage.</li>
</ul>
<div class="Intro-box">
<div class="table-container">
<table border="1">
<caption>Table 2: Statistics on open-style questions across different datasets.</caption>
<thead>
<tr>
<th>Benchmarks</th>
<th>#Evaluated</th>
<th>#Open-Style</th>
<th>Average Question Length</th>
</tr>
</thead>
<tbody>
<tr>
<td>MMLU</td>
<td>14,042</td>
<td>7,879</td>
<td>36.6</td>
</tr>
<tr>
<td>ARC</td>
<td>3,548</td>
<td>3,241</td>
<td>21.1</td>
</tr>
<tr>
<td>MedMCQA</td>
<td>4,183</td>
<td>2,336</td>
<td>14.1</td>
</tr>
<tr>
<td>Race</td>
<td>4,934</td>
<td>3,528</td>
<td>10.0</td>
</tr>
<tr>
<td>OpenbookQA</td>
<td>1,000</td>
<td>494</td>
<td>10.3</td>
</tr>
<tr>
<td>WinoGrande</td>
<td>1,267</td>
<td>1,267</td>
<td>19.1</td>
</tr>
<tr>
<td>HellaSwag</td>
<td>10,042</td>
<td>3,945</td>
<td>40.1</td>
</tr>
<tr>
<td>PiQA</td>
<td>1,838</td>
<td>700</td>
<td>7.1</td>
</tr>
<tr>
<td>Overall</td>
<td>42,075</td>
<td>24,104</td>
<td>19.05</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
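<div class="key-findings">
<p><em>For illustration, a minimal sketch of how the per-dataset counts and average lengths in Table 2 could be recomputed from a local copy of the benchmark. The file name <code>osq_bench.jsonl</code> and the field names <code>dataset</code> and <code>question</code> are hypothetical assumptions, not the benchmark's actual schema.</em></p>
<pre><code>
# Illustrative only: per-dataset counts and average question length,
# measured here in words (an assumption about the table's unit).
# Assumes a hypothetical JSONL export with one record per question, e.g.
# {"dataset": "MMLU", "question": "..."}  -- field names are assumptions.
import json
from collections import defaultdict

counts = defaultdict(int)
total_words = defaultdict(int)

with open("osq_bench.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        name = record["dataset"]
        counts[name] += 1
        total_words[name] += len(record["question"].split())

for name in sorted(counts):
    avg_len = total_words[name] / counts[name]
    print(f"{name}: {counts[name]} open-style questions, average length {avg_len:.1f} words")
</code></pre>
</div>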
<div class="key-findings">
<h2>Diversity</h2>
<ul>
<li><b>Question Types:</b> Assesses LLMs on a broad range of question types drawn from multiple source datasets, reflecting diverse intellectual and practical domains.</li>
<li><b>Interdisciplinary Nature:</b> Questions span natural sciences, technology, and the humanities, giving the benchmark an interdisciplinary character.</li>
</ul>
</div>
<div class="key-findings">
<h2>Quality</h2>
<ul>
<li><b>Source of Questions:</b> Derived from recognized and reputable datasets, ensuring quality and relevance.</li>
<li><b>Filtering Accuracy:</b> Low false positive rate (~5%), demonstrating the effectiveness of the filtering process in selecting questions suitable for open-style answering (an illustrative sketch of such a suitability check follows this list).</li>
</ul>
</div>
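<div class="key-findings">
<p><em>For illustration, a minimal sketch of what an automatic suitability check behind the filtering step could look like: a judge model is asked whether a multiple-choice question can be answered without seeing its answer choices. The prompt wording and model name are assumptions, not the benchmark's actual filtering pipeline.</em></p>
<pre><code>
# Illustrative sketch only: decide whether an MCQ can stand alone as an open-style question.
# Prompt wording and model choice are assumptions; requires the `openai` package and an API key.
from openai import OpenAI

client = OpenAI()

def is_open_style_suitable(question: str) -> bool:
    """Ask a judge model whether the question is answerable without its answer choices."""
    prompt = (
        f"Question: {question}\n"
        "Can this question be answered definitively without seeing any answer choices? "
        "Reply with yes or no."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
</code></pre>
</div>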
<div class="key-findings">
<h2>Property and Advantage</h2>
<ul>
<li><b>Debiased Evaluation:</b> Focuses on providing a debiased approach by shifting away from MCQs, which are susceptible to selection bias and random guessing.</li>
<li><b>Efficiency and Cost-effectiveness:</b> Offers a quicker and less expensive evaluation method through automated systems, reducing reliance on extensive human assessment (a minimal sketch of such an automated grading step follows this list).</li>
<li><b>Real-time Performance Tracking:</b> The benchmark includes the Open-LLM-Leaderboard, enabling ongoing monitoring and comparison of LLM performance across different metrics and conditions.</li>
</ul>
</div>
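<div class="key-findings">
<p><em>The automated evaluation referenced above can be pictured with a similarly small sketch: a judge model compares each free-form answer against the reference, and the verdicts are averaged into an accuracy score. The prompt wording, model name, and record fields below are assumptions for illustration, not the leaderboard's actual grader.</em></p>
<pre><code>
# Illustrative sketch only: grade free-form answers with an LLM judge and report accuracy.
# Requires the `openai` package and an API key; prompt wording and field names are assumptions.
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, answer: str) -> bool:
    """Return True if the judge model deems the answer equivalent to the reference."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Does the model answer convey the same meaning as the reference? Reply with yes or no."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

def accuracy(records):
    """records: list of {"question", "reference", "answer"} dicts (hypothetical schema)."""
    correct = sum(judge(r["question"], r["reference"], r["answer"]) for r in records)
    return correct / len(records)
</code></pre>
</div>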
<footer style="background-color: #e0d9d9; text-align: center; padding: 20px; font-size: 14px; color: #666;">
<p>© 2024 by Aidar Myrzakhan, Sondos Mahmoud Bsharat, Zhiqiang Shen. All rights reserved.</p>
<p>Disclaimer: The information provided on this website is for educational and research purposes only.</p>
<p>
For more information, visit our
<a href="https://github.com/VILA-Lab/Open-LLM-Leaderboard">GitHub</a> or
<a href="https://huggingface.co/spaces/Open-Style/OSQ-Leaderboard">Hugging Face</a>.
</p>
</footer>
</body>
</html>