<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description"
content="DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph">
<meta name="keywords" content="DARG">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph</title>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./static/images/salt-logo.png">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
<style>
/* Two image containers per row (use 33.33% for three, 25% for four, etc.) */
.imgcolumn {
float: left;
width: 50%;
padding: 10px
}
/* Clear floats after image containers */
.imgrow::after {
content: "";
clear: both;
display: table;
}
table.customTable {
width: 50%;
background-color: #FFFFFF;
border-collapse: collapse;
border-width: 2px;
border-color: rgb(214, 236, 244);
border-style: solid;
color: #000000;
margin-left: auto;
margin-right: auto;
}
table.customTable td {
border-width: 2px;
border-color: rgb(214, 236, 244);
border-style: solid;
padding: 5px;
text-align: center;
vertical-align: middle;
}
table.customTable th {
border-width: 2px;
border-color: rgb(214, 236, 244);
border-style: solid;
padding: 5px;
}
table.customTable thead {
background-color: rgb(214, 236, 244);
}
</style>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">
<a href="https://zzh-sjtu.github.io/zhehaozhang.github.io/">Zhehao Zhang</a><sup>1</sup>,</span>
<span class="author-block">
<a href="https://cs.stanford.edu/people/jiaaoc/">Jiaao Chen</a><sup>2</sup>,</span>
<span class="author-block">
<a href="https://cs.stanford.edu/~diyiy/">Diyi Yang</a><sup>3</sup>
</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block"><sup>1</sup>Dartmouth College,</span>
<span class="author-block"><sup>2</sup>Georgia Tech,</span>
<span class="author-block"><sup>3</sup>Stanford University</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block">
<img src="./static/images/Dartmouth-College-Logo.png" width="200" align="absmiddle" />
</span>
<span class="author-block">
<img src="./static/images/GeorgiaTech_RGB.png" width="200" align="absmiddle"/>
</span>
<span class="author-block">
<img src="./static/images/stanford-university-logo-2.png" style="margin-right: 50px;" width="200" align="absmiddle"/>
</span><!---->
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block">
<a href="https://arxiv.org/abs/2406.17271"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/SALT-NLP/DARG"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
The current paradigm of evaluating Large Language Models (LLMs) through static benchmarks comes with significant limitations, such as vulnerability to data contamination and a lack of adaptability to the evolving capabilities of LLMs. Therefore, evaluation methods that can adapt and generate evaluation data with controlled complexity are urgently needed. In this work, we introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks. We further use a code-augmented LLM to ensure the label correctness of newly generated data. We apply our DARG framework to diverse reasoning tasks in four domains with 15 state-of-the-art LLMs. Experimental results show that almost all LLMs experience a performance decrease with increased complexity and certain LLMs exhibit significant drops. Additionally, we find that LLMs exhibit more biases when being evaluated via the data generated by DARG with higher complexity levels. These observations provide useful insights into how to dynamically and adaptively evaluate LLMs.
</p>
</div>
</div>
</div>
<!--/ Abstract. -->
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column">
<div class="content">
<h2 class="title is-3">Overall Framework</h2>
<p>In our DARG framework, we first construct reasoning graphs for the data points in a given benchmark using LLMs (e.g., the computational reasoning graph for solving a math problem, shown in the following figure). Next, we perform fine-grained graph perturbations along various dimensions of the reasoning graph. Afterwards, we convert the perturbed reasoning graph back into a textual description that preserves the linguistic diversity of the original data. To ensure the correctness of both the reasoning graph construction and the graph-to-text generation, we use tool-augmented LLMs to verify the reasoning graphs and the generated text, producing valid test examples.</p>
<img src="./static/images/framework.png" class="example-image" alt="Overview of the DARG framework."/>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column">
<div class="content">
<h2 class="title is-3">Math Reasoning</h2>
<p>We evaluate 15 SOTA LLMs on GSM8K using DARG, with reasoning graphs of increased width, depth, and numerical complexity. The performance of almost all LLMs drops as complexity increases, while closed-source models and larger models are more resilient to these increases.</p>
<center><img src="./static/images/gsm8k_result_figure.png" class="example-image" alt="GSM8K results figure." style="width: 80%;"/></center>
<center><img src="./static/images/gsm8k_table.png" class="example-image" alt="GSM8K results table." style="width: 60%;" /></center>
<center><img src="./static/images/radar.png" class="example-image" alt="Radar chart of CIARR for different LLMs." style="width: 80%;" /></center>
<p>The radar chart above shows how resilient different LLMs are to complexity increases, measured by the Complexity-Induced Accuracy Retention Rate (CIARR). CIARR is the average percentage of accuracy retained per complexity increment, computed as the mean ratio of the accuracy at each complexity level to the accuracy at the preceding level.</p>
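<p>As a sketch of the definition above, CIARR can be written as follows. The notation is an assumption made here for illustration and does not appear elsewhere on this page: <code>A_1, ..., A_n</code> denote a model's accuracy at <code>n</code> successive complexity levels, with <code>A_1</code> measured on the original benchmark. The numbers in the worked line are hypothetical, not results from the paper.</p>
<pre><code>\[
\mathrm{CIARR} = \frac{1}{n-1} \sum_{i=1}^{n-1} \frac{A_{i+1}}{A_i} \times 100\%
\]

% Hypothetical example with accuracies A = (0.80, 0.60, 0.45):
\[
\mathrm{CIARR} = \frac{1}{2} \left( \frac{0.60}{0.80} + \frac{0.45}{0.60} \right) \times 100\% = 75\%
\]</code></pre>
<p>Under this reading, a CIARR close to 100% means accuracy is, on average, retained from one complexity level to the next, and lower values mean larger average relative drops.</p>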
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column">
<div class="content">
<h2 class="title is-3">Social Reasoning</h2>
<p>We evaluate SOTA LLMs on the Bias Benchmark for QA (BBQ) using DARG, with reasoning graphs that have an increased number of attribute nodes and modified attribute polarity. The metrics are accuracy, bias score, and Overall Avoidance Rate. The last of these measures how often LLMs are overly sensitive to contexts involving protected groups, choosing 'Cannot be determined.' even when clear evidence supports an answer. LLMs perform worse as complexity increases and show increasing biases towards protected groups.</p>
<center><img src="./static/images/bbq_cot_results.png" class="example-image" alt="Results on BBQ." style="width: 80%;" /></center>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-full">
<div class="content">
<h2 class="title is-3">Spatial Reasoning</h2>
<p>We evaluate SOTA LLMs on the BIG-Bench Hard (BBH) Navigate dataset, a spatial reasoning dataset in which the LLM is given
a sequence of navigation steps and must determine whether the agent returns to the starting point. As the depth of the reasoning graph
increases, the overall accuracy of most LLMs drops. Accuracy declines significantly on positive cases (where the label is 'Yes'),
while accuracy on negative cases remains comparatively stable, indicating biased predictions.</p>
<div class="columns">
<div class="column is-one-third">
<img src="./static/images/navigate_overall.png" class="example-image" alt="Overall performance image." />
</div>
<div class="column is-one-third">
<img src="./static/images/navigate_negative.png" class="example-image"
alt="Negative case performance image." />
</div>
<div class="column is-one-third">
<img src="./static/images/navigate_positive.png" class="example-image"
alt="Positive case performance image." />
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column">
<div class="content">
<h2 class="title is-3">Symbolic Reasoning</h2>
<p>We evaluate SOTA LLMs on the BIG-Bench Hard (BBH) Dyck Language dataset, a symbolic reasoning dataset that requires the model to predict the sequence of closing parentheses for a Dyck-4 word that is missing its last few closing parentheses. As the depth of the input and output parts of the reasoning graph increases, the performance of all LLMs tends to decrease.</p>
<img src="./static/images/BBH_dyck_results.png" class="example-image" alt="Results on BBH Dyck Language."/>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column">
<div class="content">
<h2 class="title is-3">Fine-tuning LLMs with DARG-Generated Data</h2>
<p>We compare Llama2-7B and Mistral-7B fine-tuned on DARG-generated data against the same models fine-tuned on GSM8K's original training data. Both models fine-tuned on DARG-generated data outperform their counterparts fine-tuned on an equivalent amount of GSM8K's
original training data. This demonstrates DARG's potential not only to dynamically generate new test samples but also
to produce training data that enables LLMs to adapt to various complexity levels.</p>
<center><img src="./static/images/finetuned_results.png" class="example-image" alt="Fine-tuning results." style="width: 60%;" /></center>
</div>
</div>
</div>
</section>
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
<pre><code>@misc{zhang2024dargdynamicevaluationlarge,
title={DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph},
author={Zhehao Zhang and Jiaao Chen and Diyi Yang},
year={2024},
eprint={2406.17271},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.17271},
}</code></pre>
</div>
</section>
<section class="section" id="Acknowledgement">
<div class="container is-max-desktop content">
<h2 class="title">Usage and License Notices</h2>
<p>
The data, code and model checkpoint are intended and licensed for research use only. Please do not use them for any malicious purposes.
</p>
<p>
The benchmark is built on top of the C4 dataset, under the ODC Attribution License (ODC-By).
</p>
<p>
This website is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
<p>
The source code of this website is borrowed from <a
href="https://github.com/nerfies/nerfies.github.io">Nerfies</a>.
</p>
</div>
</section>
</body>
</html>