<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="description"
        content="DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph">
  <meta name="keywords" content="DARG">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph</title>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
        rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">
  <link rel="icon" href="./static/images/salt-logo.png">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/index.js"></script>

  <style>
    /* Two image containers per row (use 33.33% for three, 25% for four, etc.) */
    .imgcolumn {
      float: left;
      width: 50%;
      padding: 10px
    }

    /* Clear floats after image containers */
    .imgrow::after {
      content: "";
      clear: both;
      display: table;
    }

    table.customTable {
      width: 50%;
      background-color: #FFFFFF;
      border-collapse: collapse;
      border-width: 2px;
      border-color: rgb(214, 236, 244);
      border-style: solid;
      color: #000000;
      margin-left: auto;
      margin-right: auto;
    }
    
    table.customTable td {
      border-width: 2px;
      border-color: rgb(214, 236, 244);
      border-style: solid;
      padding: 5px;
      text-align: center; 
      vertical-align: middle;
    }

    table.customTable th {
      border-width: 2px;
      border-color: rgb(214, 236, 244);
      border-style: solid;
      padding: 5px;
    }
    
    table.customTable thead {
      background-color: rgb(214, 236, 244);
    }
    </style>
</head>
<body>
  
  
<section class="hero">
  <div class="hero-body">
    <div class="container is-max-desktop">
      <div class="columns is-centered">
        <div class="column has-text-centered">
          <h1 class="title is-1 publication-title">DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph</h1>
          <div class="is-size-5 publication-authors">
            <span class="author-block">
              <a href="https://zzh-sjtu.github.io/zhehaozhang.github.io/">Zhehao Zhang</a><sup>1</sup>,</span>
            <span class="author-block">
              <a href="https://cs.stanford.edu/people/jiaaoc/">Jiaao Chen</a><sup>2</sup>,</span>
            <span class="author-block">
              <a href="https://cs.stanford.edu/~diyiy/">Diyi Yang</a><sup>3</sup>
            </span>
          </div>

          <div class="is-size-5 publication-authors">
            <span class="author-block"><sup>1</sup>Dartmouth College,</span>
            <span class="author-block"><sup>2</sup>Georgia Tech,</span>
            <span class="author-block"><sup>3</sup>Stanford University</span>
          </div>

          <div class="is-size-5 publication-authors">
            <span class="author-block">
              <img src="./static/images/Dartmouth-College-Logo.png" width="200" align="absmiddle" />
            </span>
            <span class="author-block">
              <img src="./static/images/GeorgiaTech_RGB.png" width="200" align="absmiddle"/>  
            </span>
            <span class="author-block">
              <img src="./static/images/stanford-university-logo-2.png" style="margin-right: 50px;" width="200" align="absmiddle"/>  
            </span><!---->
          </div>

          <div class="column has-text-centered">
            <div class="publication-links">
              <!-- PDF Link. -->
              <span class="link-block">
                <a href="https://arxiv.org/abs/2406.17271"
                   class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                      <i class="fas fa-file-pdf"></i>
                  </span>
                  <span>Paper</span>
                </a>
              </span>
              <!-- Code Link. -->
              <span class="link-block">
                <a href="https://github.com/SALT-NLP/DARG"
                   class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                      <i class="fab fa-github"></i>
                  </span>
                  <span>Code</span>
                  </a>
              </span>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</section>


<section class="section">
  <div class="container is-max-desktop">
    <!-- Abstract. -->
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3">Abstract</h2>
        <div class="content has-text-justified">
          <p>
            The current paradigm of evaluating Large Language Models (LLMs) through static benchmarks comes with significant limitations, such as vulnerability to data contamination and a lack of adaptability to the evolving capabilities of LLMs. Therefore, evaluation methods that can adapt and generate evaluation data with controlled complexity are urgently needed. In this work, we introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks. We further use a code-augmented LLM to ensure the label correctness of newly generated data. We apply our DARG framework to diverse reasoning tasks in four domains with 15 state-of-the-art LLMs. Experimental results show that almost all LLMs experience a performance decrease with increased complexity and certain LLMs exhibit significant drops. Additionally, we find that LLMs exhibit more biases when being evaluated via the data generated by DARG with higher complexity levels. These observations provide useful insights into how to dynamically and adaptively evaluate LLMs.
          </p>
        </div>
      </div>
    </div>
    <!--/ Abstract. -->
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">

    <div class="columns is-centered">
      <div class="column">
        <div class="content">
          <h2 class="title is-3">Overall Framework</h2>
          <p>In our DARG framework, we first construct reasoning graphs for the data points in a given benchmark using LLMs (e.g., the computational graph for solving a math problem, shown in the following figure). Next, we apply fine-grained perturbations along various dimensions of the reasoning graph. We then convert the perturbed graph back into a textual description that preserves the linguistic diversity of the original data. To ensure that both reasoning graph construction and graph-to-text generation are correct, we use tool-augmented LLMs to verify the reasoning graphs and the generated text, so that only valid test examples are produced.</p>
          <img src="./static/images/framework.png" class="example-image" alt="Example image."/>
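          <p>The perturbation step can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a toy reasoning graph stored as a dictionary, and the names <code>evaluate</code> and <code>increase_depth</code> are hypothetical. In DARG itself, graph construction and graph-to-text generation are performed by LLMs and checked by tool-augmented LLMs; here a hand-written graph stands in for that output.</p>
          <pre><code class="language-python">
```python
import operator

# Supported operations for this toy graph (an assumption of the sketch).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(graph, node):
    """Recursively compute the value of a node in the reasoning graph."""
    spec = graph[node]
    if "value" in spec:
        return spec["value"]
    left, right = (evaluate(graph, name) for name in spec["inputs"])
    return OPS[spec["op"]](left, right)


def increase_depth(graph, answer_node, op, new_value):
    """Return a copy of the graph with one extra operation chained after the answer."""
    perturbed = dict(graph)
    perturbed["extra_input"] = {"value": new_value}
    perturbed["new_answer"] = {"op": op, "inputs": [answer_node, "extra_input"]}
    return perturbed, "new_answer"


# Hand-written graph for "3 apples at 2 dollars each" (hypothetical example).
graph = {
    "apples": {"value": 3},
    "price": {"value": 2},
    "cost": {"op": "*", "inputs": ["apples", "price"]},
}
print(evaluate(graph, "cost"))  # 6

harder, answer = increase_depth(graph, "cost", "+", 5)
print(evaluate(harder, answer))  # 11
```
</code></pre>
          <p>In this toy setting the answer is recomputed from the perturbed graph, so the label of the harder sample follows directly from its structure. Width and numerical-complexity perturbations could be sketched in the same way.</p>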
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">

    <div class="columns is-centered">
      <div class="column">
        <div class="content">
          <h2 class="title is-3">Math Reasoning</h2>
          <p>We evaluate 15 SOTA LLMs on GSM8K using DARG, with reasoning graphs of increasing width, depth, and numerical complexity. The performance of almost all LLMs drops as complexity increases, while closed-source and larger models are more resilient.</p>
          <center><img src="./static/images/gsm8k_result_figure.png" class="example-image" alt="Example image." style="width: 80%;"/></center>
          <center><img src="./static/images/gsm8k_table.png" class="example-image" alt="Example image." style="width: 60%;" /></center>
          <center><img src="./static/images/radar.png" class="example-image" alt="Example image." style="width: 80%;" /></center>
          <p>This radar chart shows how resilient different LLMs are to complexity increases, measured by the Complexity-Induced Accuracy Retention Rate (CIARR). CIARR is the average percentage of accuracy retained per complexity increment, computed as the mean ratio of the accuracy at each complexity level to the accuracy at the previous level.</p>
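          <p>The CIARR definition can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' code: the function name <code>ciarr</code> is hypothetical, the result is expressed in percent, and the accuracies in the example are made up rather than taken from the paper.</p>
          <pre><code class="language-python">
```python
def ciarr(accuracies):
    """Mean ratio of the accuracy at each level to the previous level, in percent.

    `accuracies` is ordered from the lowest to the highest complexity level.
    """
    if len(accuracies) in (0, 1):
        raise ValueError("need accuracies for at least two complexity levels")
    ratios = [later / earlier for earlier, later in zip(accuracies, accuracies[1:])]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical accuracies at three increasing complexity levels (not paper results).
print(round(ciarr([0.80, 0.60, 0.45]), 2))  # 75.0
```
</code></pre>
          <p>A value of 100 would mean that accuracy is fully retained at every step, and lower values indicate a steeper decline. The sketch does not handle a level with zero accuracy, which would make the next ratio undefined.</p>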
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">

    <div class="columns is-centered">
      <div class="column">
        <div class="content">
          <h2 class="title is-3">Social Reasoning</h2>
          <p>We evaluate SOTA LLMs on the Bias Benchmark for QA (BBQ) using DARG, with reasoning graphs that have more attribute nodes and modified attribute polarity. The metrics are accuracy, bias score, and Overall Avoidance Rate, which measures how often LLMs are overly sensitive to contexts involving protected groups and choose 'Cannot be determined.' even when clear evidence supports an answer. As complexity increases, LLMs perform worse and show increasing bias towards protected groups.</p>
          <center><img src="./static/images/bbq_cot_results.png" class="example-image" alt="Example image." style="width: 80%;" /></center>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <div class="columns is-centered">
      <div class="column is-full">
        <div class="content">
          <h2 class="title is-3">Spatial Reasoning</h2>
          <p>We evaluate SOTA LLMs on the BIG-Bench Hard (BBH) Navigate dataset, a spatial reasoning dataset in which the LLM is given
            a sequence of navigation steps and must determine whether the agent returns to the starting point. As the depth of the reasoning graph
            increases, the overall accuracy of most LLMs drops. Accuracy on positive cases (where the label is 'Yes') declines significantly,
            while accuracy on negative cases remains comparatively stable, which suggests a bias towards negative predictions.</p>
          <div class="columns">
            <div class="column is-one-third">
              <img src="./static/images/navigate_overall.png" class="example-image" alt="Overall performance image." />
            </div>
            <div class="column is-one-third">
              <img src="./static/images/navigate_negative.png" class="example-image"
                alt="Negative case performance image." />
            </div>
            <div class="column is-one-third">
              <img src="./static/images/navigate_positive.png" class="example-image"
                alt="Positive case performance image." />
            </div>
          </div>
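          <p>The ground truth for a Navigate-style sample can be computed by simulating the steps, as in the minimal Python sketch below. It assumes the instructions have already been parsed into <code>(action, amount)</code> pairs; the action names and the function <code>returns_to_start</code> are hypothetical and are not part of BBH or the DARG code.</p>
          <pre><code class="language-python">
```python
def returns_to_start(instructions):
    """Simulate (action, amount) steps on a grid and report whether the agent ends at the origin."""
    x, y = 0, 0
    dx, dy = 0, 1  # the agent starts facing "north"
    for action, amount in instructions:
        if action == "turn_left":
            dx, dy = -dy, dx
        elif action == "turn_right":
            dx, dy = dy, -dx
        elif action == "turn_around":
            dx, dy = -dx, -dy
        elif action == "forward":
            x, y = x + dx * amount, y + dy * amount
        else:
            raise ValueError(f"unknown action: {action}")
    return (x, y) == (0, 0)


# Walk 3 steps, turn around, walk 3 steps back: a positive ('Yes') case.
print(returns_to_start([("forward", 3), ("turn_around", 0), ("forward", 3)]))  # True

# Walk 2 steps, turn left, walk 2 steps: a negative ('No') case.
print(returns_to_start([("forward", 2), ("turn_left", 0), ("forward", 2)]))  # False
```
</code></pre>
          <p>In this representation, a deeper reasoning graph corresponds to a longer list of steps, so the state that must be tracked to reach the answer grows with complexity.</p>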
        </div>
      </div>
    </div>
  </div>
</section>


<section class="section">
  <div class="container is-max-desktop">

    <div class="columns is-centered">
      <div class="column">
        <div class="content">
          <h2 class="title is-3">Symbolic Reasoning</h2>
          <p>We evaluate SOTA LLMs on the BIG-Bench Hard (BBH) Dyck Language dataset, a symbolic reasoning dataset that requires the model to predict the sequence of closing parentheses for a Dyck-4 word whose last few closing parentheses are missing. As the depth of the input and output parts of the reasoning graph increases, the performance of all LLMs tends to decrease.</p>
          <img src="./static/images/BBH_dyck_results.png" class="example-image" alt="Example image."/>
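          <p>The expected answer for a Dyck-style sample can be derived with a stack, as in the minimal Python sketch below. It assumes the input is a string of space-separated bracket tokens; the names <code>PAIRS</code> and <code>closing_sequence</code> are hypothetical. The angle brackets are written as the escapes <code>\x3c</code> and <code>\x3e</code> only so that the snippet stays valid inside this HTML page.</p>
          <pre><code class="language-python">
```python
# The four bracket pairs of Dyck-4; "\x3c" and "\x3e" are the angle brackets.
PAIRS = {"(": ")", "[": "]", "{": "}", "\x3c": "\x3e"}


def closing_sequence(prefix):
    """Return the closing brackets that complete a space-separated Dyck prefix."""
    stack = []
    for token in prefix.split():
        if token in PAIRS:
            stack.append(token)  # opening bracket: remember it
        elif stack and PAIRS[stack[-1]] == token:
            stack.pop()  # closing bracket that matches the latest opener
        else:
            raise ValueError(f"unbalanced token: {token}")
    # Close the remaining openers from the innermost to the outermost.
    return " ".join(PAIRS[opener] for opener in reversed(stack))


print(closing_sequence("[ { ( ) }"))  # ]
```
</code></pre>
          <p>The number of openers left on the stack is the length of the answer, and the maximum stack height reflects how deeply the brackets are nested, which is one way to picture why deeper inputs and outputs are harder.</p>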
        </div>
      </div>
    </div>
  </div>
</section>


<section class="section">
  <div class="container is-max-desktop">

    <div class="columns is-centered">
      <div class="column">
        <div class="content">
          <h2 class="title is-3">Fine-tuning LLMs with DARG-generated data</h2>
          <p>We compare Llama2-7B and Mistral-7B fine-tuned on DARG-generated data with the same models fine-tuned on GSM8K's original training data. Both models fine-tuned on DARG-generated data outperform their counterparts fine-tuned on an equivalent amount of GSM8K's
          original training data. This demonstrates DARG's potential not only to dynamically generate new test samples but also
          to produce training data that enables LLMs to adapt to various complexity levels.</p>
          <center><img src="./static/images/finetuned_results.png" class="example-image" alt="Example image." style="width: 60%;" /></center>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section" id="BibTeX">
  <div class="container is-max-desktop content">
    <h2 class="title">BibTeX</h2>
    <pre><code>@misc{zhang2024dargdynamicevaluationlarge,
    title={DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph},
    author={Zhehao Zhang and Jiaao Chen and Diyi Yang},
    year={2024},
    eprint={2406.17271},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2406.17271}
}</code></pre>
  </div>
</section>


<section class="section" id="Acknowledgement">
  <div class="container is-max-desktop content">
    <h2 class="title">Usage and License Notices</h2>
    <p>
      The data, code and model checkpoint are intended and licensed for research use only. Please do not use them for any malicious purposes.
    </p>
    <p>
      The benchmark is built on top of the C4 dataset, under the ODC Attribution License (ODC-By). 
    </p>
    <p>
      This website is licensed under a <a rel="license"
                                          href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
      Commons Attribution-ShareAlike 4.0 International License</a>.
    </p>
    <p>
      The source code of this website is borrowed from <a
        href="https://github.com/nerfies/nerfies.github.io">Nerfies</a>.
    </p>
  </div>
</section>



</body>
</html>