<!doctype html>
<html>
<head>
  <link rel="preconnect" href="https://fonts.googleapis.com" />
  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
  <link href="https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@100;400&display=swap" rel="stylesheet" />
  <meta charset="UTF-8" />
  <title>
    EvalPerf: Evaluating Language Models for Efficient Code Generation
  </title>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/PapaParse/5.3.0/papaparse.min.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/echarts@5.3.3/dist/echarts.min.js"></script>
  <link rel="icon" href="https://images.emojiterra.com/google/noto-emoji/unicode-15/color/1024px/1f9d1-1f4bb.png" />
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@5.0.0/dist/css/bootstrap.min.css" />

  <link href="https://cdn.jsdelivr.net/npm/prismjs@v1.x/themes/prism.css" rel="stylesheet" />
  <script src="https://cdn.jsdelivr.net/npm/prismjs@v1.x/components/prism-core.min.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/prismjs@v1.x/plugins/autoloader/prism-autoloader.min.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/prismjs-bibtex@2.1.0/prism-bibtex.min.js"></script>

  <style>
    body {
      font-family: "JetBrains Mono", monospace;
      background-color: #ffffff;
      color: #000000;
    }

    th,
    td {
      text-align: center;
      width: fit-content;
      font-size: larger;
    }

    .form-check {
      padding-right: 0.5rem;
      margin-bottom: 0.25rem;
    }

    .form-check-label {
      white-space: nowrap;
      overflow: hidden;
      text-overflow: ellipsis;
      font-size: 0.875rem;
    }

    #xAxisSelectors,
    #yAxisSelectors {
      min-width: min-content;
    }
  </style>
</head>

<body>
  <div id="content" class="container d-flex flex-column align-items-center gap-3">
    <h1 class="text-nowrap mt-5" style="font-size: xx-large;">
      <b>Evaluating LLMs for Efficient Code Generation</b>
    </h1>
    <div class="d-flex flex-row justify-content-center gap-3">
      <a href="https://openreview.net/forum?id=IBCBMeAhmC"><img
          src="https://img.shields.io/badge/Paper-COLM'24-a55fed.svg?style=for-the-badge"></a>
      <a href="https://github.com/evalplus/evalplus"><img
          src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white"
          alt="github" class="img-fluid" /></a>
      <a href="https://pypi.org/project/evalplus"><img alt="PyPI - Version"
          src="https://img.shields.io/pypi/v/evalplus?style=for-the-badge&labelColor=black" class="img-fluid" />
      </a>
    </div>
    <div class="container d-flex flex-row flex-nowrap fs-5">
      <div class="container d-flex flex-column align-items-center">
        <div>
          <p>🚀 LLM-oriented code efficiency evaluation requires:</p>
          <ul>
            <li><strong>Performance-exercising tasks &amp; inputs --</strong> "all complexities are equal when N is small"
            </li>
            <li><strong>Meaningful compound metric --</strong> average speedup is ill-suited to multi-task evaluation
            </li>
          </ul>
          <p>🛍️ Based on <a href="https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf">our methodology</a>,
            the
            EvalPerf dataset (current version 20240328) includes:</p>
          <ul>
            <li>118 performance-exercising tasks</li>
            <li>Each task is equipped with a <i>computationally challenging test input</i> generated by the SaS
              generator</li>
            <li>Differential Performance Score (DPS): <i>"DPS=80"</i> means <i>"submissions can outperform 80% of LLM
                solutions"</i></li>
          </ul>

          <p>🦾 The reliability of EvalPerf comes from:</p>
          <ul>
            <li><b>Correctness ablation:</b> Pairwise comparison of LLMs' code efficiency over common passing tasks</li>
            <li><b>Anti-flakiness:</b> (1) long computation -> low runtime variation (Paper Fig. 6); (2) #instructions
              as the primitive metric; and (3) DPS compares the given solution with reference solutions on the same test
              bed. Together, these lead to low cross-platform variation (Paper Tab. 2)
            </li>
          </ul>

          Check out our <a href="https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf">COLM'24 poster</a> and
          the <a href="https://github.com/evalplus/evalplus/blob/master/docs/evalperf.md">latest experimental
            configurations</a> for more details!
        </div>

        <div class="col-md-12 overflow-auto">
          <pre style="padding-top: 0; padding-bottom: 0;">
          <code class="language-bash">
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm</code>
          </pre>
        </div>

        <div>
          <b>Recommended comparison format:</b>
          <ul>
            <li>📊 <a href="#leaderboard">Win-rate ranking</a> -- <i>Each race round compares two models' DPS
                on their common passing tasks</i>
            </li>
            <li>🔥 <a href="#heatmapChart">Pairwise DPS in a Heatmap</a> -- <i>DPS computed for the two compared
                models on their common passing tasks</i></li>
          </ul>
        </div>
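        <div>
          <p>As a rough illustration of a single race round, the sketch below compares two models only on the
            tasks that both solved. The task names, the scores, and the helpers <code>mean</code> and
            <code>raceRound</code> are hypothetical, and averaging per-task scores is an assumption made for
            illustration; it is not necessarily how the EvalPerf scripts aggregate results.</p>
          <div class="col overflow-auto">
            <pre><code class="language-javascript">
// Hypothetical per-task DPS for two models; a missing task means the model did not pass it.
const dpsA = new Map([["task1", 90], ["task2", 70], ["task3", 50]]);
const dpsB = new Map([["task2", 80], ["task3", 30], ["task4", 95]]);

// Arithmetic mean of a non-empty array of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Compare two models only on the tasks that both of them passed.
function raceRound(a, b) {
  const common = [...a.keys()].filter((task) => b.has(task));
  if (common.length === 0) return null; // nothing to compare
  const meanA = mean(common.map((task) => a.get(task)));
  const meanB = mean(common.map((task) => b.get(task)));
  return { common, meanA, meanB };
}

// meanA is 60 and meanB is 55 over the two common tasks (task2, task3)
console.log(raceRound(dpsA, dpsB));
</code></pre>
          </div>
          <p>Restricting the comparison to common passing tasks keeps a model from being rewarded or penalized
            for tasks that its opponent never solved.</p>
        </div>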

        <div class="container d-flex flex-column align-items-center gap-3 mt-5">
          <h3>Win-rate Leaderboard</h3>
          <p class="align-self-start">📊 Ranking metrics: WR (Win-Rate; %) based on task- and model-wise competition
            (i.e., pairwise DPS).</p>
          <p class="align-self-start">📝 Notes: the default prompt does not emphasize efficiency requirements, as our
            work shows that such emphasis
            can degrade both efficiency and correctness for some weaker models. Models marked "(⏩)" use
            performance-encouraging prompts, since they may be able to understand such requirements accurately.</p>
          <div class="align-self-start d-inline-flex gap-3">
            <p>📐 Show more metrics: </p>
            <div class="form-check">
              <input class="form-check-input" type="checkbox" id="passAt1Checkbox">
              <label class="form-check-label fs-5" for="passAt1Checkbox">pass@1</label>
            </div>
            <div class="form-check">
              <input class="form-check-input" type="checkbox" id="dpsCheckbox">
              <label class="form-check-label fs-5" for="dpsCheckbox">DPS</label>
            </div>
          </div>
          <table id="leaderboard"
            class="table table-responsive table-striped table-bordered flex-shrink-1 border border-5">
          </table>
          <p class="align-self-start">🏪 Detailed model generation data and results are available in our <a
              href="https://github.com/evalplus/evalplus.github.io/tree/main/results/evalperf">page repository</a>.</p>
          <p class="align-self-start">💸 We use 50 samples (half) for the o1 model series to save cost; it is also
            easy to obtain the desired
            number of correct samples from strong models in fewer tries.</p>

          <br>
          <h3>Heatmap of Pairwise DPS Comparison</h3>
          <p>What's DPS? The Differential Performance Score (DPS) is a LeetCode-inspired metric that reports the overall
            code efficiency ranking percentile (0-100%) relative to LLM-generated code. For example, "DPS=80" means the
            LLM's "submissions can outperform/match 80% of LLM solutions."</p>
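          <p class="align-self-start">To make the percentile reading concrete, here is a minimal sketch that scores
            one submission against a pool of reference solutions. It is a simplification for illustration only: the
            helper <code>percentileScore</code> and the cost values are hypothetical, and the actual EvalPerf
            computation may differ (see the paper).</p>
          <div class="col overflow-auto align-self-start">
            <pre><code class="language-javascript">
// Hypothetical helper: share (0-100) of reference solutions that a submission
// matches or beats, where a lower cost (e.g., #instructions) is better.
function percentileScore(submissionCost, referenceCosts) {
  if (referenceCosts.length === 0) {
    throw new Error("need at least one reference solution");
  }
  // A reference is matched or beaten when it costs at least as much as the submission.
  const matchedOrBeaten = referenceCosts.filter((cost) => cost >= submissionCost).length;
  return (100 * matchedOrBeaten) / referenceCosts.length;
}

const referenceCosts = [120, 150, 200, 400, 900]; // made-up instruction counts
console.log(percentileScore(180, referenceCosts)); // 60
</code></pre>
          </div>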
          <div class="row w-100">
            <div class="col-12">
              <div class="card">
                <div class="card-header text-center">
                  <h5 class="mb-0">Model Selection</h5>
                </div>
                <div class="card-body p-0">
                  <div class="overflow-auto" style="max-height: 300px;">
                    <div class="row mx-0">
                      <div class="col-6 border-end">
                        <div class="p-2">
                          <label class="form-label">Examinee (Left)</label>
                          <div id="yAxisSelectors" class="d-flex flex-column gap-2">
                            <!-- Checkboxes will be inserted here by JavaScript -->
                          </div>
                        </div>
                      </div>
                      <div class="col-6">
                        <div class="p-2">
                          <label class="form-label">Reference (Bottom)</label>
                          <div id="xAxisSelectors" class="d-flex flex-column gap-2">
                            <!-- Checkboxes will be inserted here by JavaScript -->
                          </div>
                        </div>
                      </div>
                    </div>
                  </div>
                </div>
              </div>
            </div>
            <div class="col-12">
              <center>
                💡 Tip: hover over the heatmap to see the detailed DPS of the two compared models.
              </center>
              <div id="heatmapChart" style="width: 100%; height: 500px;"></div>
            </div>
          </div>

          <h3>Adding and visualizing new model results?</h3>
          <div class="col overflow-auto">
            <pre><code class="language-bash">
git clone git@github.com:evalplus/evalplus.github.io.git
cd evalplus.github.io && git pull
cp ${PATH_TO}/${MODEL}_temp_1.0_evalperf_results.brief.json results/evalperf
python results/evalperf/stats.py && python -m http.server 8000
# Open the displayed address in your browser
</code>
</pre>
          </div>

          <h2 id="citation" class="text-nowrap mt-5">🖊️ Citation</h2>
          <div class="col-md-12 overflow-auto">
            <pre style="padding-top: 0; padding-bottom: 0;">
          <code class="language-bibtex">
@inproceedings{evalperf,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}</code>
          </pre>
          </div>
          <h2 id="sponsor" class="text-nowrap mt-5">🤗 Acknowledgment</h2>
          <p>
            We thank
            <a href="https://openai.com/form/researcher-access-program/">OpenAI Researcher Access Program</a> for
            providing part of the compute.
          </p>
        </div>
      </div>
    </div>

    <script>
      const metricTable = document.getElementById("leaderboard");
      const linkMapping = new Map([]);
      const hfLinkPrefix = "https://huggingface.co/";
      const dataUrlPrefix = "results/evalperf";
      const passCheckBox = document.getElementById("passAt1Checkbox");
      const dpsCheckBox = document.getElementById("dpsCheckbox");

      // Load data
      var data = null;
      var heatmapTable = null;
      var dataUrl = dataUrlPrefix + "/COMBINED-RESULTS.json";
      var xhr = new XMLHttpRequest();
      xhr.open("GET", dataUrl, false); // false makes the request synchronous
      xhr.send();

      if (xhr.status === 200) {
        var results = JSON.parse(xhr.responseText);
        data = new Map(Object.entries(results));
        // convert each value to Map
        data.forEach((value, modelId) => {
          data.set(modelId, new Map(Object.entries(value)));
        });
        data.forEach((value, modelId) => {
          // add link to model
          if (modelId.includes("--")) {
            // IDs look like "org--name"; link to the Hugging Face model page
            const [modelOrg, modelName] = modelId.split("--");
            linkMapping.set(modelName, hfLinkPrefix + modelOrg + "/" + modelName);
          } else if (modelId.startsWith("o1-") || modelId.startsWith("gpt-")) {
            linkMapping.set(
              modelId,
              "https://platform.openai.com/docs/models",
            );
          } else if (modelId.startsWith("claude-3-")) {
            linkMapping.set(
              modelId,
              "https://www.anthropic.com/news/claude-3-family",
            );
          } else if (modelId.startsWith("gemini-1.5-pro")) {
            linkMapping.set(
              modelId,
              "https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#sundar-note",
            );
          } else if (modelId.startsWith("gemini-1.5-flash")) {
            linkMapping.set(
              modelId,
              "https://deepmind.google/technologies/gemini/flash/",
            );
          } else if (modelId.startsWith("deepseek-chat")) {
            linkMapping.set(modelId, "https://chat.deepseek.com/")
          } else if (modelId == "heatmap_data") {
            heatmapTable = value;
            data.delete(modelId);
          }
        });
      } else {
        alert(
          "Failed to load data from " + dataUrl + ". Please try again later.",
        );
      }
      const globalData = data;
      const HeatmapTable = heatmapTable;
      const winrate_tag = "🏆 Model WR";

      // each row represents a model
      const theaders = [
        "#", // rank
        "Model", // model name
        "DPS",
        "pass@1",
        "Task WR", // task winrate
        winrate_tag, // computed over the same set of passing solutions
      ];

      const displayTable = (table) => {
        var thead = document.createElement("thead");
        var headerRow = document.createElement("tr");
        // headers
        theaders.forEach(function (header) {
          if (header == "DPS" && !dpsCheckBox.checked) {
            return;
          }
          if (header == "pass@1" && !passCheckBox.checked) {
            return;
          }
          var th = document.createElement("th");
          th.classList.add("text-nowrap");
          if (header == "Model") {
            th.style.textAlign = "left";
          }
          th.textContent = header;

          if (header == winrate_tag) {
            th.style.backgroundColor = "#EEFFEE";
          }

          headerRow.appendChild(th);
        });
        thead.appendChild(headerRow);
        table.appendChild(thead);

        // convert data to array of Map
        data = Array.from(globalData);
        data = data.map(
          ([modelId, value]) => new Map([["modelId", modelId], ...value]),
        )
        data.sort((a, b) => b.get("model_win_rate") - a.get("model_win_rate"));

        var tbody = document.createElement("tbody");
        // add rank
        var rank = 0;
        var last_best = null;
        var n_last_best = 1;
        data.forEach((row) => {
          var dataRow = document.createElement("tr");
          // rank
          var rankCell = document.createElement("td");
          dataRow.appendChild(rankCell);
          var modelCell = document.createElement("td");
          var modelLink = document.createElement("a");
          var modelId = row.get('modelId');
          var modelName = modelId;
          if (modelId.includes("--")) {
            modelName = modelId.split("--")[1];
          }
          var cur_model_wr = row.get('model_win_rate').toFixed(3);
          if (last_best != cur_model_wr) {
            rank += n_last_best;
            last_best = cur_model_wr;
            rankCell.textContent = rank;
            n_last_best = 1;
          } else {
            n_last_best += 1;
          }
          if (rank == 1) {
            modelLink.textContent = "🥇 " + modelName;
          } else if (rank == 2) {
            modelLink.textContent = "🥈 " + modelName;
          } else if (rank == 3) {
            modelLink.textContent = "🥉 " + modelName;
          } else {
            modelLink.textContent = modelName;
          }
          if (linkMapping.has(modelName)) {
            modelLink.href = linkMapping.get(modelName);
          }
          modelLink.classList.add("link-underline-primary");
          modelLink.classList.add("text-nowrap");
          modelCell.appendChild(modelLink);
          modelCell.style.textAlign = "left";
          dataRow.appendChild(modelCell);

          // optional columns, shown only when the matching checkbox is ticked
          if (dpsCheckBox.checked) {
            const dpsCell = document.createElement("td");
            dpsCell.textContent = row.get("dps").toFixed(1);
            dataRow.appendChild(dpsCell);
          }
          if (passCheckBox.checked) {
            const passCell = document.createElement("td");
            passCell.textContent = row.get("pass@1").toFixed(1);
            dataRow.appendChild(passCell);
          }

          // win rates are stored as fractions; display them as percentages
          const taskWinRateCell = document.createElement("td");
          taskWinRateCell.textContent = (row.get('task_win_rate') * 100).toFixed(1);
          dataRow.appendChild(taskWinRateCell);

          const modelWinRateCell = document.createElement("td");
          modelWinRateCell.textContent = (row.get('model_win_rate') * 100).toFixed(1);
          modelWinRateCell.style.backgroundColor = "#EEFFEE";
          dataRow.appendChild(modelWinRateCell);
          tbody.appendChild(dataRow);
        });
        table.appendChild(tbody);
      };

      const clearTable = () => {
        metricTable.innerHTML = "";
      };

      const main = () => {
        clearTable();
        displayTable(metricTable);
      };

      passCheckBox.addEventListener("change", main);
      dpsCheckBox.addEventListener("change", main);

      main();

      initializeHeatmap();

      function initializeHeatmap() {
        const heatmapChart = echarts.init(document.getElementById('heatmapChart'));
        const xAxisSelectors = document.getElementById('xAxisSelectors');
        const yAxisSelectors = document.getElementById('yAxisSelectors');

        const modelData = Array.from(globalData).map(([modelId, value]) => ({
          id: modelId,
          name: modelId.includes('--') ? modelId.split('--')[1] : modelId,
          winrate: parseFloat(value.get('model_win_rate')),
        }));

        // sort by general winrate
        modelData.sort((a, b) => b.winrate - a.winrate);

        const defaultDisplayNum = 7;

        let selectedXModels = modelData.slice(0, defaultDisplayNum).map(m => m.id);
        let selectedYModels = modelData.slice(0, defaultDisplayNum).map(m => m.id);

        function createCheckboxes() {
          xAxisSelectors.innerHTML = '';
          yAxisSelectors.innerHTML = '';

          modelData.forEach(model => {
            const xDiv = document.createElement('div');
            xDiv.className = 'form-check';
            xDiv.innerHTML = `
          <div class="d-flex align-items-center" style="min-width: 0;">
            <input class="form-check-input flex-shrink-0" type="checkbox" id="x-${model.id}"
                  ${selectedXModels.includes(model.id) ? 'checked' : ''} >
            <label class="form-check-label flex-grow-1 fs-6" for="x-${model.id}" title="${model.name}">
              &nbsp;${model.name}
            </label>
          </div>
        `;
            xAxisSelectors.appendChild(xDiv);

            const yDiv = document.createElement('div');
            yDiv.className = 'form-check';
            yDiv.innerHTML = `
          <div class="d-flex align-items-center" style="min-width: 0;">
            <input class="form-check-input flex-shrink-0" type="checkbox" id="y-${model.id}"
                  ${selectedYModels.includes(model.id) ? 'checked' : ''} >
            <label class="form-check-label flex-grow-1 fs-6" for="y-${model.id}" title="${model.name}">
              &nbsp;${model.name}
            </label>
          </div>
        `;
            yAxisSelectors.appendChild(yDiv);
          });

          document.querySelectorAll('#xAxisSelectors input[type="checkbox"]').forEach(checkbox => {
            checkbox.addEventListener('change', function () {
              const modelId = this.id.slice(2);
              if (this.checked) {
                selectedXModels.push(modelId);
              } else {
                selectedXModels = selectedXModels.filter(id => id !== modelId);
              }
              updateCheckboxStates();
              updateHeatmap();
            });
          });

          document.querySelectorAll('#yAxisSelectors input[type="checkbox"]').forEach(checkbox => {
            checkbox.addEventListener('change', function () {
              const modelId = this.id.slice(2);
              if (this.checked) {
                selectedYModels.push(modelId);
              } else {
                selectedYModels = selectedYModels.filter(id => id !== modelId);
              }
              updateCheckboxStates();
              updateHeatmap();
            });
          });
        }

        function updateCheckboxStates() {
        }

        function updateHeatmap() {
          const xModels = modelData.filter(m => selectedXModels.includes(m.id));
          const yModels = modelData.filter(m => selectedYModels.includes(m.id)).reverse();

          const heatmapData = [];
          yModels.forEach((model1, i) => {
            xModels.forEach((model2, j) => {
              const dps_pair = HeatmapTable.get(model1.id)[model2.id];
              const sc1 = parseFloat(dps_pair[0]).toFixed(1);
              const sc2 = parseFloat(dps_pair[1]).toFixed(1);
              heatmapData.push([j, i, sc1, sc2, model1.id !== model2.id ? parseFloat((sc1 - sc2).toFixed(1)) : '']);
            });
          });

          const option = {
            tooltip: {
              position: 'top',
              formatter: function (params) {
                const model1 = yModels[params.data[1]].name;
                const model2 = xModels[params.data[0]].name;
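                const sc1 = params.data[2];
                const sc2 = params.data[3];

                // Scores are stored as formatted strings, so compare them numerically.
                // Styles stay empty when the two scores are equal.
                const winStyle = "color:Green;";
                const loseStyle = "color:Tomato;";
                let style1 = "";
                let style2 = "";
                if (parseFloat(sc1) > parseFloat(sc2)) {
                  style1 = winStyle;
                  style2 = loseStyle;
                } else if (parseFloat(sc2) > parseFloat(sc1)) {
                  style1 = loseStyle;
                  style2 = winStyle;
                }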
                return `DPS over <u>common passing tasks</u>:<br>` +
                  `<span style="${style1}">${model1} (left): ${sc1}</span><br>` +
                  `<span style="${style2}">${model2} (bottom): ${sc2}</span>`;
                // return `${model1} vs ${model2}<br>DPS Difference: ${diff}`;
              }
            },
            grid: {
              top: '10%',
              bottom: '20%',
              left: '25%',
              right: '0%'
            },
            xAxis: {
              type: 'category',
              data: xModels.map(m => m.name),
              splitArea: { show: true },
              axisLabel: {
                rotate: 25,
                interval: 0,
              },
              axisTick: {
                alignWithLabel: true,
              },
            },
            yAxis: {
              type: 'category',
              data: yModels.map(m => m.name),
              splitArea: { show: true },
              axisLabel: {
                fontWeight: 'bold',
              },
            },
            visualMap: {
              type: 'continuous',
              min: -15,
              max: 15,
              calculable: true,
              orient: 'horizontal',
              left: 'center',
              top: 0,
              itemHeight: '350',
              inRange: {
                color: ['#d73027', '#f46d43', '#fdae61', '#fee090', '#f8f8f8', '#e0f3f8', '#abd9e9', '#74add1', '#4575b4'],
              },
            },
            series: [{
              name: 'DPS Difference',
              type: 'heatmap',
              data: heatmapData,
              label: {
                show: true,
                formatter: function (params) {
                  const val = params.data[params.data.length - 1];
                  return val > 0 ? '+' + val : val;
                },
                textStyle: {
                  fontSize: 16,
                }
              },
              emphasis: {
                itemStyle: {
                  shadowBlur: 10,
                  shadowColor: 'rgba(0, 0, 0, 0.5)'
                }
              }
            }]
          };

          heatmapChart.setOption(option);
        }

        createCheckboxes();
        updateHeatmap();

        window.addEventListener('resize', function () {
          heatmapChart.resize();
        });
      }
    </script>
</body>

</html>