From 5d1cec8dab14aa55cde0217caa3c414997aefcbc Mon Sep 17 00:00:00 2001
From: Automated <actions@users.noreply.github.com>
Date: Tue, 24 Dec 2024 09:08:30 +0000
Subject: [PATCH] Latest data: Tue Dec 24 09:08:30 UTC 2024

---
 index.html | 594 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 352 insertions(+), 242 deletions(-)
diff --git a/index.html b/index.html
index cade7137..37cd33b4 100644
--- a/index.html
+++ b/index.html
@@ -19,7 +19,7 @@
           <td>
             <h1 class="text-4xl pt-4 font-bold"><span class="underline">Vincent's</span> Arxiv FrontPage</h1>
             <br>
-            <p>Generated on 2024-12-23.</p><br/>
+            <p>Generated on 2024-12-24.</p><br/>
             <p class="text-sm text-gray-500 pt-2">This frontpage is made by scraping arxiv and by running a sentence-model that detects if the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via Github Actions. </p>
             <br>
           </td>
@@ -29,6 +29,182 @@ <h1 class="text-4xl pt-4 font-bold"><span class="underline">Vincent's</span> Arx
             <h2 class="text-2xl tracking-tight pt-4 font-bold">New Datasets</h2>
           </td>
         </tr><tr>
+          <td class="inline-block">
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
+          </td>
+          <td>
+            <div x-data="{open: false}">
+              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
+                ANID: How Far Are We? Evaluating the Discrepancies Between AI-synthesized Images and Natural Images through Multimodal Guidance
+              </span>
+              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
+                <div class="text-center pt-2"></div>
+                <p class="pt-2">
+                  <p>In the rapidly evolving field of Artificial Intelligence Generated Content (AIGC), one of the key challenges is distinguishing AI-synthesized images from natural images. Despite the remarkable capabilities of advanced AI generative models in producing visually compelling images, significant discrepancies remain when these images are compared to natural ones. To systematically investigate and quantify these discrepancies, we introduce an AI-Natural Image Discrepancy Evaluation benchmark aimed at addressing the critical question: <em>how far are AI-generated images (AIGIs) from truly realistic images?</em><span class='px-1 mx-1 bg-yellow-200'>We have constructed a large-scale multimodal dataset, the Distinguishing Natural and AI-generated Images (DNAI) dataset, which includes over 440,000 AIGI samples generated by 8 representative models using both unimodal and multimodal prompts, such as Text-to-Image (T2I), Image-to-Image (I2I), and Text vs. Image-to-Image (TI2I). <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.745</span></span>Our fine-grained assessment framework provides a comprehensive evaluation of the DNAI dataset across five key dimensions: naive visual feature quality, semantic alignment in multimodal generation, aesthetic appeal, downstream task applicability, and coordinated human validation. Extensive evaluation results highlight significant discrepancies across these dimensions, underscoring the necessity of aligning quantitative metrics with human judgment to achieve a holistic understanding of AI-generated image quality. Code is available at https://github.com/ryliu68/ANID.</p>
+                </p>
+              <p class="pb-2 pt-2 text-center">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17632v1' target="_blank">
+                  link
+                </a>
+              </p>
+            </div>
+          </div>
+        </td>
+      </tr><tr>
+          <td class="inline-block">
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
+          </td>
+          <td>
+            <div x-data="{open: false}">
+              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
+                SCBench: A Sports Commentary Benchmark for Video LLMs
+              </span>
+              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
+                <div class="text-center pt-2"></div>
+                <p class="pt-2">
+                  <p>Recently, significant advances have been made in Video Large Language Models (Video LLMs) in both academia and industry. However, methods to evaluate and benchmark the performance of different Video LLMs, especially their fine-grained, temporal visual capabilities, remain very limited. On one hand, current benchmarks use relatively simple videos (e.g., subtitled movie clips) where the model can understand the entire video by processing just a few frames. On the other hand, their datasets lack diversity in task format, comprising only QA or multi-choice QA, which overlooks the models' capacity for generating in-depth and precise texts. Sports videos, which feature intricate visual information, sequential events, and emotionally charged commentary, present a critical challenge for Video LLMs, making sports commentary an ideal benchmarking task. Inspired by these challenges, we propose a novel task, sports video commentary generation, and develop <strong>SCBench</strong> for Video LLMs. To construct such a benchmark, we introduce (1) <strong>SCORES</strong>, a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method, and (2) <strong>CommentarySet</strong>, a dataset consisting of 5,775 annotated video clips and ground-truth labels tailored to our metric. Based on SCBench, we conduct comprehensive evaluations on multiple Video LLMs (e.g., VILA and Video-LLaVA) and chain-of-thought baseline methods. Our results show that InternVL-Chat-2 achieves the best performance with 5.44, surpassing the second-best by 1.04. Our work provides a fresh perspective for future research, aiming to enhance models' overall capabilities in complex visual understanding tasks.<span class='px-1 mx-1 bg-yellow-200'>Our dataset will be released soon. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.978</span></span></p>
+                </p>
+              <p class="pb-2 pt-2 text-center">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17637v1' target="_blank">
+                  link
+                </a>
+              </p>
+            </div>
+          </div>
+        </td>
+      </tr><tr>
+          <td class="inline-block">
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
+          </td>
+          <td>
+            <div x-data="{open: false}">
+              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
+                Chumor 2.0: Towards Benchmarking Chinese Humor Understanding
+              </span>
+              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
+                <div class="text-center pt-2"></div>
+                <p class="pt-2">
+                  <p>Existing humor datasets and evaluations predominantly focus on English, leaving limited resources for culturally nuanced humor in non-English languages like Chinese.<span class='px-1 mx-1 bg-yellow-200'>To address this gap, we construct Chumor, the first Chinese humor explanation dataset that exceeds the size of existing humor datasets. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.821</span></span>Chumor is sourced from Ruo Zhi Ba, a Chinese Reddit-like platform known for sharing intellectually challenging and culturally specific jokes. We test ten LLMs through direct and chain-of-thought prompting, revealing that Chumor poses significant challenges to existing LLMs, with their accuracy slightly above random and far below human. In addition, our analysis highlights that human-annotated humor explanations are significantly better than those generated by GPT-4o and ERNIE-4-turbo.<span class='px-1 mx-1 bg-yellow-200'>We release Chumor at https://huggingface.co/datasets/dnaihao/Chumor, our project page is at https://dnaihao.github.io/Chumor-dataset/, our leaderboard is at https://huggingface.co/spaces/dnaihao/Chumor, and our codebase is at https://github.com/dnaihao/Chumor-dataset. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.899</span></span></p>
+                </p>
+              <p class="pb-2 pt-2 text-center">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17729v1' target="_blank">
+                  link
+                </a>
+              </p>
+            </div>
+          </div>
+        </td>
+      </tr><tr>
+          <td class="inline-block">
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
+          </td>
+          <td>
+            <div x-data="{open: false}">
+              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
+                GauSim: Registering Elastic Objects into Digital World by Gaussian Simulator
+              </span>
+              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
+                <div class="text-center pt-2"></div>
+                <p class="pt-2">
+                  <p>In this work, we introduce GauSim, a novel neural network-based simulator designed to capture the dynamic behaviors of real-world elastic objects represented through Gaussian kernels. Unlike traditional methods that treat kernels as particles within particle-based simulations, we leverage continuum mechanics, modeling each kernel as a continuous piece of matter to account for realistic deformations without idealized assumptions. To improve computational efficiency and fidelity, we employ a hierarchical structure that organizes kernels into Center of Mass Systems (CMS) with explicit formulations, enabling a coarse-to-fine simulation approach. This structure significantly reduces computational overhead while preserving detailed dynamics. In addition, GauSim incorporates explicit physics constraints, such as mass and momentum conservation, ensuring interpretable results and robust, physically plausible simulations.<span class='px-1 mx-1 bg-yellow-200'>To validate our approach, we present a new dataset, READY, containing multi-view videos of real-world elastic deformations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.705</span></span>Experimental results demonstrate that GauSim achieves superior performance compared to existing physics-driven baselines, offering a practical and accurate solution for simulating complex dynamic behaviors. Code and model will be released. Project page: https://www.mmlab-ntu.com/project/gausim/index.html.</p>
+                </p>
+              <p class="pb-2 pt-2 text-center">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17804v1' target="_blank">
+                  link
+                </a>
+              </p>
+            </div>
+          </div>
+        </td>
+      </tr><tr>
+          <td class="inline-block">
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
+          </td>
+          <td>
+            <div x-data="{open: false}">
+              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
+                Cross-View Referring Multi-Object Tracking
+              </span>
+              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
+                <div class="text-center pt-2"></div>
+                <p class="pt-2">
+                  <p>Referring Multi-Object Tracking (RMOT) is an important topic in the current tracking field. Its task is to guide the tracker to track objects that match a language description. Current research mainly focuses on referring multi-object tracking under a single view, which refers to one view sequence or multiple unrelated view sequences. However, in a single view, some appearances of objects are easily invisible, resulting in incorrect matching of objects with the language description. In this work, we propose a new task, called Cross-view Referring Multi-Object Tracking (CRMOT). It introduces the cross-view to obtain the appearances of objects from multiple views, avoiding the problem of invisible object appearances in the RMOT task. CRMOT is a more challenging task of accurately tracking the objects that match the language description and maintaining the identity consistency of objects in each cross-view. To advance the CRMOT task, we construct a cross-view referring multi-object tracking benchmark based on the CAMPUS and DIVOTrack datasets, named CRTrack. Specifically, it provides 13 different scenes and 221 language descriptions. Furthermore, we propose an end-to-end cross-view referring multi-object tracking method, named CRTracker. Extensive experiments on the CRTrack benchmark verify the effectiveness of our method.<span class='px-1 mx-1 bg-yellow-200'>The dataset and code are available at https://github.com/chen-si-jia/CRMOT. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.852</span></span></p>
+                </p>
+              <p class="pb-2 pt-2 text-center">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17807v1' target="_blank">
+                  link
+                </a>
+              </p>
+            </div>
+          </div>
+        </td>
+      </tr><tr>
+          <td class="inline-block">
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
+          </td>
+          <td>
+            <div x-data="{open: false}">
+              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
+                Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders
+              </span>
+              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
+                <div class="text-center pt-2"></div>
+                <p class="pt-2">
+                  <p>Recent 3D content generation pipelines commonly employ Variational Autoencoders (VAEs) to encode shapes into compact latent representations for diffusion-based generation. However, the widely adopted uniform point sampling strategy in Shape VAE training often leads to a significant loss of geometric details, limiting the quality of shape reconstruction and downstream generation tasks. We present Dora-VAE, a novel approach that enhances VAE reconstruction through our proposed sharp edge sampling strategy and a dual cross-attention mechanism. By identifying and prioritizing regions with high geometric complexity during training, our method significantly improves the preservation of fine-grained shape features. This sampling strategy and the dual attention mechanism enable the VAE to focus on crucial geometric details that are typically missed by uniform sampling approaches. To systematically evaluate VAE reconstruction quality, we additionally propose Dora-bench, a benchmark that quantifies shape complexity through the density of sharp edges, introducing a new metric focused on reconstruction accuracy at these salient geometric features. Extensive experiments on the Dora-bench demonstrate that Dora-VAE achieves comparable reconstruction quality to the state-of-the-art dense XCube-VAE while requiring a latent space at least 8× smaller (1,280 vs. > 10,000 codes).<span class='px-1 mx-1 bg-yellow-200'>We will release our code and benchmark dataset to facilitate future research in 3D shape modeling. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.731</span></span></p>
+                </p>
+              <p class="pb-2 pt-2 text-center">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17808v1' target="_blank">
+                  link
+                </a>
+              </p>
+            </div>
+          </div>
+        </td>
+      </tr><tr>
+          <td class="inline-block">
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
+          </td>
+          <td>
+            <div x-data="{open: false}">
+              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
+                ChatGarment: Garment Estimation, Generation and Editing via Large Language Models
+              </span>
+              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
+                <div class="text-center pt-2"></div>
+                <p class="pt-2">
+                  <p>We introduce ChatGarment, a novel approach that leverages large vision-language models (VLMs) to automate the estimation, generation, and editing of 3D garments from images or text descriptions. Unlike previous methods that struggle in real-world scenarios or lack interactive editing capabilities, ChatGarment can estimate sewing patterns from in-the-wild images or sketches, generate them from text descriptions, and edit garments based on user instructions, all within an interactive dialogue. These sewing patterns can then be draped into 3D garments, which are easily animatable and simulatable. This is achieved by finetuning a VLM to directly generate a JSON file that includes both textual descriptions of garment types and styles, as well as continuous numerical attributes. This JSON file is then used to create sewing patterns through a programming parametric model. To support this, we refine the existing programming model, GarmentCode, by expanding its garment type coverage and simplifying its structure for efficient VLM fine-tuning.<span class='px-1 mx-1 bg-yellow-200'>Additionally, we construct a large-scale dataset of image-to-sewing-pattern and text-to-sewing-pattern pairs through an automated data pipeline. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.727</span></span>Extensive evaluations demonstrate ChatGarment's ability to accurately reconstruct, generate, and edit garments from multimodal inputs, highlighting its potential to revolutionize workflows in fashion and gaming applications.<span class='px-1 mx-1 bg-yellow-200'>Code and data will be available at https://chatgarment.github.io/. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.812</span></span></p>
+                </p>
+              <p class="pb-2 pt-2 text-center">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17811v1' target="_blank">
+                  link
+                </a>
+              </p>
+            </div>
+          </div>
+        </td>
+      </tr><tr>
+          <td class="inline-block">
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
+          </td>
+          <td>
+            <div x-data="{open: false}">
+              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
+                FaceLift: Single Image to 3D Head with View Generation and GS-LRM
+              </span>
+              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
+                <div class="text-center pt-2"></div>
+                <p class="pt-2">
+                  <p>We present FaceLift, a feed-forward approach for rapid, high-quality, 360-degree head reconstruction from a single image. Our pipeline begins by employing a multi-view latent diffusion model that generates consistent side and back views of the head from a single facial input. These generated views then serve as input to a GS-LRM reconstructor, which produces a comprehensive 3D representation using Gaussian splats.<span class='px-1 mx-1 bg-yellow-200'>To train our system, we develop a dataset of multi-view renderings using synthetic 3D human head assets. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.757</span></span>The diffusion-based multi-view generator is trained exclusively on synthetic head images, while the GS-LRM reconstructor undergoes initial training on Objaverse followed by fine-tuning on synthetic head data. FaceLift excels at preserving identity and maintaining consistency across views. Despite being trained solely on synthetic data, FaceLift demonstrates remarkable generalization to real-world images. Through extensive qualitative and quantitative evaluations, we show that FaceLift outperforms state-of-the-art methods in 3D head reconstruction, highlighting its practical applicability and robust performance on real-world images. In addition to single image reconstruction, FaceLift supports video inputs for 4D novel view synthesis and seamlessly integrates with 2D reanimation techniques to enable 3D facial animation. Project page: https://weijielyu.github.io/FaceLift.</p>
+                </p>
+              <p class="pb-2 pt-2 text-center">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17812v1' target="_blank">
+                  link
+                </a>
+              </p>
+            </div>
+          </div>
+        </td>
+      </tr><tr>
           <td class="inline-block">
             <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
           </td>
@@ -733,21 +909,26 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">New Datasets</h2>
           </div>
         </td>
       </tr><tr>
+          <td></td>
+          <td>
+            <h2 class="text-2xl tracking-tight pt-4 font-bold">Data Quality</h2>
+          </td>
+        </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-17</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
+                Label Errors in the Tobacco3482 Dataset
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>We present OpenNER 1.0, a standardized collection of openly available named entity recognition (NER) datasets.<span class='px-1 mx-1 bg-yellow-200'>OpenNER contains 34 datasets spanning 51 languages, annotated in varying named entity ontologies. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.838</span></span><span class='px-1 mx-1 bg-yellow-200'>We correct annotation format issues, standardize the original datasets into a uniform representation, map entity type names to be more consistent across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.731</span></span>We provide baseline models using three pretrained multilingual language models to compare the performance of recent models and facilitate future research in NER.</p>
+                  <p>Tobacco3482 is a widely used document classification benchmark dataset.<span class='px-1 mx-1 bg-yellow-200'>However, our manual inspection of the entire dataset uncovers widespread ontological issues, especially large amounts of annotation label problems in the dataset. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.665</span></span><span class='px-1 mx-1 bg-yellow-200'>We establish data label guidelines and find that 11.7% of the dataset is improperly annotated and should either have an unknown label or a corrected label, and 16.7% of samples in the dataset have multiple valid labels. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.812</span></span><span class='px-1 mx-1 bg-yellow-200'>We then analyze the mistakes of a top-performing model and find that 35% of the model's mistakes can be directly attributed to these label issues, highlighting the inherent problems with using a noisily labeled dataset as a benchmark. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.747</span></span>Supplementary material, including dataset annotations and code, is available at https://github.com/gordon-lim/tobacco3482-mistakes/.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09587v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.13140v1' target="_blank">
                   link
                 </a>
               </p>
@@ -756,20 +937,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">New Datasets</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-16</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
+                RepFace: Refining Closed-Set Noise with Progressive Label Correction for Face Recognition
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>Recovering the geometry and materials of objects from a single image is challenging due to its under-constrained nature.In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors.Specifically, 1) we first leverage illumination priors from large-scale diffusion models to build our multi-light diffusion model on a synthetic relighting dataset with dedicated designs.This diffusion model generates multiple consistent images, each illuminated by point light sources in different directions.2) By using these varied lighting images to reduce estimation uncertainty, we train a large G-buffer model with a U-Net backbone to accurately predict surface normals and materials.Extensive experiments validate that our approach significantly outperforms state-of-the-art methods, enabling accurate surface normal and PBR material estimation with vivid relighting effects.<span class='px-1 mx-1 bg-yellow-200'>Code and dataset are available on our project page at https://projects.zxhezexin.com/neural-lightrig. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.726</span></span></p>
+                  <p>Face recognition has made remarkable strides, driven by the expanding scale of datasets and advances in backbones and discriminative losses. However, face recognition performance is heavily affected by label noise, especially closed-set noise.<span class='px-1 mx-1 bg-yellow-200'>While numerous studies have focused on handling label noise, addressing closed-set noise still poses challenges. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.769</span></span>This paper attributes the challenge to two factors: training is not robust to noise in its early stages, and samples with low confidence, which are often misclassified as closed-set noise in later training phases, need an appropriate learning strategy. To address these issues, we propose a new framework that stabilizes training at early stages and splits the samples into clean, ambiguous and noisy groups, each handled with a separate training strategy. Initially, we employ generated auxiliary closed-set noisy samples to enable the model to identify noisy data at the early stages of training. Subsequently, we introduce how samples are split into clean, ambiguous and noisy groups by their similarity to the positive and nearest negative centers.<span class='px-1 mx-1 bg-yellow-200'>Then we perform label fusion for ambiguous samples by incorporating accumulated model predictions. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.602</span></span><span class='px-1 mx-1 bg-yellow-200'>Finally, we apply label smoothing within the closed set, adjusting the label to a point between the nearest negative class and the initially assigned label. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.624</span></span>Extensive experiments validate the effectiveness of our method on mainstream face datasets, achieving state-of-the-art results. The code will be released upon acceptance.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09593v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.12031v1' target="_blank">
                   link
                 </a>
               </p>
@@ -778,20 +959,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">New Datasets</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                RatBodyFormer: Rodent Body Surface from Keypoints
+                CAT: Class Aware Adaptive Thresholding for Semi-Supervised Domain Generalization
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>Rat behavior modeling goes to the heart of many scientific studies, yet the textureless body surface evades automatic analysis as it literally has no keypoints that detectors can find.The movement of the body surface, however, is a rich source of information for deciphering the rat behavior.We introduce two key contributions to automatically recover densely 3D sampled rat body surface points, passively.<span class='px-1 mx-1 bg-yellow-200'>The first is RatDome, a novel multi-camera system for rat behavior capture, and a large-scale dataset captured with it that consists of pairs of 3D keypoints and 3D body surface points. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.744</span></span>The second is RatBodyFormer, a novel network to transform detected keypoints to 3D body surface points.RatBodyFormer is agnostic to the exact locations of the 3D body surface points in the training data and is trained with masked-learning.We experimentally validate our framework with a number of real-world experiments.Our results collectively serve as a novel foundation for automated rat behavior analysis and will likely have far-reaching implications for biomedical and neuroscientific research.</p>
+                  <p>Domain Generalization (DG) seeks to transfer knowledge from multiple source domains to unseen target domains, even in the presence of domain shifts.Achieving effective generalization typically requires a large and diverse set of labeled source data to learn robust representations that can generalize to new, unseen domains.However, obtaining such high-quality labeled data is often costly and labor-intensive, limiting the practical applicability of DG.To address this, we investigate a more practical and challenging problem: semi-supervised domain generalization (SSDG) under a label-efficient paradigm.In this paper, we propose a novel method, CAT, which leverages semi-supervised learning with limited labeled data to achieve competitive generalization performance under domain shifts.<span class='px-1 mx-1 bg-yellow-200'>Our method addresses key limitations of previous approaches, such as reliance on fixed thresholds and sensitivity to noisy pseudo-labels. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.602</span></span>CAT combines adaptive thresholding with noisy label refinement techniques, creating a straightforward yet highly effective solution for SSDG tasks.<span class='px-1 mx-1 bg-yellow-200'>Specifically, our approach uses flexible thresholding to generate high-quality pseudo-labels with higher class diversity while refining noisy pseudo-labels to improve their reliability. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.679</span></span>Extensive experiments across multiple benchmark datasets demonstrate the superior performance of our method, highlighting its effectiveness in achieving robust generalization under domain shift.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09599v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08479v1' target="_blank">
                   link
                 </a>
               </p>
@@ -800,20 +981,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">New Datasets</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Hidden Biases of End-to-End Driving Datasets
+                Defending Against Neural Network Model Inversion Attacks via Data Poisoning
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>End-to-end driving systems have made rapid progress, but have so far not been applied to the challenging new CARLA Leaderboard 2.0.Further, while there is a large body of literature on end-to-end architectures and training strategies, the impact of the training dataset is often overlooked.In this work, we make a first attempt at end-to-end driving for Leaderboard 2.0.Instead of investigating architectures, we systematically analyze the training dataset, leading to new insights: (1) Expert style significantly affects downstream policy performance.(2) In complex data sets, the frames should not be weighted on the basis of simplistic criteria such as class frequencies.(3) Instead, estimating whether a frame changes the target labels compared to previous frames can reduce the size of the dataset without removing important information.By incorporating these findings, our model ranks first and second respectively on the map and sensors tracks of the 2024 CARLA Challenge, and sets a new state-of-the-art on the Bench2Drive test routes.Finally, we uncover a design flaw in the current evaluation metrics and propose a modification for future challenges.<span class='px-1 mx-1 bg-yellow-200'>Our dataset, code, and pre-trained models are publicly available at https://github.com/autonomousvision/carla_garage. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.716</span></span></p>
+                  <p>Model inversion attacks pose a significant privacy threat to machine learning models by reconstructing sensitive data from their outputs.While various defenses have been proposed to counteract these attacks, they often come at the cost of the classifier's utility, thus creating a challenging trade-off between privacy protection and model utility.Moreover, most existing defenses require retraining the classifier for enhanced robustness, which is impractical for large-scale, well-established models.This paper introduces a novel defense mechanism to better balance privacy and utility, particularly against adversaries who employ a machine learning model (i.e., inversion model) to reconstruct private data.Drawing inspiration from data poisoning attacks, which can compromise the performance of machine learning models, we propose a strategy that leverages data poisoning to contaminate the training data of inversion models, thereby preventing model inversion attacks.   Two defense methods are presented.The first, termed label-preserving poisoning attacks for all output vectors (LPA), involves subtle perturbations to all output vectors while preserving their labels.<span class='px-1 mx-1 bg-yellow-200'>Our findings demonstrate that these minor perturbations, introduced through a data poisoning approach, significantly increase the difficulty of data reconstruction without compromising the utility of the classifier. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.602</span></span>Subsequently, we introduce a second method, label-flipping poisoning for partial output vectors (LFP), which selectively perturbs a small subset of output vectors and alters their labels during the process.Empirical results indicate that LPA is notably effective, outperforming the current state-of-the-art defenses.Our data poisoning-based defense provides a new retraining-free defense paradigm that preserves the victim classifier's utility.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09602v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07575v1' target="_blank">
                   link
                 </a>
               </p>
@@ -822,20 +1003,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">New Datasets</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-03</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Do Multimodal Large Language Models See Like Humans?
+                Class-wise Autoencoders Measure Classification Difficulty And Detect Label Mistakes
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models.However, a critical question remains unaddressed: do MLLMs perceive visual information similarly to humans?Current benchmarks lack the ability to evaluate MLLMs from this perspective.To address this challenge, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision.<span class='px-1 mx-1 bg-yellow-200'>HVSBench curated over 85K multimodal samples, spanning 13 categories and 5 fields in HVS, including Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.744</span></span>Extensive experiments demonstrate the effectiveness of our benchmark in providing a comprehensive evaluation of MLLMs.Specifically, we evaluate 13 MLLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results.Our experiments reveal that HVSBench presents a new and significant challenge for cutting-edge MLLMs.We believe that HVSBench will facilitate research on human-aligned and explainable MLLMs, marking a key step in understanding how MLLMs perceive and process visual information.</p>
+                  <p>We introduce a new framework for analyzing classification datasets based on the ratios of reconstruction errors between autoencoders trained on individual classes.This analysis framework enables efficient characterization of datasets on the sample, class, and entire dataset levels.We define reconstruction error ratios (RERs) that probe classification difficulty and allow its decomposition into (1) finite sample size and (2) Bayes error and decision-boundary complexity.Through systematic study across 19 popular visual datasets, we find that our RER-based dataset difficulty probe strongly correlates with error rate for state-of-the-art (SOTA) classification models.<span class='px-1 mx-1 bg-yellow-200'>By interpreting sample-level classification difficulty as a label mistakenness score, we further find that RERs achieve SOTA performance on mislabel detection tasks on hard datasets under symmetric and asymmetric label noise. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.664</span></span>Our code is publicly available at https://github.com/voxel51/reconstruction-error-ratios.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09603v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.02596v1' target="_blank">
                   link
                 </a>
               </p>
@@ -843,21 +1024,26 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">New Datasets</h2>
           </div>
         </td>
       </tr><tr>
+          <td></td>
+          <td>
+            <h2 class="text-2xl tracking-tight pt-4 font-bold">Benchmarks</h2>
+          </td>
+        </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Learning Camera Movement Control from Real-World Drone Videos
+                Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>This study seeks to automate camera movement control for filming existing subjects into attractive videos, contrasting with the creation of non-existent content by directly generating the pixels.We select drone videos as our test case due to their rich and challenging motion patterns, distinctive viewing angles, and precise controls.Existing AI videography methods struggle with limited appearance diversity in simulation training, high costs of recording expert operations, and difficulties in designing heuristic-based goals to cover all scenarios.To avoid these issues, we propose a scalable method that involves collecting real-world training data to improve diversity, extracting camera trajectories automatically to minimize annotation costs, and training an effective architecture that does not rely on heuristics.Specifically, we collect 99k high-quality trajectories by running 3D reconstruction on online videos, connecting camera poses from consecutive frames to formulate 3D camera paths, and using Kalman filter to identify and remove low-quality data.Moreover, we introduce DVGFormer, an auto-regressive transformer that leverages the camera path and images from all past frames to predict camera movement in the next frame.We evaluate our system across 38 synthetic natural scenes and 7 real city 3D scans.We show that our system effectively learns to perform challenging camera movements such as navigating through obstacles, maintaining low altitude to increase perceived speed, and orbiting towers and buildings, which are very useful for recording high-quality videos.<span class='px-1 mx-1 bg-yellow-200'>Data and code are available at dvgformer.github.io. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.714</span></span></p>
+                  <p>Traditional analyses of gradient descent optimization show that, when the largest eigenvalue of the loss Hessian - often referred to as the sharpness - is below a critical learning-rate threshold, then training is 'stable' and training loss decreases monotonically.Recent studies, however, have suggested that the majority of modern deep neural networks achieve good performance despite operating outside this stable regime.In this work, we demonstrate that such instabilities, induced by large learning rates, move model parameters toward flatter regions of the loss landscape.Our crucial insight lies in noting that, during these instabilities, the orientation of the Hessian eigenvectors rotate.This, we conjecture, allows the model to explore regions of the loss landscape that display more desirable geometrical properties for generalization, such as flatness.These rotations are a consequence of network depth, and we prove that for any network with depth > 1, unstable growth in parameters cause rotations in the principal components of the Hessian, which promote exploration of the parameter space away from unstable directions.Our empirical studies reveal an implicit regularization effect in gradient descent with large learning rates operating beyond the stability threshold.<span class='px-1 mx-1 bg-yellow-200'>We find these lead to excellent generalization performance on modern benchmark datasets. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.669</span></span></p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09620v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17613v1' target="_blank">
                   link
                 </a>
               </p>
@@ -866,20 +1052,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">New Datasets</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-12</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation
+                Graph Neural Networks Are Evolutionary Algorithms
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>As virtual reality gains popularity, the demand for controllable creation of immersive and dynamic omnidirectional videos (ODVs) is increasing.While previous text-to-ODV generation methods achieve impressive results, they struggle with content inaccuracies and inconsistencies due to reliance solely on textual inputs.Although recent motion control techniques provide fine-grained control for video generation, directly applying these methods to ODVs often results in spatial distortion and unsatisfactory performance, especially with complex spherical motions.To tackle these challenges, we propose OmniDrag, the first approach enabling both scene- and object-level motion control for accurate, high-quality omnidirectional image-to-video generation.Building on pretrained video diffusion models, we introduce an omnidirectional control module, which is jointly fine-tuned with temporal attention layers to effectively handle complex spherical motion.In addition, we develop a novel spherical motion estimator that accurately extracts motion-control signals and allows users to perform drag-style ODV generation by simply drawing handle and target points.<span class='px-1 mx-1 bg-yellow-200'>We also present a new dataset, named Move360, addressing the scarcity of ODV data with large scene and object motions. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.834</span></span>Experiments demonstrate the significant superiority of OmniDrag in achieving holistic scene-level and fine-grained object-level control for ODV generation.The project page is available at https://lwq20020127.github.io/OmniDrag.</p>
+                  <p>In this paper, we reveal the intrinsic duality between graph neural networks (GNNs) and evolutionary algorithms (EAs), bridging two traditionally distinct fields.Building on this insight, we propose Graph Neural Evolution (GNE), a novel evolutionary algorithm that models individuals as nodes in a graph and leverages designed frequency-domain filters to balance global exploration and local exploitation.Through the use of these filters, GNE aggregates high-frequency (diversity-enhancing) and low-frequency (stability-promoting) information, transforming EAs into interpretable and tunable mechanisms in the frequency domain.Extensive experiments on benchmark functions demonstrate that GNE consistently outperforms state-of-the-art algorithms such as GA, DE, CMA-ES, SDAES, and RL-SHADE, excelling in complex landscapes, optimal solution shifts, and noisy environments.<span class='px-1 mx-1 bg-yellow-200'>Its robustness, adaptability, and superior convergence highlight its practical and theoretical value. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.639</span></span>Beyond optimization, GNE establishes a conceptual and mathematical foundation linking EAs and GNNs, offering new perspectives for both fields.Its framework encourages the development of task-adaptive filters and hybrid approaches for EAs, while its insights can inspire advances in GNNs, such as improved global information propagation and mitigation of oversmoothing.GNE's versatility extends to solving challenges in machine learning, including hyperparameter tuning and neural architecture search, as well as real-world applications in engineering and operations research.By uniting the dynamics of EAs with the structural insights of GNNs, this work provides a foundation for interdisciplinary innovation, paving the way for scalable and interpretable solutions to complex optimization problems.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.09623v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17629v1' target="_blank">
                   link
                 </a>
               </p>
@@ -887,26 +1073,21 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">New Datasets</h2>
           </div>
         </td>
       </tr><tr>
-          <td></td>
-          <td>
-            <h2 class="text-2xl tracking-tight pt-4 font-bold">Data Quality</h2>
-          </td>
-        </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-17</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Label Errors in the Tobacco3482 Dataset
+                SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>Tobacco3482 is a widely used document classification benchmark dataset.<span class='px-1 mx-1 bg-yellow-200'>However, our manual inspection of the entire dataset uncovers widespread ontological issues, especially large amounts of annotation label problems in the dataset. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.665</span></span><span class='px-1 mx-1 bg-yellow-200'>We establish data label guidelines and find that 11.7% of the dataset is improperly annotated and should either have an unknown label or a corrected label, and 16.7% of samples in the dataset have multiple valid labels. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.812</span></span><span class='px-1 mx-1 bg-yellow-200'>We then analyze the mistakes of a top-performing model and find that 35% of the model's mistakes can be directly attributed to these label issues, highlighting the inherent problems with using a noisily labeled dataset as a benchmark. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.747</span></span>Supplementary material, including dataset annotations and code, is available at https://github.com/gordon-lim/tobacco3482-mistakes/.</p>
+                  <p>The availability of challenging simulation environments is pivotal for advancing the field of Multi-Agent Reinforcement Learning (MARL).In cooperative MARL settings, the StarCraft Multi-Agent Challenge (SMAC) has gained prominence as a benchmark for algorithms following centralized training with decentralized execution paradigm.<span class='px-1 mx-1 bg-yellow-200'>However, with continual advancements in SMAC, many algorithms now exhibit near-optimal performance, complicating the evaluation of their true effectiveness. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.658</span></span>To alleviate this problem, in this work, we highlight a critical issue: the default opponent policy in these environments lacks sufficient diversity, leading MARL algorithms to overfit and exploit unintended vulnerabilities rather than learning robust strategies.To overcome these limitations, we propose SMAC-HARD, a novel benchmark designed to enhance training robustness and evaluation comprehensiveness.SMAC-HARD supports customizable opponent strategies, randomization of adversarial policies, and interfaces for MARL self-play, enabling agents to generalize to varying opponent behaviors and improve model stability.Furthermore, we introduce a black-box testing framework wherein agents are trained without exposure to the edited opponent scripts but are tested against these scripts to evaluate the policy coverage and adaptability of MARL algorithms.We conduct extensive evaluations of widely used and state-of-the-art algorithms on SMAC-HARD, revealing the substantial challenges posed by edited and mixed strategy opponents.Additionally, the black-box strategy tests illustrate the difficulty of transferring learned policies to unseen adversaries.We envision SMAC-HARD as a critical step toward benchmarking the next generation of MARL algorithms, fostering progress in self-play methods for multi-agent systems.Our code is available at https://github.com/devindeng94/smac-hard.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.13140v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17707v1' target="_blank">
                   link
                 </a>
               </p>
@@ -915,20 +1096,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">Data Quality</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-16</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                RepFace: Refining Closed-Set Noise with Progressive Label Correction for Face Recognition
+                Knowledge Editing through Chain-of-Thought
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>Face recognition has made remarkable strides, driven by the expanding scale of datasets, advancements in various backbone and discriminative losses.However, face recognition performance is heavily affected by the label noise, especially closed-set noise.<span class='px-1 mx-1 bg-yellow-200'>While numerous studies have focused on handling label noise, addressing closed-set noise still poses challenges. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.769</span></span>This paper identifies this challenge as training isn't robust to noise at the early-stage training, and necessitating an appropriate learning strategy for samples with low confidence, which are often misclassified as closed-set noise in later training phases.To address these issues, we propose a new framework to stabilize the training at early stages and split the samples into clean, ambiguous and noisy groups which are devised with separate training strategies.Initially, we employ generated auxiliary closed-set noisy samples to enable the model to identify noisy data at the early stages of training.Subsequently, we introduce how samples are split into clean, ambiguous and noisy groups by their similarity to the positive and nearest negative centers.<span class='px-1 mx-1 bg-yellow-200'>Then we perform label fusion for ambiguous samples by incorporating accumulated model predictions. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.602</span></span><span class='px-1 mx-1 bg-yellow-200'>Finally, we apply label smoothing within the closed set, adjusting the label to a point between the nearest negative class and the initially assigned label. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.624</span></span>Extensive experiments validate the effectiveness of our method on mainstream face datasets, achieving state-of-the-art results.The code will be released upon acceptance.</p>
+                  <p>Large Language Models (LLMs) have demonstrated exceptional capabilities across a wide range of natural language processing (NLP) tasks.However, keeping these models up-to-date with evolving world knowledge remains a significant challenge due to the high costs of frequent retraining.To address this challenge, knowledge editing techniques have emerged to update LLMs with new information without rebuilding the model from scratch.Among these, the in-context editing paradigm stands out for its effectiveness in integrating new knowledge while preserving the model's original capabilities.Despite its potential, existing in-context knowledge editing methods are often task-specific, focusing primarily on multi-hop QA tasks using structured knowledge triples.Moreover, their reliance on few-shot prompting for task decomposition makes them unstable and less effective in generalizing across diverse tasks.   In response to these limitations, we propose EditCoT, a novel knowledge editing framework that flexibly and efficiently updates LLMs across various tasks without retraining.EditCoT works by generating a chain-of-thought (CoT) for a given input and then iteratively refining this CoT process using a CoT editor based on updated knowledge.We evaluate EditCoT across a diverse range of benchmarks, covering multiple languages and tasks.<span class='px-1 mx-1 bg-yellow-200'>The results demonstrate that our approach achieves state-of-the-art performance while offering superior generalization, effectiveness, and stability compared to existing methods, marking a significant advancement in the field of knowledge updating. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.636</span></span>Code and data are available at: https://github.com/bebr2/EditCoT.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.12031v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17727v1' target="_blank">
                   link
                 </a>
               </p>
@@ -937,20 +1118,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">Data Quality</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-11</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                CAT: Class Aware Adaptive Thresholding for Semi-Supervised Domain Generalization
+                RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>Domain Generalization (DG) seeks to transfer knowledge from multiple source domains to unseen target domains, even in the presence of domain shifts.Achieving effective generalization typically requires a large and diverse set of labeled source data to learn robust representations that can generalize to new, unseen domains.However, obtaining such high-quality labeled data is often costly and labor-intensive, limiting the practical applicability of DG.To address this, we investigate a more practical and challenging problem: semi-supervised domain generalization (SSDG) under a label-efficient paradigm.In this paper, we propose a novel method, CAT, which leverages semi-supervised learning with limited labeled data to achieve competitive generalization performance under domain shifts.<span class='px-1 mx-1 bg-yellow-200'>Our method addresses key limitations of previous approaches, such as reliance on fixed thresholds and sensitivity to noisy pseudo-labels. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.602</span></span>CAT combines adaptive thresholding with noisy label refinement techniques, creating a straightforward yet highly effective solution for SSDG tasks.<span class='px-1 mx-1 bg-yellow-200'>Specifically, our approach uses flexible thresholding to generate high-quality pseudo-labels with higher class diversity while refining noisy pseudo-labels to improve their reliability. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.679</span></span>Extensive experiments across multiple benchmark datasets demonstrate the superior performance of our method, highlighting its effectiveness in achieving robust generalization under domain shift.</p>
+                  <p>Repository-level code translation refers to translating an entire code repository from one programming language to another while preserving the functionality of the source repository.<span class='px-1 mx-1 bg-yellow-200'>Many benchmarks have been proposed to evaluate the performance of such code translators. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.612</span></span><span class='px-1 mx-1 bg-yellow-200'>However, previous benchmarks mostly provide fine-grained samples, focusing on snippet-, function-, or file-level code translation. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.685</span></span>Such benchmarks do not accurately reflect real-world demands, where entire repositories often need to be translated, involving longer code and more complex functionality. To address this gap, we propose a new benchmark, named RepoTransBench, which is a real-world repository-level code translation benchmark with an automatically executable test suite. We conduct experiments on RepoTransBench to evaluate the translation performance of 11 advanced LLMs. We find that the Success@1 score (test success in one attempt) of the best-performing LLM is only 7.33%. To further explore the potential of LLMs for repository-level code translation, we provide LLMs with error-related feedback to perform iterative debugging and observe an average 7.09% improvement on Success@1. However, even with this improvement, the Success@1 score of the best-performing LLM is only 21%, which may not meet the need for reliable automatic repository-level code translation. Finally, we conduct a detailed error analysis and highlight current LLMs' deficiencies in repository-level code translation, which could provide a reference for further improvements.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.08479v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17744v1' target="_blank">
                   link
                 </a>
               </p>
@@ -959,20 +1140,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">Data Quality</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-10</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Defending Against Neural Network Model Inversion Attacks via Data Poisoning
+                Group Testing with General Correlation Using Hypergraphs
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>Model inversion attacks pose a significant privacy threat to machine learning models by reconstructing sensitive data from their outputs.While various defenses have been proposed to counteract these attacks, they often come at the cost of the classifier's utility, thus creating a challenging trade-off between privacy protection and model utility.Moreover, most existing defenses require retraining the classifier for enhanced robustness, which is impractical for large-scale, well-established models.This paper introduces a novel defense mechanism to better balance privacy and utility, particularly against adversaries who employ a machine learning model (i.e., inversion model) to reconstruct private data.Drawing inspiration from data poisoning attacks, which can compromise the performance of machine learning models, we propose a strategy that leverages data poisoning to contaminate the training data of inversion models, thereby preventing model inversion attacks.   Two defense methods are presented.The first, termed label-preserving poisoning attacks for all output vectors (LPA), involves subtle perturbations to all output vectors while preserving their labels.<span class='px-1 mx-1 bg-yellow-200'>Our findings demonstrate that these minor perturbations, introduced through a data poisoning approach, significantly increase the difficulty of data reconstruction without compromising the utility of the classifier. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.602</span></span>Subsequently, we introduce a second method, label-flipping poisoning for partial output vectors (LFP), which selectively perturbs a small subset of output vectors and alters their labels during the process.Empirical results indicate that LPA is notably effective, outperforming the current state-of-the-art defenses.Our data poisoning-based defense provides a new retraining-free defense paradigm that preserves the victim classifier's utility.</p>
+                  <p>Group testing, a problem with diverse applications across multiple disciplines, traditionally assumes independence across nodes' states. Recent research, however, focuses on real-world scenarios that often involve correlations among nodes, challenging the simplifying assumptions made in existing models. In this work, we consider a comprehensive model for arbitrary statistical correlation among nodes' states. To capture and leverage these correlations effectively, we model the problem by hypergraphs, inspired by [GLS22], augmented by a probability mass function on the hyper-edges. Using this model, we first design a novel greedy adaptive algorithm capable of conducting informative tests and dynamically updating the distribution.<span class='px-1 mx-1 bg-yellow-200'>Performance analysis provides upper bounds on the number of tests required, which depend solely on the entropy of the underlying probability distribution and the average number of infections. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.625</span></span>We demonstrate that the algorithm recovers or improves upon all previously known results for group testing settings with correlation. Additionally, we provide families of graphs where the algorithm is order-wise optimal and give examples where the algorithm or its analysis is not tight. We then generalize the proposed framework of group testing with general correlation in two directions, namely noisy group testing and semi-non-adaptive group testing. In both settings, we provide novel theoretical bounds on the number of tests required.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.07575v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17751v1' target="_blank">
                   link
                 </a>
               </p>
@@ -981,20 +1162,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">Data Quality</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-03</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Class-wise Autoencoders Measure Classification Difficulty And Detect Label Mistakes
+                In Case You Missed It: ARC 'Challenge' Is Not That Challenging
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>We introduce a new framework for analyzing classification datasets based on the ratios of reconstruction errors between autoencoders trained on individual classes.This analysis framework enables efficient characterization of datasets on the sample, class, and entire dataset levels.We define reconstruction error ratios (RERs) that probe classification difficulty and allow its decomposition into (1) finite sample size and (2) Bayes error and decision-boundary complexity.Through systematic study across 19 popular visual datasets, we find that our RER-based dataset difficulty probe strongly correlates with error rate for state-of-the-art (SOTA) classification models.<span class='px-1 mx-1 bg-yellow-200'>By interpreting sample-level classification difficulty as a label mistakenness score, we further find that RERs achieve SOTA performance on mislabel detection tasks on hard datasets under symmetric and asymmetric label noise. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.664</span></span>Our code is publicly available at https://github.com/voxel51/reconstruction-error-ratios.</p>
+                  <p>ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged.<span class='px-1 mx-1 bg-yellow-200'>We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.605</span></span>In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.02596v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17758v1' target="_blank">
                   link
                 </a>
               </p>
@@ -1002,11 +1183,50 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">Data Quality</h2>
           </div>
         </td>
       </tr><tr>
-          <td></td>
+          <td class="inline-block">
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
+          </td>
           <td>
-            <h2 class="text-2xl tracking-tight pt-4 font-bold">Benchmarks</h2>
+            <div x-data="{open: false}">
+              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
+                Large Motion Video Autoencoding with Cross-modal Video VAE
+              </span>
+              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
+                <div class="text-center pt-2"></div>
+                <p class="pt-2">
+                  <p>Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding.<span class='px-1 mx-1 bg-yellow-200'>Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.825</span></span>The project website can be found at https://yzxing87.github.io/vae/.</p>
+                </p>
+              <p class="pb-2 pt-2 text-center">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17805v1' target="_blank">
+                  link
+                </a>
+              </p>
+            </div>
+          </div>
+        </td>
+      </tr><tr>
+          <td class="inline-block">
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
-        </tr><tr>
+          <td>
+            <div x-data="{open: false}">
+              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
+                Cross-View Referring Multi-Object Tracking
+              </span>
+              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
+                <div class="text-center pt-2"></div>
+                <p class="pt-2">
+                  <p>Referring Multi-Object Tracking (RMOT) is an important topic in the current tracking field. Its task form is to guide the tracker to track objects that match the language description. Current research mainly focuses on referring multi-object tracking under single-view, which refers to a view sequence or multiple unrelated view sequences. However, in the single-view, some appearances of objects are easily invisible, resulting in incorrect matching of objects with the language description. In this work, we propose a new task, called Cross-view Referring Multi-Object Tracking (CRMOT). It introduces the cross-view to obtain the appearances of objects from multiple views, avoiding the problem of the invisible appearances of objects in RMOT task. CRMOT is a more challenging task of accurately tracking the objects that match the language description and maintaining the identity consistency of objects in each cross-view. To advance CRMOT task, we construct a cross-view referring multi-object tracking benchmark based on CAMPUS and DIVOTrack datasets, named CRTrack. Specifically, it provides 13 different scenes and 221 language descriptions. Furthermore, we propose an end-to-end cross-view referring multi-object tracking method, named CRTracker.<span class='px-1 mx-1 bg-yellow-200'>Extensive experiments on the CRTrack benchmark verify the effectiveness of our method. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.788</span></span>The dataset and code are available at https://github.com/chen-si-jia/CRMOT.</p>
+                </p>
+              <p class="pb-2 pt-2 text-center">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17807v1' target="_blank">
+                  link
+                </a>
+              </p>
+            </div>
+          </div>
+        </td>
+      </tr><tr>
           <td class="inline-block">
             <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
           </td>
@@ -1402,28 +1622,6 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">Benchmarks</h2>
             </div>
           </div>
         </td>
-      </tr><tr>
-          <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-17</p>
-          </td>
-          <td>
-            <div x-data="{open: false}">
-              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Boosting Test Performance with Importance Sampling--a Subpopulation Perspective
-              </span>
-              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
-                <div class="text-center pt-2"></div>
-                <p class="pt-2">
-                  <p>Despite empirical risk minimization (ERM) is widely applied in the machine learning community, its performance is limited on data with spurious correlation or subpopulation that is introduced by hidden attributes.<span class='px-1 mx-1 bg-yellow-200'>Existing literature proposed techniques to maximize group-balanced or worst-group accuracy when such correlation presents, yet, at the cost of lower average accuracy. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.658</span></span>In addition, many existing works conduct surveys on different subpopulation methods without revealing the inherent connection between these methods, which could hinder the technology advancement in this area.In this paper, we identify important sampling as a simple yet powerful tool for solving the subpopulation problem.On the theory side, we provide a new systematic formulation of the subpopulation problem and explicitly identify the assumptions that are not clearly stated in the existing works.This helps to uncover the cause of the dropped average accuracy.We provide the first theoretical discussion on the connections of existing methods, revealing the core components that make them different.On the application side, we demonstrate a single estimator is enough to solve the subpopulation problem.In particular, we introduce the estimator in both attribute-known and -unknown scenarios in the subpopulation setup, offering flexibility in practical use cases.<span class='px-1 mx-1 bg-yellow-200'>And empirically, we achieve state-of-the-art performance on commonly used benchmark datasets. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.704</span></span></p>
-                </p>
-              <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.13003v1' target="_blank">
-                  link
-                </a>
-              </p>
-            </div>
-          </div>
-        </td>
       </tr><tr>
           <td class="inline-block">
             <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-17</p>
@@ -1733,109 +1931,26 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">Benchmarks</h2>
           </div>
         </td>
       </tr><tr>
-          <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-16</p>
-          </td>
-          <td>
-            <div x-data="{open: false}">
-              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Witty: An Efficient Solver for Computing Minimum-Size Decision Trees
-              </span>
-              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
-                <div class="text-center pt-2"></div>
-                <p class="pt-2">
-                  <p>Decision trees are a classic model for summarizing and classifying data.To enhance interpretability and generalization properties, it has been proposed to favor small decision trees.Accordingly, in the minimum-size decision tree training problem (MSDT), the input is a set of training examples in $\mathbb{R}^d$ with class labels and we aim to find a decision tree that classifies all training examples correctly and has a minimum number of nodes.MSDT is NP-hard and therefore presumably not solvable in polynomial time.Nevertheless, Komusiewicz et al.[ICML '23] developed a promising algorithmic paradigm called witness trees which solves MSDT efficiently if the solution tree is small.In this work, we test this paradigm empirically.<span class='px-1 mx-1 bg-yellow-200'>We provide an implementation, augment it with extensive heuristic improvements, and scrutinize it on standard benchmark instances. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.695</span></span>The augmentations achieve a mean 324-fold (median 84-fold) speedup over the naive implementation.Compared to the state of the art they achieve a mean 32-fold (median 7-fold) speedup over the dynamic programming based MurTree solver[Demirovi\'c et al., J. Mach.Learn.<span class='px-1 mx-1 bg-yellow-200'>Res. '22] and a mean 61-fold (median 25-fold) speedup over SAT-based implementations [Janota and Morgado, SAT '20]. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.663</span></span>As a theoretical result we obtain an improved worst-case running-time bound for MSDT.</p>
-                </p>
-              <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.11954v1' target="_blank">
-                  link
-                </a>
-              </p>
-            </div>
-          </div>
-        </td>
-      </tr><tr>
-          <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-16</p>
-          </td>
-          <td>
-            <div x-data="{open: false}">
-              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Memory-Reduced Meta-Learning with Guaranteed Convergence
-              </span>
-              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
-                <div class="text-center pt-2"></div>
-                <p class="pt-2">
-                  <p>The optimization-based meta-learning approach is gaining increased traction because of its unique ability to quickly adapt to a new task using only small amounts of data.However, existing optimization-based meta-learning approaches, such as MAML, ANIL and their variants, generally employ backpropagation for upper-level gradient estimation, which requires using historical lower-level parameters/gradients and thus increases computational and memory overhead in each iteration.In this paper, we propose a meta-learning algorithm that can avoid using historical parameters/gradients and significantly reduce memory costs in each iteration compared to existing optimization-based meta-learning approaches.<span class='px-1 mx-1 bg-yellow-200'>In addition to memory reduction, we prove that our proposed algorithm converges sublinearly with the iteration number of upper-level optimization, and the convergence error decays sublinearly with the batch size of sampled tasks. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.622</span></span>In the specific case in terms of deterministic meta-learning, we also prove that our proposed algorithm converges to an exact solution.Moreover, we quantify that the computational complexity of the algorithm is on the order of $\mathcal{O}(\epsilon^{-1})$, which matches existing convergence results on meta-learning even without using any historical parameters/gradients.<span class='px-1 mx-1 bg-yellow-200'>Experimental results on meta-learning benchmarks confirm the efficacy of our proposed algorithm. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.682</span></span></p>
-                </p>
-              <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.12030v1' target="_blank">
-                  link
-                </a>
-              </p>
-            </div>
-          </div>
-        </td>
-      </tr><tr>
-          <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-16</p>
-          </td>
+          <td></td>
           <td>
-            <div x-data="{open: false}">
-              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Exploring Semantic Consistency and Style Diversity for Domain Generalized Semantic Segmentation
-              </span>
-              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
-                <div class="text-center pt-2"></div>
-                <p class="pt-2">
-                  <p>Domain Generalized Semantic Segmentation (DGSS) seeks to utilize source domain data exclusively to enhance the generalization of semantic segmentation across unknown target domains.Prevailing studies predominantly concentrate on feature normalization and domain randomization, these approaches exhibit significant limitations.Feature normalization-based methods tend to confuse semantic features in the process of constraining the feature space distribution, resulting in classification misjudgment.Domain randomization-based methods frequently incorporate domain-irrelevant noise due to the uncontrollability of style transformations, resulting in segmentation ambiguity.To address these challenges, we introduce a novel framework, named SCSD for Semantic Consistency prediction and Style Diversity generalization.It comprises three pivotal components: Firstly, a Semantic Query Booster is designed to enhance the semantic awareness and discrimination capabilities of object queries in the mask decoder, enabling cross-domain semantic consistency prediction.Secondly, we develop a Text-Driven Style Transform module that utilizes domain difference text embeddings to controllably guide the style transformation of image features, thereby increasing inter-domain style diversity.Lastly, to prevent the collapse of similar domain feature spaces, we introduce a Style Synergy Optimization mechanism that fortifies the separation of inter-domain features and the aggregation of intra-domain features by synergistically weighting style contrastive loss and style aggregation loss.<span class='px-1 mx-1 bg-yellow-200'>Extensive experiments demonstrate that the proposed SCSD significantly outperforms existing state-of-theart methods. 
<span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.605</span></span>Notably, SCSD trained on GTAV achieved an average of 49.11 mIoU on the four unseen domain datasets, surpassing the previous state-of-the-art method by +4.08 mIoU. Code is available at https://github.com/nhw649/SCSD.</p>
-                </p>
-              <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.12050v1' target="_blank">
-                  link
-                </a>
-              </p>
-            </div>
-          </div>
-        </td>
-      </tr><tr>
-          <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-16</p>
+            <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
           </td>
-          <td>
-            <div x-data="{open: false}">
-              <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations
-              </span>
-              <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
-                <div class="text-center pt-2"></div>
-                <p class="pt-2">
-                  <p>Capturing geometric and material information from images remains a fundamental challenge in computer vision and graphics.Traditional optimization-based methods often require hours of computational time to reconstruct geometry, material properties, and environmental lighting from dense multi-view inputs, while still struggling with inherent ambiguities between lighting and material.On the other hand, learning-based approaches leverage rich material priors from existing 3D object datasets but face challenges with maintaining multi-view consistency.In this paper, we introduce IDArb, a diffusion-based model designed to perform intrinsic decomposition on an arbitrary number of images under varying illuminations.Our method achieves accurate and multi-view consistent estimation on surface normals and material properties.This is made possible through a novel cross-view, cross-domain attention module and an illumination-augmented, view-adaptive training strategy.Additionally, we introduce ARB-Objaverse, a new dataset that provides large-scale multi-view intrinsic data and renderings under diverse lighting conditions, supporting robust training.<span class='px-1 mx-1 bg-yellow-200'>Extensive experiments demonstrate that IDArb outperforms state-of-the-art methods both qualitatively and quantitatively. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.602</span></span>Moreover, our approach facilitates a range of downstream tasks, including single-image relighting, photometric stereo, and 3D reconstruction, highlighting its broad applications in realistic 3D content creation.</p>
-                </p>
-              <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.12083v1' target="_blank">
-                  link
-                </a>
-              </p>
-            </div>
-          </div>
-        </td>
-      </tr><tr>
+        </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-16</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                No More Tuning: Prioritized Multi-Task Learning with Lagrangian Differential Multiplier Methods
+                Emerging Security Challenges of Large Language Models
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>Given the ubiquity of multi-task in practical systems, Multi-Task Learning (MTL) has found widespread application across diverse domains.In real-world scenarios, these tasks often have different priorities.For instance, In web search, relevance is often prioritized over other metrics, such as click-through rates or user engagement.Existing frameworks pay insufficient attention to the prioritization among different tasks, which typically adjust task-specific loss function weights to differentiate task priorities.However, this approach encounters challenges as the number of tasks grows, leading to exponential increases in hyper-parameter tuning complexity.Furthermore, the simultaneous optimization of multiple objectives can negatively impact the performance of high-priority tasks due to interference from lower-priority tasks.   In this paper, we introduce a novel multi-task learning framework employing Lagrangian Differential Multiplier Methods for step-wise multi-task optimization.It is designed to boost the performance of high-priority tasks without interference from other tasks.Its primary advantage lies in its ability to automatically optimize multiple objectives without requiring balancing hyper-parameters for different tasks, thereby eliminating the need for manual tuning.<span class='px-1 mx-1 bg-yellow-200'>Additionally, we provide theoretical analysis demonstrating that our method ensures optimization guarantees, enhancing the reliability of the process. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.621</span></span>We demonstrate its effectiveness through experiments on multiple public datasets and its application in Taobao search, a large-scale industrial search ranking system, resulting in significant improvements across various business metrics.</p>
+                  <p><span class='px-1 mx-1 bg-yellow-200'>Large language models (LLMs) have achieved record adoption in a short period of time across many different sectors including high importance areas such as education [4] and healthcare <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.688</span></span>[23].<span class='px-1 mx-1 bg-yellow-200'>LLMs are open-ended models trained on diverse data without being tailored for specific downstream tasks, enabling broad applicability across various domains. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.673</span></span>They are commonly used for text generation, but also widely used to assist with code generation [3], and even analysis of security information, as Microsoft Security Copilot demonstrates [18].Traditional Machine Learning (ML) models are vulnerable to adversarial attacks [9].<span class='px-1 mx-1 bg-yellow-200'>So the concerns on the potential security implications of such wide scale adoption of LLMs have led to the creation of this working group on the security of LLMs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.775</span></span><span class='px-1 mx-1 bg-yellow-200'>During the Dagstuhl seminar on "Network Attack Detection and Defense - AI-Powered Threats and Responses", the working group discussions focused on the vulnerability of LLMs to adversarial attacks, rather than their potential use in generating malware or enabling cyberattacks. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.699</span></span><span class='px-1 mx-1 bg-yellow-200'>Although we note the potential threat represented by the latter, the role of the LLMs in such uses is mostly as an accelerator for development, similar to what it is in benign use. 
<span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.78</span></span><span class='px-1 mx-1 bg-yellow-200'>To make the analysis more specific, the working group employed ChatGPT as a concrete example of an LLM and addressed the following points, which also form the structure of this report: 1. How do LLMs differ in vulnerabilities from traditional ML models? <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.726</span></span>2.<span class='px-1 mx-1 bg-yellow-200'>What are the attack objectives in LLMs? <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.727</span></span><span class='px-1 mx-1 bg-yellow-200'>3. How complex is it to assess the risks posed by the vulnerabilities of LLMs? <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.743</span></span>4.<span class='px-1 mx-1 bg-yellow-200'>What is the supply chain in LLMs, how does data flow in and out of systems, and what are the security implications? <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.752</span></span>We conclude with an overview of open challenges and outlook.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.12092v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17614v1' target="_blank">
                   link
                 </a>
               </p>
@@ -1843,26 +1958,21 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">Benchmarks</h2>
           </div>
         </td>
       </tr><tr>
-          <td></td>
-          <td>
-            <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
-          </td>
-        </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Nano-ESG: Extracting Corporate Sustainability Information from News Articles
+                Dynamic safety cases for frontier AI
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>Determining the sustainability impact of companies is a highly complex subject which has garnered more and more attention over the past few years.Today, investors largely rely on sustainability-ratings from established rating-providers in order to analyze how responsibly a company acts.However, those ratings have recently been criticized for being hard to understand and nearly impossible to reproduce.   An independent way to find out about the sustainability practices of companies lies in the rich landscape of news article data.In this paper, we explore a different approach to identify key opportunities and challenges of companies in the sustainability domain.We present a novel dataset of more than 840,000 news articles which were gathered for major German companies between January 2023 and September 2024.By applying a mixture of Natural Language Processing techniques, we first identify relevant articles, before summarizing them and extracting their sustainability-related sentiment and aspect using Large Language Models (LLMs).<span class='px-1 mx-1 bg-yellow-200'>Furthermore, we conduct an evaluation of the obtained data and determine that the LLM-produced answers are accurate. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.636</span></span>We release both datasets at https://github.com/Bailefan/Nano-ESG.</p>
+                  <p>Frontier artificial intelligence (AI) systems present both benefits and risks to society.Safety cases - structured arguments supported by evidence - are one way to help ensure the safe development and deployment of these systems.Yet the evolving nature of AI capabilities, as well as changes in the operational environment and understanding of risk, necessitates mechanisms for continuously updating these safety cases.Typically, in other sectors, safety cases are produced pre-deployment and do not require frequent updates post-deployment, which can be a manual, costly process.<span class='px-1 mx-1 bg-yellow-200'>This paper proposes a Dynamic Safety Case Management System (DSCMS) to support both the initial creation of a safety case and its systematic, semi-automated revision over time. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.637</span></span>Drawing on methods developed in the autonomous vehicles (AV) sector - state-of-the-art Checkable Safety Arguments (CSA) combined with Safety Performance Indicators (SPIs) recommended by UL 4600, a DSCMS helps developers maintain alignment between system safety claims and the latest system state.We demonstrate this approach on a safety case template for offensive cyber capabilities and suggest ways it can be integrated into governance structures for safety-critical decision-making.<span class='px-1 mx-1 bg-yellow-200'>While the correctness of the initial safety argument remains paramount - particularly for high-severity risks - a DSCMS provides a framework for adapting to new insights and strengthening incident response. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.613</span></span>We outline challenges and further work towards development and implementation of this approach as part of continuous safety assurance of frontier AI systems.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15093v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17618v1' target="_blank">
                   link
                 </a>
               </p>
@@ -1871,20 +1981,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Review-Then-Refine: A Dynamic Framework for Multi-Hop Question Answering with Temporal Adaptability
+                Tracking the Feature Dynamics in LLM Training: A Mechanistic Study
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>Retrieve-augmented generation (RAG) frameworks have emerged as a promising solution to multi-hop question answering(QA) tasks since it enables large language models (LLMs) to incorporate external knowledge and mitigate their inherent knowledge deficiencies.Despite this progress, existing RAG frameworks, which usually follows the retrieve-then-read paradigm, often struggle with multi-hop QA with temporal information since it has difficulty retrieving and synthesizing accurate time-related information.<span class='px-1 mx-1 bg-yellow-200'>To address the challenge, this paper proposes a novel framework called review-then-refine, which aims to enhance LLM performance in multi-hop QA scenarios with temporal information. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.669</span></span>Our approach begins with a review phase, where decomposed sub-queries are dynamically rewritten with temporal information, allowing for subsequent adaptive retrieval and reasoning process.In addition, we implement adaptive retrieval mechanism to minimize unnecessary retrievals, thus reducing the potential for hallucinations.<span class='px-1 mx-1 bg-yellow-200'>In the subsequent refine phase, the LLM synthesizes the retrieved information from each sub-query along with its internal knowledge to formulate a coherent answer. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.642</span></span><span class='px-1 mx-1 bg-yellow-200'>Extensive experimental results across multiple datasets demonstrate the effectiveness of our proposed framework, highlighting its potential to significantly improve multi-hop QA capabilities in LLMs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.707</span></span></p>
+                  <p>Understanding training dynamics and feature evolution is crucial for the mechanistic interpretability of large language models (LLMs). Although sparse autoencoders (SAEs) have been used to identify features within LLMs, a clear picture of how these features evolve during training remains elusive. In this study, we: (1) introduce SAE-Track, a method to efficiently obtain a continual series of SAEs; (2) formulate the process of feature formation and conduct a mechanistic analysis; and (3) analyze and visualize feature drift during training.<span class='px-1 mx-1 bg-yellow-200'>Our work provides new insights into the dynamics of features in LLMs, enhancing our understanding of training mechanisms and feature evolution. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.693</span></span></p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15101v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17626v1' target="_blank">
                   link
                 </a>
               </p>
@@ -1893,20 +2003,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search
+                SCBench: A Sports Commentary Benchmark for Video LLMs
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>In the realm of Text-Based Person Search (TBPS), mainstream methods aim to explore more efficient interaction frameworks between text descriptions and visual data.However, recent approaches encounter two principal challenges.Firstly, the widely used random-based Masked Language Modeling (MLM) considers all the words in the text equally during training.However, massive semantically vacuous words ('with', 'the', etc.) be masked fail to contribute efficient interaction in the cross-modal MLM and hampers the representation alignment.Secondly, manual descriptions in TBPS datasets are tedious and inevitably contain several inaccuracies.To address these issues, we introduce an Attention-Guided Alignment (AGA) framework featuring two innovative components: Attention-Guided Mask (AGM)<span class='px-1 mx-1 bg-yellow-200'>Modeling and Text Enrichment Module (TEM). <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.603</span></span>AGM dynamically masks semantically meaningful words by aggregating the attention weight derived from the text encoding process, thereby cross-modal MLM can capture information related to the masked word from text context and images and align their representations.Meanwhile, TEM alleviates low-quality representations caused by repetitive and erroneous text descriptions by replacing those semantically meaningful words with MLM's prediction.It not only enriches text descriptions but also prevents overfitting.Extensive experiments across three challenging benchmarks demonstrate the effectiveness of our AGA, achieving new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively.</p>
+                  <p><span class='px-1 mx-1 bg-yellow-200'>Recently, significant advances have been made in Video Large Language Models (Video LLMs) in both academia and industry. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.629</span></span><span class='px-1 mx-1 bg-yellow-200'>However, methods to evaluate and benchmark the performance of different Video LLMs, especially their fine-grained, temporal visual capabilities, remain very limited. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.683</span></span>On one hand, current benchmarks use relatively simple videos (e.g., subtitled movie clips) where the model can understand the entire video by processing just a few frames. On the other hand, their datasets lack diversity in task format, comprising only QA or multi-choice QA, which overlooks the models' capacity for generating in-depth and precise texts.<span class='px-1 mx-1 bg-yellow-200'>Sports videos, which feature intricate visual information, sequential events, and emotionally charged commentary, present a critical challenge for Video LLMs, making sports commentary an ideal benchmarking task. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.657</span></span><span class='px-1 mx-1 bg-yellow-200'>Inspired by these challenges, we propose a novel task, sports video commentary generation, and develop SCBench for Video LLMs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.611</span></span>To construct such a benchmark, we introduce (1) SCORES, a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method, and (2) CommentarySet, a dataset consisting of 5,775 annotated video clips and ground-truth labels tailored to our metric.<span class='px-1 mx-1 bg-yellow-200'>Based on SCBench, we conduct comprehensive evaluations on multiple Video LLMs (e.g., VILA, Video-LLaVA, etc.) 
and chain-of-thought baseline methods. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.632</span></span>Our results show that InternVL-Chat-2 achieves the best performance with 5.44, surpassing the second-best by 1.04. Our work provides a fresh perspective for future research, aiming to enhance models' overall capabilities in complex visual understanding tasks. Our dataset will be released soon.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15106v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17637v1' target="_blank">
                   link
                 </a>
               </p>
@@ -1915,20 +2025,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Associative memory inspires improvements for in-context learning using a novel attention residual stream architecture
+                Detecting anxiety and depression in dialogues: a multi-label and explainable approach
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p><span class='px-1 mx-1 bg-yellow-200'>Large language models (LLMs) demonstrate an impressive ability to utilise information within the context of their input sequences to appropriately respond to data unseen by the LLM during its training procedure. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.684</span></span>This ability is known as in-context learning (ICL).<span class='px-1 mx-1 bg-yellow-200'>Humans and non-human animals demonstrate similar abilities, however their neural architectures differ substantially from LLMs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.705</span></span><span class='px-1 mx-1 bg-yellow-200'>Despite this, a critical component within LLMs, the attention mechanism, resembles modern associative memory models, widely used in and influenced by the computational neuroscience community to model biological memory systems. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.68</span></span>Using this connection, we introduce an associative memory model capable of performing ICL.We use this as inspiration for a novel residual stream architecture which allows information to directly flow between attention heads.We test this architecture during training within a two-layer Transformer and show its ICL abilities manifest more quickly than without this modification.We then apply our architecture in small language models with 8 million parameters, focusing on attention head values, with results also indicating improved ICL performance at this larger and more naturalistic scale.</p>
+                  <p>Anxiety and depression are the most common mental health issues worldwide, affecting a non-negligible part of the population. Accordingly, stakeholders, including governments' health systems, are developing new strategies to promote early detection and prevention from a holistic perspective (i.e., addressing several disorders simultaneously). In this work, an entirely novel system for the multi-label classification of anxiety and depression is proposed. The input data consists of dialogues from user interactions with an assistant chatbot. Another relevant contribution lies in using Large Language Models (LLMs) for feature extraction, given the complexity and variability of language.<span class='px-1 mx-1 bg-yellow-200'>The combination of LLMs, given their high capability for language understanding, and Machine Learning (ML) models, given their contextual knowledge about the classification problem thanks to the labeled data, constitutes a promising approach towards mental health assessment. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.629</span></span>To promote the solution's trustworthiness, reliability, and accountability, explainability descriptions of the model's decision are provided in a graphical dashboard. Experimental results on a real dataset attain 90% accuracy, improving on those in the prior literature. The ultimate objective is to contribute in an accessible and scalable way before formal treatment occurs in the healthcare systems.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15113v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17651v1' target="_blank">
                   link
                 </a>
               </p>
@@ -1937,20 +2047,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Qwen2.5 Technical Report
+                Generating Completions for Fragmented Broca's Aphasic Sentences Using Large Language Models
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p><span class='px-1 mx-1 bg-yellow-200'>In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.621</span></span>Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages.In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens.This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities.In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning.Post-training techniques enhance human preference, and notably improve long text generation, structural data analysis, and instruction following.<span class='px-1 mx-1 bg-yellow-200'>To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich sizes. 
<span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.689</span></span>Open-weight offerings include base and instruction-tuned models, with quantized versions available.In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio.Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc.Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger.Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively.Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.</p>
+                  <p>Broca's aphasia is a type of aphasia characterized by non-fluent, effortful and fragmented speech production with relatively good comprehension.<span class='px-1 mx-1 bg-yellow-200'>Since traditional aphasia treatment methods are often time-consuming, labour-intensive, and do not reflect real-world conversations, applying natural language processing based approaches such as Large Language Models (LLMs) could potentially contribute to improving existing treatment approaches. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.65</span></span>To address this issue, we explore the use of sequence-to-sequence LLMs for completing fragmented Broca's aphasic sentences. We first generate synthetic Broca's aphasic data using a rule-based system designed to mirror the linguistic characteristics of Broca's aphasic speech.<span class='px-1 mx-1 bg-yellow-200'>Using this synthetic data, we then fine-tune four pre-trained LLMs on the task of completing fragmented sentences. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.629</span></span>We evaluate our fine-tuned models on both synthetic and authentic Broca's aphasic data.<span class='px-1 mx-1 bg-yellow-200'>We demonstrate LLMs' capability for reconstructing fragmented sentences, with the models showing improved performance with longer input utterances. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.661</span></span><span class='px-1 mx-1 bg-yellow-200'>Our results highlight the potential of LLMs in advancing communication aids for individuals with Broca's aphasia and possibly other clinical populations. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.678</span></span></p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15115v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17669v1' target="_blank">
                   link
                 </a>
               </p>
@@ -1959,20 +2069,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Adaptive Pruning for Large Language Models with Structural Importance Awareness
+                Large Language Model Safety: A Holistic Survey
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p><span class='px-1 mx-1 bg-yellow-200'>The recent advancements in large language models (LLMs) have significantly improved language understanding and generation capabilities. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.692</span></span><span class='px-1 mx-1 bg-yellow-200'>However, it is difficult to deploy LLMs on resource-constrained edge devices due to their high computational and storage resource demands. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.744</span></span><span class='px-1 mx-1 bg-yellow-200'>To address this issue, we propose a novel LLM model pruning method, namely structurally-aware adaptive pruning (SAAP), to significantly reduce the computational and memory costs while maintaining model performance. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.63</span></span><span class='px-1 mx-1 bg-yellow-200'>We first define an adaptive importance fusion metric to evaluate the importance of all coupled structures in LLMs by considering their homoscedastic uncertainty. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.645</span></span>Then, we rank the importance of all modules to determine the specific layers that should be pruned to meet particular performance requirements.<span class='px-1 mx-1 bg-yellow-200'>Furthermore, we develop a new group fine-tuning strategy to improve the inference efficiency of LLMs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.704</span></span><span class='px-1 mx-1 bg-yellow-200'>Finally, we evaluate the proposed SAAP method on multiple LLMs across two common tasks, i.e., zero-shot classification and text generation. 
<span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.678</span></span>Experimental results show that our SAAP method outperforms several state-of-the-art baseline methods, achieving 2.17%, 2.37%, and 2.39% accuracy gains on LLaMA-7B, Vicuna-7B, and<span class='px-1 mx-1 bg-yellow-200'>LLaMA-13B. Additionally, SAAP improves the token generation speed by 5%, showcasing its practical advantages in resource-constrained scenarios. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.634</span></span></p>
+                  <p><span class='px-1 mx-1 bg-yellow-200'>The rapid development and deployment of large language models (LLMs) have introduced a new frontier in artificial intelligence, marked by unprecedented capabilities in natural language understanding and generation. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.687</span></span>However, the increasing integration of these models into critical applications raises substantial safety concerns, necessitating a thorough examination of their potential risks and associated mitigation strategies.   <span class='px-1 mx-1 bg-yellow-200'>This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.69</span></span><span class='px-1 mx-1 bg-yellow-200'>In addition to the comprehensive review of the mitigation methodologies and evaluation resources on these four aspects, we further explore four topics related to LLM safety: the safety implications of LLM agents, the role of interpretability in enhancing LLM safety, the technology roadmaps proposed and abided by a list of AI companies and institutes for LLM safety, and AI governance aimed at LLM safety with discussions on international cooperation, policy proposals, and prospective regulatory directions.    <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.71</span></span><span class='px-1 mx-1 bg-yellow-200'>Our findings underscore the necessity for a proactive, multifaceted approach to LLM safety, emphasizing the integration of technical solutions, ethical considerations, and robust governance frameworks. 
<span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.719</span></span><span class='px-1 mx-1 bg-yellow-200'>This survey is intended to serve as a foundational resource for academic researchers, industry practitioners, and policymakers, offering insights into the challenges and opportunities associated with the safe integration of LLMs into society. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.778</span></span><span class='px-1 mx-1 bg-yellow-200'>Ultimately, it seeks to contribute to the safe and beneficial development of LLMs, aligning with the overarching goal of harnessing AI for societal advancement and well-being. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.722</span></span><span class='px-1 mx-1 bg-yellow-200'>A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLM-Safety-Papers. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.698</span></span></p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15127v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17686v1' target="_blank">
                   link
                 </a>
               </p>
@@ -1981,20 +2091,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Relaxed exception semantics for Arm-A (extended version)
+                RAGONITE: Iterative Retrieval on Induced Databases and Verbalized RDF for Conversational QA over KGs with RAG
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>To manage exceptions, software relies on a key architectural guarantee, precision: that exceptions appear to execute between instructions.However, this definition, dating back over 60 years, fundamentally assumes a sequential programmers model.Modern architectures such as Arm-A with programmer-observable relaxed behaviour make such a naive definition inadequate, and it is unclear exactly what guarantees programmers have on exception entry and exit.   In this paper, we clarify the concepts needed to discuss exceptions in the relaxed-memory setting -- a key aspect of precisely specifying the architectural interface between hardware and software.<span class='px-1 mx-1 bg-yellow-200'>We explore the basic relaxed behaviour across exception boundaries, and the semantics of external aborts, using Arm-A as a representative modern architecture. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.601</span></span>We identify an important problem, present yet unexplored for decades: pinning down what it means for exceptions to be precise in a relaxed setting.We describe key phenomena that any definition should account for.We develop an axiomatic model for Arm-A precise exceptions, tooling for axiomatic model execution, and a library of tests.Finally we explore the relaxed semantics of software-generated interrupts, as used in sophisticated programming patterns, and sketch how they too could be modelled.</p>
+                  <p>Conversational question answering (ConvQA) is a convenient means of searching over RDF knowledge graphs (KGs), where a prevalent approach is to translate natural language questions to SPARQL queries.<span class='px-1 mx-1 bg-yellow-200'>However, SPARQL has certain shortcomings: (i) it is brittle for complex intents and conversational questions, and (ii) it is not suitable for more abstract needs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.604</span></span>Instead, we propose a novel two-pronged system where we fuse: (i) SQL-query results over a database automatically derived from the KG, and (ii) text-search results over verbalizations of KG facts. Our pipeline supports iterative retrieval: when the results of any branch are found to be unsatisfactory, the system can automatically opt for further rounds.<span class='px-1 mx-1 bg-yellow-200'>We put everything together in a retrieval-augmented generation (RAG) setup, where an LLM generates a coherent response from accumulated search results. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.677</span></span>We demonstrate the superiority of our proposed system over several baselines on a knowledge graph of BMW automobiles.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15140v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17690v1' target="_blank">
                   link
                 </a>
               </p>
@@ -2003,20 +2113,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Language Models as Continuous Self-Evolving Data Engineers
+                Knowledge Editing through Chain-of-Thought
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p><span class='px-1 mx-1 bg-yellow-200'>Large Language Models (LLMs) have demonstrated remarkable capabilities on various tasks, while the further evolvement is limited to the lack of high-quality training data. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.686</span></span><span class='px-1 mx-1 bg-yellow-200'>In addition, traditional training approaches rely too much on expert-labeled data, setting an upper limit on the performance of LLMs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.674</span></span><span class='px-1 mx-1 bg-yellow-200'>To address this issue, we propose a novel paradigm that enables LLMs to train itself by autonomously generating, cleaning, reviewing, and annotating data with preference information, named LANCE. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.737</span></span><span class='px-1 mx-1 bg-yellow-200'>Our approach demonstrates that LLMs can serve as continuous self-evolving data engineers, significantly reducing the time and cost of the post-training data construction process. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.708</span></span>Through iterative fine-tuning on different variants of the Qwen2, we validate the effectiveness of LANCE across various tasks, showing that it can continuously improve model performance and maintain high-quality data generation.Across eight benchmark dimensions, LANCE resulted in an average score enhancement of 3.36 for Qwen2-7B and 2.70 for Qwen2-7B-Instruct.This training paradigm with autonomous data construction not only reduces the reliance on human experts or external models but also ensures that the data aligns with human values and preferences, paving the way for the development of future superintelligent systems that can exceed human capabilities.</p>
+                  <p><span class='px-1 mx-1 bg-yellow-200'>Large Language Models (LLMs) have demonstrated exceptional capabilities across a wide range of natural language processing (NLP) tasks. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.668</span></span>However, keeping these models up-to-date with evolving world knowledge remains a significant challenge due to the high costs of frequent retraining.<span class='px-1 mx-1 bg-yellow-200'>To address this challenge, knowledge editing techniques have emerged to update LLMs with new information without rebuilding the model from scratch. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.711</span></span>Among these, the in-context editing paradigm stands out for its effectiveness in integrating new knowledge while preserving the model's original capabilities. Despite its potential, existing in-context knowledge editing methods are often task-specific, focusing primarily on multi-hop QA tasks using structured knowledge triples. Moreover, their reliance on few-shot prompting for task decomposition makes them unstable and less effective in generalizing across diverse tasks.   <span class='px-1 mx-1 bg-yellow-200'>In response to these limitations, we propose EditCoT, a novel knowledge editing framework that flexibly and efficiently updates LLMs across various tasks without retraining. 
<span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.681</span></span>EditCoT works by generating a chain-of-thought (CoT) for a given input and then iteratively refining this CoT process using a CoT editor based on updated knowledge. We evaluate EditCoT across a diverse range of benchmarks, covering multiple languages and tasks. The results demonstrate that our approach achieves state-of-the-art performance while offering superior generalization, effectiveness, and stability compared to existing methods, marking a significant advancement in the field of knowledge updating. Code and data are available at: https://github.com/bebr2/EditCoT.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15151v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17727v1' target="_blank">
                   link
                 </a>
               </p>
@@ -2025,20 +2135,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Measuring DNA Microswimmer Locomotion in Complex Flow Environments
+                Chumor 2.0: Towards Benchmarking Chinese Humor Understanding
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p><span class='px-1 mx-1 bg-yellow-200'>Microswimmers are sub-millimeter swimming microrobots that show potential as a platform for controllable locomotion in applications including targeted cargo delivery and minimally invasive surgery. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.605</span></span><span class='px-1 mx-1 bg-yellow-200'>To be viable for these target applications, microswimmers will eventually need to be able to navigate in environments with dynamic fluid flows and forces. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.634</span></span><span class='px-1 mx-1 bg-yellow-200'>Experimental studies with microswimmers towards this goal are currently rare because of the difficulty isolating intentional microswimmer motion from environment-induced motion. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.609</span></span>In this work, we present a method for measuring microswimmer locomotion within a complex flow environment using fiducial microspheres.By tracking the particle motion of ferromagnetic and non-magnetic polystyrene fiducial microspheres, we capture the effect of fluid flow and field gradients on microswimmer trajectories.We then determine the field-driven translation of these microswimmers relative to fluid flow and demonstrate the effectiveness of this method by illustrating the motion of multiple microswimmers through different flows.</p>
+                  <p>Existing humor datasets and evaluations predominantly focus on English, leaving limited resources for culturally nuanced humor in non-English languages like Chinese. To address this gap, we construct Chumor, the first Chinese humor explanation dataset that exceeds the size of existing humor datasets. Chumor is sourced from Ruo Zhi Ba, a Chinese Reddit-like platform known for sharing intellectually challenging and culturally specific jokes.<span class='px-1 mx-1 bg-yellow-200'>We test ten LLMs through direct and chain-of-thought prompting, revealing that Chumor poses significant challenges to existing LLMs, with their accuracy slightly above random and far below human performance. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.744</span></span>In addition, our analysis highlights that human-annotated humor explanations are significantly better than those generated by GPT-4o and ERNIE-4-turbo. We release Chumor at https://huggingface.co/datasets/dnaihao/Chumor; our project page is at https://dnaihao.github.io/Chumor-dataset/, our leaderboard is at https://huggingface.co/spaces/dnaihao/Chumor, and our codebase is at https://github.com/dnaihao/Chumor-dataset.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15152v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17729v1' target="_blank">
                   link
                 </a>
               </p>
@@ -2047,20 +2157,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
+                Reasoning to Attend: Try to Understand How &lt;SEG&gt; Token Works
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs, where the textual prompts play a pivotal role in determining quality of output videos.However, achieving the desired output often entails multiple revisions and iterative inference to refine user-provided prompts.Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware when applied to text-to-video diffusion models.To address these problem, we introduce an LLM-based prompt adaptation framework, termed as Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to specific video diffusion model.Our approach involves a meticulously crafted two-stage optimization and alignment system.<span class='px-1 mx-1 bg-yellow-200'>Initially, we conduct a reward-guided prompt evolution pipeline to automatically create optimal prompts pool and leverage them for supervised fine-tuning (SFT) of the LLM. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.632</span></span>Then multi-dimensional rewards are employed to generate pairwise data for the SFT model, followed by the direct preference optimization (DPO) algorithm to further facilitate preference alignment.Through extensive experimentation and comparative analyses, we validate the effectiveness of Prompt-A-Video across diverse generation models, highlighting its potential to push the boundaries of video generation.</p>
+                  <p>Current visual grounding methods built on Large Multimodal Models (LMMs) typically rely on the &lt;SEG&gt; token as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specific model (e.g., SAM). However, we observe that little research has looked into how it works. In this work, we first visualize the similarity maps, which are obtained by computing the semantic similarity between the &lt;SEG&gt; token and the image token embeddings derived from the last hidden layer in both the LLaVA encoder and SAM decoder. Intriguingly, we have found that a striking consistency holds in terms of activation responses in the similarity map, which reveals that what the &lt;SEG&gt; token contributes to is the semantic similarity within image-text pairs.<span class='px-1 mx-1 bg-yellow-200'>Specifically, the &lt;SEG&gt; token, a placeholder expanded in text vocabulary, extensively queries among individual tokenized image patches to match the semantics of an object from text to the paired image while the Large Language Models (LLMs) are being fine-tuned. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.602</span></span>Building on these findings, we present READ, which facilitates LMMs' resilient REAsoning capability of where to attenD under the guidance of highly activated points borrowed from similarity maps. Remarkably, READ features an intuitive design, the Similarity as Points module (SasP), which can be seamlessly applied to &lt;SEG&gt;-like paradigms in a plug-and-play fashion. Extensive experiments have also been conducted on the ReasonSeg and RefCOCO(+/g) datasets. To validate whether READ suffers from catastrophic forgetting of previous skills after fine-tuning, we further assess its generation ability on an augmented FP-RefCOCO(+/g) dataset. All code and models are publicly available at https://github.com/rui-qian/READ.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15156v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17741v1' target="_blank">
                   link
                 </a>
               </p>
@@ -2069,20 +2179,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Rethinking Uncertainty Estimation in Natural Language Generation
+                YuLan-Mini: An Open Data-efficient Language Model
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p><span class='px-1 mx-1 bg-yellow-200'>Large Language Models (LLMs) are increasingly employed in real-world applications, driving the need to evaluate the trustworthiness of their generated text. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.687</span></span>To this end, reliable uncertainty estimation is essential.<span class='px-1 mx-1 bg-yellow-200'>Since current LLMs generate text autoregressively through a stochastic process, the same prompt can lead to varying outputs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.724</span></span>Consequently, leading uncertainty estimation methods generate and analyze multiple output sequences to determine the LLM's uncertainty.However, generating output sequences is computationally expensive, making these methods impractical at scale.In this work, we inspect the theoretical foundations of the leading methods and explore new directions to enhance their computational efficiency.Building on the framework of proper scoring rules, we find that the negative log-likelihood of the most likely output sequence constitutes a theoretically grounded uncertainty measure.To approximate this alternative measure, we propose G-NLL, which has the advantage of being obtained using only a single output sequence generated by greedy decoding.This makes uncertainty estimation more efficient and straightforward, while preserving theoretical rigor.<span class='px-1 mx-1 bg-yellow-200'>Empirical results demonstrate that G-NLL achieves state-of-the-art performance across various LLMs and tasks. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.675</span></span>Our work lays the foundation for efficient and reliable uncertainty estimation in natural language generation, challenging the necessity of more computationally involved methods currently leading the field.</p>
+                  <p><span class='px-1 mx-1 bg-yellow-200'>Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.67</span></span>This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data schedule strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15176v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17743v1' target="_blank">
                   link
                 </a>
               </p>
@@ -2091,20 +2201,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying
+                RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p><span class='px-1 mx-1 bg-yellow-200'>Studies have underscored how, regardless of the recent breakthrough and swift advances in AI research, even state-of-the-art Large Language models (LLMs) continue to struggle when performing logical and mathematical reasoning. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.635</span></span><span class='px-1 mx-1 bg-yellow-200'>The results seem to suggest that LLMs still work as (highly advanced) data pattern identifiers, scoring poorly when attempting to generalise and solve reasoning problems the models have never previously seen or that are not close to samples presented in their training data. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.733</span></span>To address this compelling concern, this paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation.<span class='px-1 mx-1 bg-yellow-200'>We show that employing these critical questions can improve the reasoning capabilities of LLMs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.774</span></span><span class='px-1 mx-1 bg-yellow-200'>By probing the rationale behind the models' reasoning process, the LLM can assess whether some logical mistake is occurring and correct it before providing the final reply to the user prompt. 
<span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.721</span></span>The underlying idea is drawn from the gold standard of any valid argumentative procedure: the conclusion is valid if it is entailed by accepted premises.Or, to paraphrase such Aristotelian principle in a real-world approximation, characterised by incomplete information and presumptive logic, the conclusion is valid if not proved otherwise.This approach successfully steers the models' output through a reasoning pipeline, resulting in better performance against the baseline and its Chain-of-Thought (CoT) implementation.<span class='px-1 mx-1 bg-yellow-200'>To this end, an extensive evaluation of the proposed approach on the MT-Bench Reasoning and Math tasks across a range of LLMs is provided. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.617</span></span></p>
+                  <p>Repository-level code translation refers to translating an entire code repository from one programming language to another while preserving the functionality of the source repository. Many benchmarks have been proposed to evaluate the performance of such code translators. However, previous benchmarks mostly provide fine-grained samples, focusing on snippet-, function-, or file-level code translation. Such benchmarks do not accurately reflect real-world demands, where entire repositories often need to be translated, involving longer code length and more complex functionalities. To address this gap, we propose a new benchmark, named RepoTransBench, which is a real-world repository-level code translation benchmark with an automatically executable test suite.<span class='px-1 mx-1 bg-yellow-200'>We conduct experiments on RepoTransBench to evaluate the translation performance of 11 advanced LLMs. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.734</span></span><span class='px-1 mx-1 bg-yellow-200'>We find that the Success@1 score (test success in one attempt) of the best-performing LLM is only 7.33%. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.705</span></span><span class='px-1 mx-1 bg-yellow-200'>To further explore the potential of LLMs for repository-level code translation, we provide LLMs with error-related feedback to perform iterative debugging and observe an average 7.09% improvement on Success@1. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.731</span></span><span class='px-1 mx-1 bg-yellow-200'>However, even with this improvement, the Success@1 score of the best-performing LLM is only 21%, which may not meet the need for reliable automatic repository-level code translation. 
<span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.715</span></span><span class='px-1 mx-1 bg-yellow-200'>Finally, we conduct a detailed error analysis and highlight current LLMs' deficiencies in repository-level code translation, which could provide a reference for further improvements. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.734</span></span></p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15177v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17744v1' target="_blank">
                   link
                 </a>
               </p>
@@ -2113,20 +2223,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                HPC-Coder-V2: Studying Code LLMs Across Low-Resource Parallel Languages
+                Deliberation in Latent Space via Differentiable Cache Augmentation
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p><span class='px-1 mx-1 bg-yellow-200'>Large Language Model (LLM) based coding tools have been tremendously successful as software development assistants, yet they are often designed for general purpose programming tasks and perform poorly for more specialized domains such as high performance computing. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.709</span></span><span class='px-1 mx-1 bg-yellow-200'>Creating specialized models and tools for these domains is crucial towards gaining the benefits of LLMs in areas such as HPC. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.723</span></span><span class='px-1 mx-1 bg-yellow-200'>While previous work has explored HPC-specific models, LLMs still struggle to generate parallel code and it is not at all clear what hurdles are still holding back these LLMs and what must be done to overcome them. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.734</span></span><span class='px-1 mx-1 bg-yellow-200'>In this work, we conduct an in-depth study along the many axes of fine-tuning a specialized HPC LLM in order to better understand the challenges. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.668</span></span><span class='px-1 mx-1 bg-yellow-200'>Based on our findings we fine-tune and evaluate a specialized HPC LLM that is shown to be the best performing open-source code LLM for parallel code generation to date. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.684</span></span></p>
+                  <p><span class='px-1 mx-1 bg-yellow-200'>Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.672</span></span>However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize.<span class='px-1 mx-1 bg-yellow-200'>In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.714</span></span>This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding.We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen.This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache.Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation.We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens.Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently reduces perplexity and improves performance across a range of reasoning-intensive tasks.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15178v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17747v1' target="_blank">
                   link
                 </a>
               </p>
@@ -2135,20 +2245,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation
+                ADC: Enhancing Function Calling Via Adversarial Datasets and Code Line-Level Feedback
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p><span class='px-1 mx-1 bg-yellow-200'>We present LlamaFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.604</span></span><span class='px-1 mx-1 bg-yellow-200'>LlamaFusion leverages existing Llama-3's weights for processing texts autoregressively while introducing additional and parallel transformer modules for processing images with diffusion. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.609</span></span>During training, the data from each modality is routed to its dedicated modules: modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while the shared self-attention layers allow interactions across text and image features.<span class='px-1 mx-1 bg-yellow-200'>By freezing the text-specific modules and only training the image-specific modules, LlamaFusion preserves the language capabilities of text-only LLMs while developing strong visual understanding and generation abilities. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.735</span></span><span class='px-1 mx-1 bg-yellow-200'>Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that, LlamaFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs while maintaining Llama-3's language capabilities. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.618</span></span>We also demonstrate that this framework can adapt existing vision-language models with multimodal generation ability.<span class='px-1 mx-1 bg-yellow-200'>Overall, this framework not only leverages existing computational investments in text-only LLMs but also enables the parallel development of language and vision capabilities, presenting a promising direction for efficient multimodal model development. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.651</span></span></p>
+                  <p><span class='px-1 mx-1 bg-yellow-200'>Large Language Models (LLMs) have made significant strides in Natural Language Processing and coding, yet they struggle with robustness and accuracy in complex function calls. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.631</span></span><span class='px-1 mx-1 bg-yellow-200'>To tackle these challenges, this paper introduces ADC, an innovative approach that enhances LLMs' ability to follow function formats and match complex parameters. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.647</span></span>ADC utilizes a high-quality code fine-tuning dataset with line-level execution feedback, providing granular process supervision that fosters strong logical reasoning and adherence to function formats.It also employs an adversarial dataset generation process to improve parameter matching.The staged training methodology capitalizes on both enriched code datasets and refined adversarial datasets, leading to marked improvements in function calling capabilities on the Berkeley Function-Calling Leaderboard (BFCL) Benchmark.<span class='px-1 mx-1 bg-yellow-200'>The innovation of ADC lies in its strategic combination of process supervision, adversarial refinement, and incremental learning, setting a new standard for LLM proficiency in complex function calling. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.608</span></span></p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15188v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17754v1' target="_blank">
                   link
                 </a>
               </p>
@@ -2157,20 +2267,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings
+                In Case You Missed It: ARC 'Challenge' Is Not That Challenging
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p>Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers.In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm.Our goal is to benchmark, under more realistic scenarios, RAG-based methods for the generation of verdicts - i.e., short texts discussing the veracity of a claim - evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases.<span class='px-1 mx-1 bg-yellow-200'>Our findings show a complex landscape, where, for example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases; larger models excel in verdict faithfulness, while smaller models provide better context adherence, with human evaluations favouring zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.626</span></span></p>
+                  <p><span class='px-1 mx-1 bg-yellow-200'>ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.671</span></span>Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged.We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA).In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15189v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17758v1' target="_blank">
                   link
                 </a>
               </p>
@@ -2179,20 +2289,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark
+                ResearchTown: Simulator of Human Research Community
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p><span class='px-1 mx-1 bg-yellow-200'>Multiple-choice question (MCQ) datasets like Massive Multitask Language Understanding (MMLU) are widely used to evaluate the commonsense, understanding, and problem-solving abilities of large language models (LLMs). <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.611</span></span><span class='px-1 mx-1 bg-yellow-200'>However, the open-source nature of these benchmarks and the broad sources of training data for LLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation results. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.642</span></span>To alleviate this issue, we propose a contamination-free and more challenging MCQ benchmark called MMLU-CF.<span class='px-1 mx-1 bg-yellow-200'>This benchmark reassesses LLMs' understanding of world knowledge by averting both unintentional and malicious data leakage. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.669</span></span>To avoid unintentional data leakage, we source data from a broader domain and design three decontamination rules.To prevent malicious data leakage, we divide the benchmark into validation and test sets with similar difficulty and subject distributions.The test set remains closed-source to ensure reliable results, while the validation set is publicly available to promote transparency and facilitate independent verification.<span class='px-1 mx-1 bg-yellow-200'>Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves merely a 5-shot score of 73.4% and a 0-shot score of 71.9% on the test set, which indicates the effectiveness of our approach in creating a more rigorous and contamination-free evaluation standard. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.664</span></span>The GitHub repository is available at https://github.com/microsoft/MMLU-CF and the dataset refers to https://huggingface.co/datasets/microsoft/MMLU-CF.</p>
+                  <p><span class='px-1 mx-1 bg-yellow-200'>Large Language Models (LLMs) have demonstrated remarkable potential in scientific domains, yet a fundamental question remains unanswered: Can we simulate human research communities with LLMs? <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.673</span></span>Addressing this question can deepen our understanding of the processes behind idea brainstorming and inspire the automatic discovery of novel scientific insights.In this work, we propose ResearchTown, a multi-agent framework for research community simulation.Within this framework, the human research community is simplified and modeled as an agent-data graph, where researchers and papers are represented as agent-type and data-type nodes, respectively, and connected based on their collaboration relationships.We also introduce TextGNN, a text-based inference framework that models various research activities (e.g., paper reading, paper writing, and review writing) as special forms of a unified message-passing process on the agent-data graph.To evaluate the quality of the research simulation, we present ResearchBench, a benchmark that uses a node-masking prediction task for scalable and objective assessment based on similarity.Our experiments reveal three key findings: (1) ResearchTown can provide a realistic simulation of collaborative research activities, including paper writing and review writing; (2) ResearchTown can maintain robust simulation with multiple researchers and diverse papers; (3) ResearchTown can generate interdisciplinary research ideas that potentially inspire novel research directions.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15194v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17767v1' target="_blank">
                   link
                 </a>
               </p>
@@ -2201,20 +2311,20 @@ <h2 class="text-2xl tracking-tight pt-4 font-bold">LLMs</h2>
         </td>
       </tr><tr>
           <td class="inline-block">
-            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-19</p>
+            <p class='font-bold text-black px-2 mx-1 text-xs w-24'>2024-12-23</p>
           </td>
           <td>
             <div x-data="{open: false}">
               <span @click="open = ! open" class="hover:underline cursor-pointer decoration-2 decoration-green-600 text-gray-800 text-sm">
-                LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
+                Memory makes computation universal, remember?
               </span>
               <div x-show="open" x-collapse.duration.500ms class="text-sm text-gray-500 pt-2">
                 <div class="text-center pt-2"></div>
                 <p class="pt-2">
-                  <p><span class='px-1 mx-1 bg-yellow-200'>This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.684</span></span>LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.To ensure the breadth and the practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds.We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint.Our evaluation reveals that the best-performing model, when directly answers the questions, achieves only 50.1% accuracy.In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%.These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2.The project is available at https://longbench2.github.io.</p>
+                  <p>Recent breakthroughs in AI capability have been attributed to increasingly sophisticated architectures and alignment techniques, but a simpler principle may explain these advances: memory makes computation universal.<span class='px-1 mx-1 bg-yellow-200'>Memory enables universal computation through two fundamental capabilities: recursive state maintenance and reliable history access. <span style='font-size: 0.65rem;' class='text-purple-500 font-bold'>0.601</span></span>We formally prove these requirements are both necessary and sufficient for universal computation.This principle manifests across scales, from cellular computation to neural networks to language models.Complex behavior emerges not from sophisticated processing units but from maintaining and accessing state across time.We demonstrate how parallel systems like neural networks achieve universal computation despite limitations in their basic units by maintaining state across iterations.This theoretical framework reveals a universal pattern: computational advances consistently emerge from enhanced abilities to maintain and access state rather than from more complex basic operations.Our analysis unifies understanding of computation across biological systems, artificial intelligence, and human cognition, reminding us that humanity's own computational capabilities have evolved in step with our technical ability to remember through oral traditions, writing, and now computing.</p>
                 </p>
               <p class="pb-2 pt-2 text-center">
-                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.15204v1' target="_blank">
+                <a class="underline decoration-2 text-green-600 text-md pt-2" href='http://arxiv.org/abs/2412.17794v1' target="_blank">
                   link
                 </a>
               </p>