From 41c3acbd5f9fcae89191da52f40911f179bf99b0 Mon Sep 17 00:00:00 2001
From: Automated
Date: Wed, 25 Dec 2024 09:08:16 +0000
Subject: [PATCH] Latest data: Wed Dec 25 09:08:16 UTC 2024

---
 index.html | 690 +++++++++++++++++++++++++++--------------------------
 1 file changed, 356 insertions(+), 334 deletions(-)

diff --git a/index.html b/index.html
index 37cd33b4..f196c87d 100644
--- a/index.html
+++ b/index.html
@@ -19,7 +19,7 @@

Vincent's Arxiv FrontPage


-

Generated on 2024-12-24.


+

Generated on 2024-12-25.


This frontpage is made by scraping arxiv and by running a sentence-model that detects if the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via Github Actions.


@@ -29,6 +29,116 @@

Vincent's Arx

New Datasets

+ +

2024-12-24

+ + +
+ + Muse: A Multimodal Conversational Recommendation Dataset with Scenario-Grounded User Profiles + +
+
+

+

Current conversational recommendation systems focus predominantly on text. However, real-world recommendation settings are generally multimodal, causing a significant gap between existing research and practical applications. To address this issue, we propose Muse, the first multimodal conversational recommendation dataset. Muse comprises 83,148 utterances from 7,000 conversations centered around the Clothing domain. Each conversation contains comprehensive multimodal interactions, rich elements, and natural dialogues. Data in Muse are automatically synthesized by a multi-agent framework powered by multimodal large language models (MLLMs). It innovatively derives user profiles from real-world scenarios rather than depending on manual design and history data for better scalability, and then it fulfills conversation simulation and optimization. Both human and LLM evaluations demonstrate the high quality of conversations in Muse. Additionally, fine-tuning experiments on three MLLMs demonstrate Muse's learnable patterns for recommendations and responses, confirming its value for multimodal conversational recommendation. Our dataset and codes are available at https://anonymous.4open.science/r/Muse-0086. 0.811

+

+

+ + link + +

+
+
+ + + +

2024-12-24

+ + +
+ + An Overview and Discussion of the Suitability of Existing Speech Datasets to Train Machine Learning Models for Collective Problem Solving + +
+
+

+

This report characterized the suitability of existing datasets for devising new Machine Learning models, decision making methods, and analysis algorithms to improve Collaborative Problem Solving and then enumerated requirements for future datasets to be devised. Problem solving was assumed to be performed in teams of about three, four members, which talked to each other. A dataset consists of the speech recordings of such teams. 0.908 The characterization methodology was based on metrics that capture cognitive, social, and emotional activities and situations. The report presented the analysis of a large group of datasets developed for Spoken Language Understanding, a research area with some similarity to Collaborative Problem Solving. 0.824

+

+

+ + link + +

+
+
+ + + +

2024-12-24

+ + +
+ + The Key of Understanding Vision Tasks: Explanatory Instructions + +
+
+

+

Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm, among others. In this paper, we explore the idea that CV adopts discrete and terminological task definitions (e.g., "image segmentation"), which may be a key barrier to zero-shot task generalization. Our hypothesis is that without truly understanding previously-seen tasks--due to these terminological definitions--deep models struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million "image input → explanatory instruction → output" triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. 0.714 By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously-seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be openly available on our GitHub repository. 0.891

+

+

+ + link + +

+
+
+ + + +

2024-12-24

+ + +
+ + ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation + +
+
+

+

Human-scene interaction (HSI) generation is crucial for applications in embodied AI, virtual reality, and robotics. While existing methods can synthesize realistic human motions in 3D scenes and generate plausible human-object interactions, they heavily rely on datasets containing paired 3D scene and motion capture data, which are expensive and time-consuming to collect across diverse environments and interactions. We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis by integrating video generation and neural human rendering. Our key insight is to leverage the rich motion priors learned by state-of-the-art video generation models, which have been trained on vast amounts of natural human movements and interactions, and use differentiable rendering to reconstruct human-scene interactions. ZeroHSI can synthesize realistic human motions in both static scenes and environments with dynamic objects, without requiring any ground-truth motion data. We evaluate ZeroHSI on a curated dataset of different types of various indoor and outdoor scenes with different interaction prompts, demonstrating its ability to generate diverse and contextually appropriate human-scene interactions. 0.765

+

+

+ + link + +

+
+
+ + + +

2024-12-24

+ + +
+ + Long-Form Speech Generation with Spoken Language Models + +
+
+

+

We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, current spoken language models struggle to generate plausible speech past tens of seconds, from high temporal resolution of speech tokens causing loss of coherence, to architectural issues with long-sequence training or extrapolation, to memory costs at inference time. With these considerations we propose SpeechSSM, the first speech language model to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates, based on recent advances in linear-time sequence modeling. Furthermore, to address growing challenges in spoken language evaluation, especially in this new long-form setting, we propose: new embedding-based and LLM-judged metrics; quality measurements over length and time; and a new benchmark for long-form speech processing and generation, LibriSpeech-Long. Speech samples and the dataset are released at https://google.github.io/tacotron/publications/speechssm/ 0.737

+

+

+ + link + +

+
+
+ +

2024-12-23

@@ -556,28 +666,6 @@

New Datasets

- - -

2024-12-17

- - -
- - Enabling Low-Resource Language Retrieval: Establishing Baselines for Urdu MS MARCO - -
-
-

-

As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity, addressing the needs of low-resource languages remains a significant challenge. This paper introduces the first large-scale Urdu IR dataset, created by translating the MS MARCO dataset through machine translation. 0.801 We establish baseline results through zero-shot learning for IR in Urdu and subsequently apply the mMARCO multilingual IR methodology to this newly translated dataset. Our findings demonstrate that the fine-tuned model (Urdu-mT5-mMARCO) achieves a Mean Reciprocal Rank (MRR@10) of 0.247 and a Recall@10 of 0.439, representing significant improvements over zero-shot results and showing the potential for expanding IR access for Urdu speakers. By bridging access gaps for speakers of low-resource languages, this work not only advances multilingual IR research but also emphasizes the ethical and societal importance of inclusive IR technologies. This work provides valuable insights into the challenges and solutions for improving language representation and lays the groundwork for future research, especially in South Asian languages, which can benefit from the adaptable methods used in this study.

-

-

- - link - -

-
-
-

2024-12-17

@@ -799,21 +887,26 @@

New Datasets

+ + +

Data Quality

+ + -

2024-12-16

+

2024-12-24

- Speak & Improve Challenge 2025: Tasks and Baseline Systems + A region-wide, multi-year set of crop field boundary labels for Africa

-

This paper presents the "Speak & Improve Challenge 2025: Spoken Language Assessment and Feedback" -- a challenge associated with the ISCA SLaTE 2025 Workshop. The goal of the challenge is to advance research on spoken language assessment and feedback, with tasks associated with both the underlying technology and language learning feedback. Linked with the challenge, the Speak & Improve (S&I) Corpus 2025 is being pre-released, a dataset of L2 learner English data with holistic scores and language error annotation, collected from open (spontaneous) speaking tests on the Speak & Improve learning platform. 0.839 The corpus consists of 340 hours of audio data from second language English learners with holistic scores, and a 60-hour subset with manual transcriptions and error labels. 0.718 The Challenge has four shared tasks: Automatic Speech Recognition (ASR), Spoken Language Assessment (SLA), Spoken Grammatical Error Correction (SGEC), and Spoken Grammatical Error Correction Feedback (SGECF). Each of these tasks has a closed track where a predetermined set of models and data sources are allowed to be used, and an open track where any public resource may be used. Challenge participants may do one or more of the tasks. This paper describes the challenge, the S&I Corpus 2025, and the baseline systems released for the Challenge. 0.738

+

African agriculture is undergoing rapid transformation. Annual maps of crop fields are key to understanding the nature of this transformation, but such maps are currently lacking and must be developed using advanced machine learning models trained on high resolution remote sensing imagery. To enable the development of such models, we delineated field boundaries in 33,746 Planet images captured between 2017 and 2023 across the continent using a custom labeling platform with built-in procedures for assessing and mitigating label error. We collected 42,403 labels, including 7,204 labels arising from tasks dedicated to assessing label quality (Class 1 labels), 32,167 from sites mapped once by a single labeller (Class 2) and 3,032 labels from sites where 3 or more labellers were tasked to map the same location (Class 4). 0.668 Class 1 labels were used to calculate labeller-specific quality scores, while Class 1 and 4 sites mapped by at least 3 labellers were used to further evaluate label uncertainty using a Bayesian risk metric. 0.672 Quality metrics showed that label quality was moderately high (0.75) for measures of total field extent, but low regarding the number of individual fields delineated (0.33), and the position of field edges (0.05). These values are expected when delineating small-scale fields in 3-5 m resolution imagery, which can be too coarse to reliably distinguish smaller fields, particularly in dense croplands, and therefore requires substantial labeller judgement. Nevertheless, previous work shows that such labels can train effective field mapping models. Furthermore, this large, probabilistic sample on its own provides valuable insight into regional agricultural characteristics, highlighting variations in the median field size and density. The imagery and vectorized labels along with quality information is available for download from two public repositories.

- + link

@@ -822,20 +915,20 @@

New Datasets

-

2024-12-16

+

2024-12-17

- Speak & Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback + Label Errors in the Tobacco3482 Dataset

-

We introduce the Speak & Improve Corpus 2025, a dataset of L2 learner English data with holistic scores and language error annotation, collected from open (spontaneous) speaking tests on the Speak & Improve learning platform https://speakandimprove.com . 0.85 The aim of the corpus release is to address a major challenge to developing L2 spoken language processing systems, the lack of publicly available data with high-quality annotations. It is being made available for non-commercial use on the ELiT website. In designing this corpus we have sought to make it cover a wide-range of speaker attributes, from their L1 to their speaking ability, as well as providing manual annotations. This enables a range of language-learning tasks to be examined, such as assessing speaking proficiency or providing feedback on grammatical errors in a learner's speech. Additionally, the data supports research into the underlying technology required for these tasks including automatic speech recognition (ASR) of low resource L2 learner English, disfluency detection or spoken grammatical error correction (GEC). The corpus consists of around 340 hours of L2 English learners audio with holistic scores, and a subset of audio annotated with transcriptions and error labels.

+

Tobacco3482 is a widely used document classification benchmark dataset. However, our manual inspection of the entire dataset uncovers widespread ontological issues, especially large amounts of annotation label problems in the dataset. 0.665 We establish data label guidelines and find that 11.7% of the dataset is improperly annotated and should either have an unknown label or a corrected label, and 16.7% of samples in the dataset have multiple valid labels. 0.812 We then analyze the mistakes of a top-performing model and find that 35% of the model's mistakes can be directly attributed to these label issues, highlighting the inherent problems with using a noisily labeled dataset as a benchmark. 0.747 Supplementary material, including dataset annotations and code, is available at https://github.com/gordon-lim/tobacco3482-mistakes/.

- + link

@@ -849,15 +942,15 @@

New Datasets

- Semi-automated analysis of audio-recorded lessons: The case of teachers' engaging messages + RepFace: Refining Closed-Set Noise with Progressive Label Correction for Face Recognition

-

Engaging messages delivered by teachers are a key aspect of the classroom discourse that influences student outcomes. However, improving this communication is challenging due to difficulties in obtaining observations. This study presents a methodology for efficiently extracting actual observations of engaging messages from audio-recorded lessons. We collected 2,477 audio-recorded lessons from 75 teachers over two academic years. 0.893 Using automatic transcription and keyword-based filtering analysis, we identified and classified engaging messages. This method reduced the information to be analysed by 90%, optimising the time and resources required compared to traditional manual coding. Subsequent descriptive analysis revealed that the most used messages emphasised the future benefits of participating in school activities. In addition, the use of engaging messages decreased as the academic year progressed. This study offers insights for researchers seeking to extract information from teachers' discourse in naturalistic settings and provides useful information for designing interventions to improve teachers' communication strategies.

+

Face recognition has made remarkable strides, driven by the expanding scale of datasets, advancements in various backbone and discriminative losses. However, face recognition performance is heavily affected by the label noise, especially closed-set noise. While numerous studies have focused on handling label noise, addressing closed-set noise still poses challenges. 0.769 This paper identifies this challenge as training isn't robust to noise at the early-stage training, and necessitating an appropriate learning strategy for samples with low confidence, which are often misclassified as closed-set noise in later training phases. To address these issues, we propose a new framework to stabilize the training at early stages and split the samples into clean, ambiguous and noisy groups which are devised with separate training strategies. Initially, we employ generated auxiliary closed-set noisy samples to enable the model to identify noisy data at the early stages of training. Subsequently, we introduce how samples are split into clean, ambiguous and noisy groups by their similarity to the positive and nearest negative centers. Then we perform label fusion for ambiguous samples by incorporating accumulated model predictions. 0.602 Finally, we apply label smoothing within the closed set, adjusting the label to a point between the nearest negative class and the initially assigned label. 0.624 Extensive experiments validate the effectiveness of our method on mainstream face datasets, achieving state-of-the-art results. The code will be released upon acceptance.

- + link

@@ -866,20 +959,20 @@

New Datasets

-

2024-12-16

+

2024-12-11

- CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding + CAT: Class Aware Adaptive Thresholding for Semi-Supervised Domain Generalization

-

Most existing video understanding benchmarks for multimodal large language models (MLLMs) focus only on short videos. The limited number of benchmarks for long video understanding often rely solely on multiple-choice questions (MCQs). However, because of the inherent limitation of MCQ-based evaluation and the increasing reasoning ability of MLLMs, models can give the current answer purely by combining short video understanding with elimination, without genuinely understanding the video content. To address this gap, we introduce CG-Bench, a novel benchmark designed for clue-grounded question answering in long videos. CG-Bench emphasizes the model's ability to retrieve relevant clues for questions, enhancing evaluation credibility. It features 1,219 manually curated videos categorized by a granular system with 14 primary categories, 171 secondary categories, and 638 tertiary categories, making it the largest benchmark for long video analysis. The benchmark includes 12,129 QA pairs in three major question types: perception, reasoning, and hallucination. Compensating the drawbacks of pure MCQ-based evaluation, we design two novel clue-based evaluation methods: clue-grounded white box and black box evaluations, to assess whether the model generates answers based on the correct understanding of the video. We evaluate multiple closed-source and open-source MLLMs on CG-Bench. Results indicate that current models significantly underperform in understanding long videos compared to short ones, and a significant gap exists between open-source and commercial models. We hope CG-Bench can advance the development of more trustworthy and capable MLLMs for long video understanding. All annotations and video data are released at https://cg-bench.github.io/leaderboard/. 0.772

+

Domain Generalization (DG) seeks to transfer knowledge from multiple source domains to unseen target domains, even in the presence of domain shifts. Achieving effective generalization typically requires a large and diverse set of labeled source data to learn robust representations that can generalize to new, unseen domains. However, obtaining such high-quality labeled data is often costly and labor-intensive, limiting the practical applicability of DG. To address this, we investigate a more practical and challenging problem: semi-supervised domain generalization (SSDG) under a label-efficient paradigm. In this paper, we propose a novel method, CAT, which leverages semi-supervised learning with limited labeled data to achieve competitive generalization performance under domain shifts. Our method addresses key limitations of previous approaches, such as reliance on fixed thresholds and sensitivity to noisy pseudo-labels. 0.602 CAT combines adaptive thresholding with noisy label refinement techniques, creating a straightforward yet highly effective solution for SSDG tasks. Specifically, our approach uses flexible thresholding to generate high-quality pseudo-labels with higher class diversity while refining noisy pseudo-labels to improve their reliability. 0.679 Extensive experiments across multiple benchmark datasets demonstrate the superior performance of our method, highlighting its effectiveness in achieving robust generalization under domain shift.

- + link

@@ -888,20 +981,20 @@

New Datasets

-

2024-12-16

+

2024-12-10

- Instruction-based Image Manipulation by Watching How Things Move + Defending Against Neural Network Model Inversion Attacks via Data Poisoning

-

This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. 0.707 Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics, such as non-rigid subject motion and complex camera movements, that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. 0.707 Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.

+

Model inversion attacks pose a significant privacy threat to machine learning models by reconstructing sensitive data from their outputs. While various defenses have been proposed to counteract these attacks, they often come at the cost of the classifier's utility, thus creating a challenging trade-off between privacy protection and model utility. Moreover, most existing defenses require retraining the classifier for enhanced robustness, which is impractical for large-scale, well-established models. This paper introduces a novel defense mechanism to better balance privacy and utility, particularly against adversaries who employ a machine learning model (i.e., inversion model) to reconstruct private data. Drawing inspiration from data poisoning attacks, which can compromise the performance of machine learning models, we propose a strategy that leverages data poisoning to contaminate the training data of inversion models, thereby preventing model inversion attacks. Two defense methods are presented. The first, termed label-preserving poisoning attacks for all output vectors (LPA), involves subtle perturbations to all output vectors while preserving their labels. Our findings demonstrate that these minor perturbations, introduced through a data poisoning approach, significantly increase the difficulty of data reconstruction without compromising the utility of the classifier. 0.602 Subsequently, we introduce a second method, label-flipping poisoning for partial output vectors (LFP), which selectively perturbs a small subset of output vectors and alters their labels during the process. Empirical results indicate that LPA is notably effective, outperforming the current state-of-the-art defenses. Our data poisoning-based defense provides a new retraining-free defense paradigm that preserves the victim classifier's utility.

- + link

@@ -911,24 +1004,24 @@

New Datasets

-

Data Quality

+

Benchmarks

-

2024-12-17

+

2024-12-24

- Label Errors in the Tobacco3482 Dataset + Re-assessing ImageNet: How aligned is its single-label assumption with its multi-label nature?

-

Tobacco3482 is a widely used document classification benchmark dataset. However, our manual inspection of the entire dataset uncovers widespread ontological issues, especially large amounts of annotation label problems in the dataset. 0.665 We establish data label guidelines and find that 11.7% of the dataset is improperly annotated and should either have an unknown label or a corrected label, and 16.7% of samples in the dataset have multiple valid labels. 0.812 We then analyze the mistakes of a top-performing model and find that 35% of the model's mistakes can be directly attributed to these label issues, highlighting the inherent problems with using a noisily labeled dataset as a benchmark. 0.747 Supplementary material, including dataset annotations and code, is available at https://github.com/gordon-lim/tobacco3482-mistakes/.

+

ImageNet, an influential dataset in computer vision, is traditionally evaluated using single-label classification, which assumes that an image can be adequately described by a single concept or label. However, this approach may not fully capture the complex semantics within the images available in ImageNet, potentially hindering the development of models that effectively learn these intricacies. This study critically examines the prevalent single-label benchmarking approach and advocates for a shift to multi-label benchmarking for ImageNet. 0.608 This shift would enable a more comprehensive assessment of the capabilities of deep neural network (DNN) models. We analyze the effectiveness of pre-trained state-of-the-art DNNs on ImageNet and one of its variants, ImageNetV2. Studies in the literature have reported unexpected accuracy drops of 11% to 14% on ImageNetV2. Our findings show that these reported declines are largely attributable to a characteristic of the dataset that has not received sufficient attention -- the proportion of images with multiple labels. Taking this characteristic into account, the results of our experiments provide evidence that there is no substantial degradation in effectiveness on ImageNetV2. Furthermore, we acknowledge that ImageNet pre-trained models exhibit some capability at capturing the multi-label nature of the dataset even though they were trained under the single-label assumption. Consequently, we propose a new evaluation approach to augment existing approaches that assess this capability. Our findings highlight the importance of considering the multi-label nature of the ImageNet dataset during benchmarking. Failing to do so could lead to incorrect conclusions regarding the effectiveness of DNNs and divert research efforts from addressing other substantial challenges related to the reliability and robustness of these models.

- + link

@@ -937,20 +1030,20 @@

Data Quality

-

2024-12-16

+

2024-12-24

- RepFace: Refining Closed-Set Noise with Progressive Label Correction for Face Recognition + Multilingual Mathematical Reasoning: Advancing Open-Source LLMs in Hindi and English

-

Face recognition has made remarkable strides, driven by the expanding scale of datasets, advancements in various backbone and discriminative losses. However, face recognition performance is heavily affected by the label noise, especially closed-set noise. While numerous studies have focused on handling label noise, addressing closed-set noise still poses challenges. 0.769 This paper identifies this challenge as training isn't robust to noise at the early-stage training, and necessitating an appropriate learning strategy for samples with low confidence, which are often misclassified as closed-set noise in later training phases. To address these issues, we propose a new framework to stabilize the training at early stages and split the samples into clean, ambiguous and noisy groups which are devised with separate training strategies. Initially, we employ generated auxiliary closed-set noisy samples to enable the model to identify noisy data at the early stages of training. Subsequently, we introduce how samples are split into clean, ambiguous and noisy groups by their similarity to the positive and nearest negative centers. Then we perform label fusion for ambiguous samples by incorporating accumulated model predictions. 0.602 Finally, we apply label smoothing within the closed set, adjusting the label to a point between the nearest negative class and the initially assigned label. 0.624 Extensive experiments validate the effectiveness of our method on mainstream face datasets, achieving state-of-the-art results. The code will be released upon acceptance.

+

Large Language Models (LLMs) excel in linguistic tasks but struggle with mathematical reasoning, particularly in non English languages like Hindi. This research aims to enhance the mathematical reasoning skills of smaller, resource efficient open-source LLMs in both Hindi and English. We evaluate models like OpenHathi 7B, LLaMA-2 7B, WizardMath 7B, Mistral 7B, LLeMMa 7B, MAmmoTH 7B, Gemini Pro, and GPT-4 using zero-shot, few-shot chain-of-thought (CoT) methods, and supervised fine-tuning. Our approach incorporates curriculum learning, progressively training models on increasingly difficult problems, a novel Decomposition Strategy to simplify complex arithmetic operations, and a Structured Solution Design that divides solutions into phases. Our experiments result in notable performance enhancements. 0.67 WizardMath 7B exceeds Gemini's accuracy on English datasets by +6% and matches Gemini's performance on Hindi datasets. Adopting a bilingual approach that combines English and Hindi samples achieves results comparable to individual language models, demonstrating the capability to learn mathematical reasoning in both languages. This research highlights the potential for improving mathematical reasoning in open-source LLMs.

- + link

@@ -959,20 +1052,20 @@

Data Quality

-

2024-12-11

+

2024-12-24

- CAT: Class Aware Adaptive Thresholding for Semi-Supervised Domain Generalization + Unlocking the Potential of Multiple BERT Models for Bangla Question Answering in NCTB Textbooks

-

Domain Generalization (DG) seeks to transfer knowledge from multiple source domains to unseen target domains, even in the presence of domain shifts. Achieving effective generalization typically requires a large and diverse set of labeled source data to learn robust representations that can generalize to new, unseen domains. However, obtaining such high-quality labeled data is often costly and labor-intensive, limiting the practical applicability of DG. To address this, we investigate a more practical and challenging problem: semi-supervised domain generalization (SSDG) under a label-efficient paradigm. In this paper, we propose a novel method, CAT, which leverages semi-supervised learning with limited labeled data to achieve competitive generalization performance under domain shifts. Our method addresses key limitations of previous approaches, such as reliance on fixed thresholds and sensitivity to noisy pseudo-labels. 0.602 CAT combines adaptive thresholding with noisy label refinement techniques, creating a straightforward yet highly effective solution for SSDG tasks. Specifically, our approach uses flexible thresholding to generate high-quality pseudo-labels with higher class diversity while refining noisy pseudo-labels to improve their reliability. 0.679 Extensive experiments across multiple benchmark datasets demonstrate the superior performance of our method, highlighting its effectiveness in achieving robust generalization under domain shift.

+

Evaluating text comprehension in educational settings is critical for understanding student performance and improving curricular effectiveness. This study investigates the capability of state-of-the-art language models (RoBERTa Base, Bangla-BERT, and BERT Base) in automatically assessing Bangla passage-based question-answering from the National Curriculum and Textbook Board (NCTB) textbooks for classes 6-10. A dataset of approximately 3,000 Bangla passage-based question-answering instances was compiled, and the models were evaluated using F1 Score and Exact Match (EM) metrics across various hyperparameter configurations. Our findings revealed that Bangla-BERT consistently outperformed the other models, achieving the highest F1 (0.75) and EM (0.53) scores, particularly with smaller batch sizes, the inclusion of stop words, and a moderate learning rate. In contrast, RoBERTa Base demonstrated the weakest performance, with the lowest F1 (0.19) and EM (0.27) scores under certain configurations. 0.65 The results underscore the importance of fine-tuning hyperparameters for optimizing model performance and highlight the potential of machine learning models in evaluating text comprehension in educational contexts. However, limitations such as dataset size, spelling inconsistencies, and computational constraints emphasize the need for further research to enhance the robustness and applicability of these models. This study lays the groundwork for the future development of automated evaluation systems in educational institutions, providing critical insights into model performance in the context of Bangla text comprehension.

- + link

@@ -981,20 +1074,20 @@

Data Quality

-

2024-12-10

+

2024-12-24

- Defending Against Neural Network Model Inversion Attacks via Data Poisoning + Is Large Language Model Good at Triple Set Prediction? An Empirical Study

-

Model inversion attacks pose a significant privacy threat to machine learning models by reconstructing sensitive data from their outputs.While various defenses have been proposed to counteract these attacks, they often come at the cost of the classifier's utility, thus creating a challenging trade-off between privacy protection and model utility.Moreover, most existing defenses require retraining the classifier for enhanced robustness, which is impractical for large-scale, well-established models.This paper introduces a novel defense mechanism to better balance privacy and utility, particularly against adversaries who employ a machine learning model (i.e., inversion model) to reconstruct private data.Drawing inspiration from data poisoning attacks, which can compromise the performance of machine learning models, we propose a strategy that leverages data poisoning to contaminate the training data of inversion models, thereby preventing model inversion attacks. Two defense methods are presented.The first, termed label-preserving poisoning attacks for all output vectors (LPA), involves subtle perturbations to all output vectors while preserving their labels.Our findings demonstrate that these minor perturbations, introduced through a data poisoning approach, significantly increase the difficulty of data reconstruction without compromising the utility of the classifier. 0.602Subsequently, we introduce a second method, label-flipping poisoning for partial output vectors (LFP), which selectively perturbs a small subset of output vectors and alters their labels during the process.Empirical results indicate that LPA is notably effective, outperforming the current state-of-the-art defenses.Our data poisoning-based defense provides a new retraining-free defense paradigm that preserves the victim classifier's utility.

+

The core of the Knowledge Graph Completion (KGC) task is to predict and complete the missing relations or nodes in a KG. Common KGC tasks are mostly about inferring unknown elements with one or two elements being known in a triple. In comparison, the Triple Set Prediction (TSP) task is a more realistic knowledge graph completion task. It aims to predict all elements of unknown triples based on the information from known triples. In recent years, large language models (LLMs) have exhibited significant advancements in language comprehension, demonstrating considerable potential for KGC tasks. However, the potential of LLMs on the TSP task has yet to be investigated. Thus, in this paper, we propose a new framework to explore the strengths and limitations of LLMs in the TSP task. Specifically, the framework consists of LLM-based rule mining and LLM-based triple set prediction. The KG's relation list, which carries rich semantic information, is first used to prompt the LLM to generate rules. This process is both efficient and independent of statistical information, making it easier to mine effective and realistic rules. For each subgraph, the specified rule is applied in conjunction with the relevant triples within that subgraph to guide the LLM in predicting the missing triples. Subsequently, the predictions from all subgraphs are consolidated to derive the complete set of predicted triples on the KG. Finally, the method is evaluated on the relatively complete CFamily dataset. 0.618 The experimental results indicate that when LLMs are required to adhere to a large amount of factual knowledge to predict missing triples, significant hallucinations occur, leading to a noticeable decline in performance. To further explore the causes of this phenomenon, this paper presents a comprehensive analysis supported by a detailed case study.

- + link

@@ -1003,20 +1096,20 @@

Data Quality

-

2024-12-03

+

2024-12-24

- Class-wise Autoencoders Measure Classification Difficulty And Detect Label Mistakes + Bayesian Optimization of Bilevel Problems

-

We introduce a new framework for analyzing classification datasets based on the ratios of reconstruction errors between autoencoders trained on individual classes.This analysis framework enables efficient characterization of datasets on the sample, class, and entire dataset levels.We define reconstruction error ratios (RERs) that probe classification difficulty and allow its decomposition into (1) finite sample size and (2) Bayes error and decision-boundary complexity.Through systematic study across 19 popular visual datasets, we find that our RER-based dataset difficulty probe strongly correlates with error rate for state-of-the-art (SOTA) classification models.By interpreting sample-level classification difficulty as a label mistakenness score, we further find that RERs achieve SOTA performance on mislabel detection tasks on hard datasets under symmetric and asymmetric label noise. 0.664Our code is publicly available at https://github.com/voxel51/reconstruction-error-ratios.

+

Bilevel optimization, a hierarchical mathematical framework where one optimization problem is nested within another, has emerged as a powerful tool for modeling complex decision-making processes in various fields such as economics, engineering, and machine learning.This paper focuses on bilevel optimization where both upper-level and lower-level functions are black boxes and expensive to evaluate.We propose a Bayesian Optimization framework that models the upper and lower-level functions as Gaussian processes over the combined space of upper and lower-level decisions, allowing us to exploit knowledge transfer between different sub-problems.Additionally, we propose a novel acquisition function for this model.Our experimental results demonstrate that the proposed algorithm is highly sample-efficient and outperforms existing methods in finding high-quality solutions. 0.667

- + link

@@ -1024,26 +1117,21 @@

Data Quality

- - -

Benchmarks

- - -

2024-12-23

+

2024-12-24

- Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities + SHARQ: Explainability Framework for Association Rules on Relational Data

-

Traditional analyses of gradient descent optimization show that, when the largest eigenvalue of the loss Hessian, often referred to as the sharpness, is below a critical learning-rate threshold, training is 'stable' and training loss decreases monotonically. Recent studies, however, have suggested that the majority of modern deep neural networks achieve good performance despite operating outside this stable regime. In this work, we demonstrate that such instabilities, induced by large learning rates, move model parameters toward flatter regions of the loss landscape. Our crucial insight lies in noting that, during these instabilities, the orientation of the Hessian eigenvectors rotates. This, we conjecture, allows the model to explore regions of the loss landscape that display more desirable geometrical properties for generalization, such as flatness. These rotations are a consequence of network depth, and we prove that for any network with depth > 1, unstable growth in parameters causes rotations in the principal components of the Hessian, which promote exploration of the parameter space away from unstable directions. Our empirical studies reveal an implicit regularization effect in gradient descent with large learning rates operating beyond the stability threshold. We find these lead to excellent generalization performance on modern benchmark datasets. 0.669

+

Association rules are an important technique for gaining insights over large relational datasets consisting of tuples of elements (i.e., attribute-value pairs). However, it is difficult to explain the relative importance of data elements with respect to the rules in which they appear. This paper develops a measure of an element's contribution to a set of association rules based on Shapley values, denoted SHARQ (ShApley Rules Quantification). As is the case with many Shapley-based computations, the cost of a naive calculation of the score is exponential in the number of elements. To that end, we present an efficient framework for computing the exact SHARQ value of a single element whose running time is practically linear in the number of rules. Going one step further, we develop an efficient multi-element SHARQ algorithm which amortizes the cost of the single element SHARQ calculation over a set of elements. Based on the definition of SHARQ for elements, we describe two additional use cases for association rules explainability: rule importance and attribute importance. Extensive experiments over a novel benchmark dataset containing 45 instances of mined rule sets show the effectiveness of our approach. 0.601

- + link

@@ -1052,20 +1140,20 @@

Benchmarks

-

2024-12-23

+

2024-12-24

- Graph Neural Networks Are Evolutionary Algorithms + Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability

-

In this paper, we reveal the intrinsic duality between graph neural networks (GNNs) and evolutionary algorithms (EAs), bridging two traditionally distinct fields.Building on this insight, we propose Graph Neural Evolution (GNE), a novel evolutionary algorithm that models individuals as nodes in a graph and leverages designed frequency-domain filters to balance global exploration and local exploitation.Through the use of these filters, GNE aggregates high-frequency (diversity-enhancing) and low-frequency (stability-promoting) information, transforming EAs into interpretable and tunable mechanisms in the frequency domain.Extensive experiments on benchmark functions demonstrate that GNE consistently outperforms state-of-the-art algorithms such as GA, DE, CMA-ES, SDAES, and RL-SHADE, excelling in complex landscapes, optimal solution shifts, and noisy environments.Its robustness, adaptability, and superior convergence highlight its practical and theoretical value. 0.639Beyond optimization, GNE establishes a conceptual and mathematical foundation linking EAs and GNNs, offering new perspectives for both fields.Its framework encourages the development of task-adaptive filters and hybrid approaches for EAs, while its insights can inspire advances in GNNs, such as improved global information propagation and mitigation of oversmoothing.GNE's versatility extends to solving challenges in machine learning, including hyperparameter tuning and neural architecture search, as well as real-world applications in engineering and operations research.By uniting the dynamics of EAs with the structural insights of GNNs, this work provides a foundation for interdisciplinary innovation, paving the way for scalable and interpretable solutions to complex optimization problems.

+

To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate the overall rankings. 0.617 This approach incentivizes models to achieve a balance rather than excelling in one dimension at the expense of the others. In the first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models.

- + link

@@ -1074,20 +1162,20 @@

Benchmarks

-

2024-12-23

+

2024-12-24

- SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC + Distilling Fine-grained Sentiment Understanding from Large Language Models

-

The availability of challenging simulation environments is pivotal for advancing the field of Multi-Agent Reinforcement Learning (MARL).In cooperative MARL settings, the StarCraft Multi-Agent Challenge (SMAC) has gained prominence as a benchmark for algorithms following centralized training with decentralized execution paradigm.However, with continual advancements in SMAC, many algorithms now exhibit near-optimal performance, complicating the evaluation of their true effectiveness. 0.658To alleviate this problem, in this work, we highlight a critical issue: the default opponent policy in these environments lacks sufficient diversity, leading MARL algorithms to overfit and exploit unintended vulnerabilities rather than learning robust strategies.To overcome these limitations, we propose SMAC-HARD, a novel benchmark designed to enhance training robustness and evaluation comprehensiveness.SMAC-HARD supports customizable opponent strategies, randomization of adversarial policies, and interfaces for MARL self-play, enabling agents to generalize to varying opponent behaviors and improve model stability.Furthermore, we introduce a black-box testing framework wherein agents are trained without exposure to the edited opponent scripts but are tested against these scripts to evaluate the policy coverage and adaptability of MARL algorithms.We conduct extensive evaluations of widely used and state-of-the-art algorithms on SMAC-HARD, revealing the substantial challenges posed by edited and mixed strategy opponents.Additionally, the black-box strategy tests illustrate the difficulty of transferring learned policies to unseen adversaries.We envision SMAC-HARD as a critical step toward benchmarking the next generation of MARL algorithms, fostering progress in self-play methods for multi-agent systems.Our code is available at https://github.com/devindeng94/smac-hard.

+

Fine-grained sentiment analysis (FSA) aims to extract and summarize user opinions from vast opinionated text. Recent studies demonstrate that large language models (LLMs) possess exceptional sentiment understanding capabilities. However, directly deploying LLMs for FSA applications incurs high inference costs. Therefore, this paper investigates the distillation of fine-grained sentiment understanding from LLMs into small language models (SLMs). We prompt LLMs to examine and interpret the sentiments of given reviews and then utilize the generated content to pretrain SLMs. Additionally, we develop a comprehensive FSA benchmark to evaluate both SLMs and LLMs. 0.644 Extensive experiments on this benchmark reveal that: (1) distillation significantly enhances the performance of SLMs in FSA tasks, achieving a 6.00% improvement in F1-score, and the distilled model can outperform Llama-2-7b with only 220M parameters; (2) distillation equips SLMs with excellent zero-shot sentiment classification capabilities, enabling them to match or even exceed their teacher models. These results suggest that distillation from LLMs is a highly promising direction for FSA. We will release our code, data, and pretrained model weights at https://github.com/HITSZ-HLT/FSA-Distillation.

- + link

@@ -1096,20 +1184,20 @@

Benchmarks

-

2024-12-23

+

2024-12-24

- Knowledge Editing through Chain-of-Thought + Efficient Aircraft Design Optimization Using Multi-Fidelity Models and Multi-fidelity Physics Informed Neural Networks

-

Large Language Models (LLMs) have demonstrated exceptional capabilities across a wide range of natural language processing (NLP) tasks.However, keeping these models up-to-date with evolving world knowledge remains a significant challenge due to the high costs of frequent retraining.To address this challenge, knowledge editing techniques have emerged to update LLMs with new information without rebuilding the model from scratch.Among these, the in-context editing paradigm stands out for its effectiveness in integrating new knowledge while preserving the model's original capabilities.Despite its potential, existing in-context knowledge editing methods are often task-specific, focusing primarily on multi-hop QA tasks using structured knowledge triples.Moreover, their reliance on few-shot prompting for task decomposition makes them unstable and less effective in generalizing across diverse tasks. In response to these limitations, we propose EditCoT, a novel knowledge editing framework that flexibly and efficiently updates LLMs across various tasks without retraining.EditCoT works by generating a chain-of-thought (CoT) for a given input and then iteratively refining this CoT process using a CoT editor based on updated knowledge.We evaluate EditCoT across a diverse range of benchmarks, covering multiple languages and tasks.The results demonstrate that our approach achieves state-of-the-art performance while offering superior generalization, effectiveness, and stability compared to existing methods, marking a significant advancement in the field of knowledge updating. 0.636Code and data are available at: https://github.com/bebr2/EditCoT.

+

Aircraft design optimization traditionally relies on computationally expensive simulation techniques such as Finite Element Method (FEM) and Finite Volume Method (FVM), which, while accurate, can significantly slow down the design iteration process.The challenge lies in reducing the computational complexity while maintaining high accuracy for quick evaluations of multiple design alternatives. 0.61This research explores advanced methods, including surrogate models, reduced-order models (ROM), and multi-fidelity machine learning techniques, to achieve more efficient aircraft design evaluations.Specifically, the study investigates the application of Multi-fidelity Physics-Informed Neural Networks (MPINN) and autoencoders for manifold alignment, alongside the potential of Generative Adversarial Networks (GANs) for refining design geometries.Through a proof-of-concept task, the research demonstrates the ability to predict high-fidelity results from low-fidelity simulations, offering a path toward faster and more cost effective aircraft design iterations.

- + link

@@ -1118,20 +1206,20 @@

Benchmarks

-

2024-12-23

+

2024-12-24

- RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation + How Well Do LLMs Generate Code for Different Application Domains? Benchmark and Evaluation

-

Repository-level code translation refers to translating an entire code repository from one programming language to another while preserving the functionality of the source repository. Many benchmarks have been proposed to evaluate the performance of such code translators. 0.612 However, previous benchmarks mostly provide fine-grained samples, focusing on snippet-level, function-level, or file-level code translation. 0.685 Such benchmarks do not accurately reflect real-world demands, where entire repositories often need to be translated, involving longer code and more complex functionalities. To address this gap, we propose a new benchmark, named RepoTransBench, which is a real-world repository-level code translation benchmark with an automatically executable test suite. We conduct experiments on RepoTransBench to evaluate the translation performance of 11 advanced LLMs. We find that the Success@1 score (test success in one attempt) of the best-performing LLM is only 7.33%. To further explore the potential of LLMs for repository-level code translation, we provide LLMs with error-related feedback to perform iterative debugging and observe an average 7.09% improvement on Success@1. However, even with this improvement, the Success@1 score of the best-performing LLM is only 21%, which may not meet the need for reliable automatic repository-level code translation. Finally, we conduct a detailed error analysis and highlight current LLMs' deficiencies in repository-level code translation, which could provide a reference for further improvements.

+

Recently, an increasing number of AI-driven programming assistants powered by code LLMs have been integrated into various real-world software development environments, significantly boosting developer productivity.However, existing code generation benchmarks primarily focus on general-purpose scenarios, leaving the code generation performance of LLMs for specific application domains largely unknown.In this paper, we introduce a new benchmark, MultiCodeBench, to fill this gap. 0.607MultiCodeBench comprises 2,400 programming tasks, covering 12 popular software development domains and 15 programming languages.Specifically, we perform in-depth research to identify these 12 application domains.Given that each domain may involve multiple technical frameworks, and that different frameworks present distinct challenges in the coding process, we categorize the commonly used frameworks and platforms within each domain.We then sample programming problems from GitHub repositories related to these subdomains.To ensure the quality of the tasks and mitigate data leakage issues, we invite annotators to rewrite the docstrings for each task in MultiCodeBench.Additionally, we build a static analysis-based dependency parsing tool to extract the dependencies in the ground truth for each task, enabling deeper performance analysis.Through extensive experiments on MultiCodeBench with eleven representative mainstream LLMs, we reveal the code generation performance of the LLMs across different application domains, providing practical insights for developers in downstream fields when selecting LLMs.Furthermore, we analyze the reasons behind the models' failures in completing software application development tasks, offering guidance for model developers to enhance domain-specific code generation capabilities.

- + link

@@ -1140,20 +1228,20 @@

Benchmarks

-

2024-12-23

+

2024-12-24

- Group Testing with General Correlation Using Hypergraphs + Resolution-Robust 3D MRI Reconstruction with 2D Diffusion Priors: Diverse-Resolution Training Outperforms Interpolation

-

Group testing, a problem with diverse applications across multiple disciplines, traditionally assumes independence across nodes' states.Recent research, however, focuses on real-world scenarios that often involve correlations among nodes, challenging the simplifying assumptions made in existing models.In this work, we consider a comprehensive model for arbitrary statistical correlation among nodes' states.To capture and leverage these correlations effectively, we model the problem by hypergraphs, inspired by [GLS22], augmented by a probability mass function on the hyper-edges. Using this model, we first design a novel greedy adaptive algorithm capable of conducting informative tests and dynamically updating the distribution.Performance analysis provides upper bounds on the number of tests required, which depend solely on the entropy of the underlying probability distribution and the average number of infections. 0.625We demonstrate that the algorithm recovers or improves upon all previously known results for group testing settings with correlation.Additionally, we provide families of graphs where the algorithm is order-wise optimal and give examples where the algorithm or its analysis is not tight.We then generalize the proposed framework of group testing with general correlation in two directions, namely noisy group testing and semi-non-adaptive group testing.In both settings, we provide novel theoretical bounds on the number of tests required.

+

Deep learning-based 3D imaging, in particular magnetic resonance imaging (MRI), is challenging because of limited availability of 3D training data. Therefore, 2D diffusion models trained on 2D slices are starting to be leveraged for 3D MRI reconstruction. However, as we show in this paper, existing methods pertain to a fixed voxel size, and performance degrades when the voxel size is varied, as is often the case in clinical practice. In this paper, we propose and study several approaches for resolution-robust 3D MRI reconstruction with 2D diffusion priors. As a result of this investigation, we obtain a simple resolution-robust variational 3D reconstruction approach based on diffusion-guided regularization of randomly sampled 2D slices. This method provides competitive reconstruction quality compared to posterior sampling baselines. 0.648 Towards resolving the sensitivity to resolution shifts, we investigate state-of-the-art model-based approaches including Gaussian splatting, neural representations, and infinite-dimensional diffusion models, as well as a simple data-centric approach of training the diffusion model on several resolutions. Our experiments demonstrate that the model-based approaches fail to close the performance gap in 3D MRI. In contrast, the data-centric approach of training the diffusion model on various resolutions effectively provides a resolution-robust method without compromising accuracy.

- + link

@@ -1167,15 +1255,15 @@

Benchmarks

- In Case You Missed It: ARC 'Challenge' Is Not That Challenging + Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities

-

ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity.Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged.We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). 0.605In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.

+

Traditional analyses of gradient descent optimization show that, when the largest eigenvalue of the loss Hessian, often referred to as the sharpness, is below a critical learning-rate threshold, training is 'stable' and training loss decreases monotonically. Recent studies, however, have suggested that the majority of modern deep neural networks achieve good performance despite operating outside this stable regime. In this work, we demonstrate that such instabilities, induced by large learning rates, move model parameters toward flatter regions of the loss landscape. Our crucial insight lies in noting that, during these instabilities, the orientation of the Hessian eigenvectors rotates. This, we conjecture, allows the model to explore regions of the loss landscape that display more desirable geometrical properties for generalization, such as flatness. These rotations are a consequence of network depth, and we prove that for any network with depth > 1, unstable growth in parameters causes rotations in the principal components of the Hessian, which promote exploration of the parameter space away from unstable directions. Our empirical studies reveal an implicit regularization effect in gradient descent with large learning rates operating beyond the stability threshold. We find these lead to excellent generalization performance on modern benchmark datasets. 0.669

- + link

@@ -1189,15 +1277,15 @@

Benchmarks

- Large Motion Video Autoencoding with Cross-modal Video VAE + Graph Neural Networks Are Evolutionary Algorithms

-

Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding. Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. 0.825 The project website can be found at https://yzxing87.github.io/vae/.

+

In this paper, we reveal the intrinsic duality between graph neural networks (GNNs) and evolutionary algorithms (EAs), bridging two traditionally distinct fields.Building on this insight, we propose Graph Neural Evolution (GNE), a novel evolutionary algorithm that models individuals as nodes in a graph and leverages designed frequency-domain filters to balance global exploration and local exploitation.Through the use of these filters, GNE aggregates high-frequency (diversity-enhancing) and low-frequency (stability-promoting) information, transforming EAs into interpretable and tunable mechanisms in the frequency domain.Extensive experiments on benchmark functions demonstrate that GNE consistently outperforms state-of-the-art algorithms such as GA, DE, CMA-ES, SDAES, and RL-SHADE, excelling in complex landscapes, optimal solution shifts, and noisy environments.Its robustness, adaptability, and superior convergence highlight its practical and theoretical value. 0.639Beyond optimization, GNE establishes a conceptual and mathematical foundation linking EAs and GNNs, offering new perspectives for both fields.Its framework encourages the development of task-adaptive filters and hybrid approaches for EAs, while its insights can inspire advances in GNNs, such as improved global information propagation and mitigation of oversmoothing.GNE's versatility extends to solving challenges in machine learning, including hyperparameter tuning and neural architecture search, as well as real-world applications in engineering and operations research.By uniting the dynamics of EAs with the structural insights of GNNs, this work provides a foundation for interdisciplinary innovation, paving the way for scalable and interpretable solutions to complex optimization problems.

- + link

@@ -1211,15 +1299,15 @@

Benchmarks

- Cross-View Referring Multi-Object Tracking + SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC

-

Referring Multi-Object Tracking (RMOT) is an important topic in the current tracking field. The task is to guide the tracker to track objects that match a language description. Current research mainly focuses on referring multi-object tracking in the single-view setting, which refers to a view sequence or multiple unrelated view sequences. However, in a single view, parts of an object's appearance are often not visible, resulting in incorrect matching of objects with the language description. In this work, we propose a new task, called Cross-view Referring Multi-Object Tracking (CRMOT). It introduces the cross-view setting to obtain the appearances of objects from multiple views, avoiding the problem of invisible object appearances in the RMOT task. CRMOT is a more challenging task that requires accurately tracking the objects that match the language description while maintaining the identity consistency of objects across views. To advance the CRMOT task, we construct a cross-view referring multi-object tracking benchmark based on the CAMPUS and DIVOTrack datasets, named CRTrack. Specifically, it provides 13 different scenes and 221 language descriptions. Furthermore, we propose an end-to-end cross-view referring multi-object tracking method, named CRTracker. Extensive experiments on the CRTrack benchmark verify the effectiveness of our method. 0.788 The dataset and code are available at https://github.com/chen-si-jia/CRMOT.

+

The availability of challenging simulation environments is pivotal for advancing the field of Multi-Agent Reinforcement Learning (MARL). In cooperative MARL settings, the StarCraft Multi-Agent Challenge (SMAC) has gained prominence as a benchmark for algorithms following centralized training with decentralized execution paradigm. However, with continual advancements in SMAC, many algorithms now exhibit near-optimal performance, complicating the evaluation of their true effectiveness. 0.658 To alleviate this problem, in this work, we highlight a critical issue: the default opponent policy in these environments lacks sufficient diversity, leading MARL algorithms to overfit and exploit unintended vulnerabilities rather than learning robust strategies. To overcome these limitations, we propose SMAC-HARD, a novel benchmark designed to enhance training robustness and evaluation comprehensiveness. SMAC-HARD supports customizable opponent strategies, randomization of adversarial policies, and interfaces for MARL self-play, enabling agents to generalize to varying opponent behaviors and improve model stability. Furthermore, we introduce a black-box testing framework wherein agents are trained without exposure to the edited opponent scripts but are tested against these scripts to evaluate the policy coverage and adaptability of MARL algorithms. We conduct extensive evaluations of widely used and state-of-the-art algorithms on SMAC-HARD, revealing the substantial challenges posed by edited and mixed strategy opponents. Additionally, the black-box strategy tests illustrate the difficulty of transferring learned policies to unseen adversaries. We envision SMAC-HARD as a critical step toward benchmarking the next generation of MARL algorithms, fostering progress in self-play methods for multi-agent systems. Our code is available at https://github.com/devindeng94/smac-hard.

- + link

@@ -1228,20 +1316,20 @@

Benchmarks

-

2024-12-19

+

2024-12-23

- Contiguous Boundary Guarding + Knowledge Editing through Chain-of-Thought

-

We study the problem of guarding the boundary of a simple polygon with a minimum number of guards such that each guard covers a contiguous portion of the boundary. First, we present a simple greedy algorithm for this problem that returns a guard set of size at most OPT + 1, where OPT is the number of guards in an optimal solution. Then, we present a polynomial-time exact algorithm. While the algorithm is not complicated, its correctness proof is rather involved. 0.604 This result is interesting in the sense that guarding problems are typically NP-hard and, in particular, it is NP-hard to minimize the number of guards to see the boundary of a simple polygon, without the contiguous boundary guarding constraint. From the combinatorial point of view, we show that any $n$-vertex polygon can be guarded by at most $\lfloor \frac{n-2}{2}\rfloor$ guards. This bound is tight because there are polygons that require this many guards.

+

Large Language Models (LLMs) have demonstrated exceptional capabilities across a wide range of natural language processing (NLP) tasks. However, keeping these models up-to-date with evolving world knowledge remains a significant challenge due to the high costs of frequent retraining. To address this challenge, knowledge editing techniques have emerged to update LLMs with new information without rebuilding the model from scratch. Among these, the in-context editing paradigm stands out for its effectiveness in integrating new knowledge while preserving the model's original capabilities. Despite its potential, existing in-context knowledge editing methods are often task-specific, focusing primarily on multi-hop QA tasks using structured knowledge triples. Moreover, their reliance on few-shot prompting for task decomposition makes them unstable and less effective in generalizing across diverse tasks. In response to these limitations, we propose EditCoT, a novel knowledge editing framework that flexibly and efficiently updates LLMs across various tasks without retraining. EditCoT works by generating a chain-of-thought (CoT) for a given input and then iteratively refining this CoT process using a CoT editor based on updated knowledge. We evaluate EditCoT across a diverse range of benchmarks, covering multiple languages and tasks. The results demonstrate that our approach achieves state-of-the-art performance while offering superior generalization, effectiveness, and stability compared to existing methods, marking a significant advancement in the field of knowledge updating. 0.636 Code and data are available at: https://github.com/bebr2/EditCoT.

- + link

@@ -1250,20 +1338,20 @@

Benchmarks

-

2024-12-19

+

2024-12-23

- Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search + RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation

-

In the realm of Text-Based Person Search (TBPS), mainstream methods aim to explore more efficient interaction frameworks between text descriptions and visual data. However, recent approaches encounter two principal challenges. Firstly, the widely used random-based Masked Language Modeling (MLM) considers all the words in the text equally during training. However, masking the many semantically vacuous words ('with', 'the', etc.) fails to contribute efficient interaction in the cross-modal MLM and hampers the representation alignment. Secondly, manual descriptions in TBPS datasets are tedious and inevitably contain several inaccuracies. To address these issues, we introduce an Attention-Guided Alignment (AGA) framework featuring two innovative components: Attention-Guided Mask (AGM) Modeling and Text Enrichment Module (TEM). AGM dynamically masks semantically meaningful words by aggregating the attention weight derived from the text encoding process, so that cross-modal MLM can capture information related to the masked word from text context and images and align their representations. Meanwhile, TEM alleviates low-quality representations caused by repetitive and erroneous text descriptions by replacing those semantically meaningful words with MLM's prediction. It not only enriches text descriptions but also prevents overfitting. Extensive experiments across three challenging benchmarks demonstrate the effectiveness of our AGA, achieving new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively. 0.679

+

Repository-level code translation refers to translating an entire code repository from one programming language to another while preserving the functionality of the source repository. Many benchmarks have been proposed to evaluate the performance of such code translators. 0.612 However, previous benchmarks mostly provide fine-grained samples, focusing at either code snippet, function, or file-level code translation. 0.685 Such benchmarks do not accurately reflect real-world demands, where entire repositories often need to be translated, involving longer code length and more complex functionalities. To address this gap, we propose a new benchmark, named RepoTransBench, which is a real-world repository-level code translation benchmark with an automatically executable test suite. We conduct experiments on RepoTransBench to evaluate the translation performance of 11 advanced LLMs. We find that the Success@1 score (test success in one attempt) of the best-performing LLM is only 7.33%. To further explore the potential of LLMs for repository-level code translation, we provide LLMs with error-related feedback to perform iterative debugging and observe an average 7.09% improvement on Success@1. However, even with this improvement, the Success@1 score of the best-performing LLM is only 21%, which may not meet the need for reliable automatic repository-level code translation. Finally, we conduct a detailed error analysis and highlight current LLMs' deficiencies in repository-level code translation, which could provide a reference for further improvements.

- + link

@@ -1272,20 +1360,20 @@

Benchmarks

-

2024-12-19

+

2024-12-23

- Solving the all pairs shortest path problem after minor update of a large dense graph + Group Testing with General Correlation Using Hypergraphs

-

The all pairs shortest path problem is a fundamental optimization problem in graph theory. We deal with re-calculating the all-pairs shortest path (APSP) matrix after a minor modification of a weighted dense graph, e.g., adding a node, removing a node, or updating an edge. We assume the APSP matrix for the original graph is already known. The graph can be directed or undirected. A cold-start calculation of the new APSP matrix by traditional algorithms, like the Floyd-Warshall algorithm or Dijkstra's algorithm, needs $ O(n^3) $ time. We propose two algorithms for warm-start calculation of the new APSP matrix. The best case complexity for a warm-start calculation is $ O(n^2) $, the worst case complexity is $ O(n^3) $. 0.64 We implemented the algorithms and tested their performance with experiments. 0.725 The result shows a warm-start calculation can save a great portion of calculation time, compared with cold-start calculation. 0.685

+

Group testing, a problem with diverse applications across multiple disciplines, traditionally assumes independence across nodes' states. Recent research, however, focuses on real-world scenarios that often involve correlations among nodes, challenging the simplifying assumptions made in existing models. In this work, we consider a comprehensive model for arbitrary statistical correlation among nodes' states. To capture and leverage these correlations effectively, we model the problem by hypergraphs, inspired by [GLS22], augmented by a probability mass function on the hyper-edges. Using this model, we first design a novel greedy adaptive algorithm capable of conducting informative tests and dynamically updating the distribution. Performance analysis provides upper bounds on the number of tests required, which depend solely on the entropy of the underlying probability distribution and the average number of infections. 0.625 We demonstrate that the algorithm recovers or improves upon all previously known results for group testing settings with correlation. Additionally, we provide families of graphs where the algorithm is order-wise optimal and give examples where the algorithm or its analysis is not tight. We then generalize the proposed framework of group testing with general correlation in two directions, namely noisy group testing and semi-non-adaptive group testing. In both settings, we provide novel theoretical bounds on the number of tests required.

- + link

@@ -1294,20 +1382,20 @@

Benchmarks

-

2024-12-19

+

2024-12-23

- Efficient Ranking, Order Statistics, and Sorting under CKKS + In Case You Missed It: ARC 'Challenge' Is Not That Challenging

-

Fully Homomorphic Encryption (FHE) enables operations on encrypted data, making it extremely useful for privacy-preserving applications, especially in cloud computing environments. In such contexts, operations like ranking, order statistics, and sorting are fundamental functionalities often required for database queries or as building blocks of larger protocols. However, the high computational overhead and limited native operations of FHE pose significant challenges for an efficient implementation of these tasks. These challenges are exacerbated by the fact that all these functionalities are based on comparing elements, which is a severely expensive operation under encryption. Previous solutions have typically based their designs on swap-based techniques, where two elements are conditionally swapped based on the results of their comparison. 0.611 These methods aim to reduce the primary computational bottleneck: the comparison depth, which is the number of non-parallelizable homomorphic comparisons. The current state of the art solution for sorting by Lu et al. (IEEE S&P'21), for instance, achieves a comparison depth of O(log^2(N)). 0.671 In this paper, we address the challenge of reducing the comparison depth by shifting away from the swap-based paradigm. We present solutions for ranking, order statistics, and sorting, that all achieve a comparison depth of O(1), making our approach highly parallelizable. 0.65 Leveraging the SIMD capabilities of the CKKS FHE scheme, our approach re-encodes the input vector under encryption to allow for simultaneous comparisons of all elements with each other. The homomorphic re-encoding incurs a minimal computational overhead of O(log(N)) rotations. Experimental results show that our approach ranks a 128-element vector in approximately 2.64s, computes its argmin/argmax in 14.18s, and sorts it in 21.10s.

+

ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). 0.605 In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.
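The two evaluation setups contrasted in this abstract can be illustrated with a small sketch. The prompt templates and the function names `separate_prompts` and `joint_prompt` are assumptions made for illustration only; they are not taken from the paper or from any evaluation harness. In the first setup each answer choice is scored in isolation, so the model never sees the alternatives; in the second, all choices appear in one prompt and can be compared directly.

```python
# Hypothetical prompt builders; the templates are illustrative assumptions.

def separate_prompts(question, choices):
    """One prompt per choice: the model scores each answer without seeing the others."""
    return [f"Question: {question}\nAnswer: {choice}" for choice in choices]


def joint_prompt(question, choices):
    """A single prompt listing every choice, so the options can be compared directly."""
    labels = "ABCDEFGH"
    lines = [f"Question: {question}"]
    lines += [f"{labels[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


question = "Which object is the best conductor of electricity?"
choices = ["a wooden spoon", "a copper wire", "a rubber band"]

isolated = separate_prompts(question, choices)  # three independent prompts
combined = joint_prompt(question, choices)      # one prompt with options A, B, C
```

In the isolated setup a scorer would typically compare the likelihoods of the three completions, whereas in the joint setup it would read off the predicted option letter.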

- + link

@@ -1316,20 +1404,20 @@

Benchmarks

-

2024-12-19

+

2024-12-23

- Adaptive Pruning for Large Language Models with Structural Importance Awareness + Large Motion Video Autoencoding with Cross-modal Video VAE

-

The recent advancements in large language models (LLMs) have significantly improved language understanding and generation capabilities. However, it is difficult to deploy LLMs on resource-constrained edge devices due to their high computational and storage resource demands. To address this issue, we propose a novel LLM model pruning method, namely structurally-aware adaptive pruning (SAAP), to significantly reduce the computational and memory costs while maintaining model performance. We first define an adaptive importance fusion metric to evaluate the importance of all coupled structures in LLMs by considering their homoscedastic uncertainty. Then, we rank the importance of all modules to determine the specific layers that should be pruned to meet particular performance requirements. Furthermore, we develop a new group fine-tuning strategy to improve the inference efficiency of LLMs. Finally, we evaluate the proposed SAAP method on multiple LLMs across two common tasks, i.e., zero-shot classification and text generation. Experimental results show that our SAAP method outperforms several state-of-the-art baseline methods, achieving 2.17%, 2.37%, and 2.39% accuracy gains on LLaMA-7B, Vicuna-7B, and LLaMA-13B. 0.738 Additionally, SAAP improves the token generation speed by 5%, showcasing its practical advantages in resource-constrained scenarios.

+

Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding. Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. 0.825 The project website can be found at https://yzxing87.github.io/vae/.

- + link

@@ -1338,20 +1426,20 @@

Benchmarks

-

2024-12-19

+

2024-12-23

- Rethinking Uncertainty Estimation in Natural Language Generation + Cross-View Referring Multi-Object Tracking

-

Large Language Models (LLMs) are increasingly employed in real-world applications, driving the need to evaluate the trustworthiness of their generated text. To this end, reliable uncertainty estimation is essential. Since current LLMs generate text autoregressively through a stochastic process, the same prompt can lead to varying outputs. Consequently, leading uncertainty estimation methods generate and analyze multiple output sequences to determine the LLM's uncertainty. However, generating output sequences is computationally expensive, making these methods impractical at scale. In this work, we inspect the theoretical foundations of the leading methods and explore new directions to enhance their computational efficiency. 0.633 Building on the framework of proper scoring rules, we find that the negative log-likelihood of the most likely output sequence constitutes a theoretically grounded uncertainty measure. To approximate this alternative measure, we propose G-NLL, which has the advantage of being obtained using only a single output sequence generated by greedy decoding. This makes uncertainty estimation more efficient and straightforward, while preserving theoretical rigor. Empirical results demonstrate that G-NLL achieves state-of-the-art performance across various LLMs and tasks. Our work lays the foundation for efficient and reliable uncertainty estimation in natural language generation, challenging the necessity of more computationally involved methods currently leading the field.

+

Referring Multi-Object Tracking (RMOT) is an important topic in the current tracking field. Its task form is to guide the tracker to track objects that match the language description. Current research mainly focuses on referring multi-object tracking under single-view, which refers to a view sequence or multiple unrelated view sequences. However, in the single-view, some appearances of objects are easily invisible, resulting in incorrect matching of objects with the language description. In this work, we propose a new task, called Cross-view Referring Multi-Object Tracking (CRMOT). It introduces the cross-view to obtain the appearances of objects from multiple views, avoiding the problem of the invisible appearances of objects in RMOT task. CRMOT is a more challenging task of accurately tracking the objects that match the language description and maintaining the identity consistency of objects in each cross-view. To advance CRMOT task, we construct a cross-view referring multi-object tracking benchmark based on CAMPUS and DIVOTrack datasets, named CRTrack. Specifically, it provides 13 different scenes and 221 language descriptions. Furthermore, we propose an end-to-end cross-view referring multi-object tracking method, named CRTracker. Extensive experiments on the CRTrack benchmark verify the effectiveness of our method. 0.788 The dataset and code are available at https://github.com/chen-si-jia/CRMOT.

- + link

@@ -1365,15 +1453,15 @@

Benchmarks

- Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings + Contiguous Boundary Guarding

-

Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, under more realistic scenarios, RAG-based methods for the generation of verdicts - i.e., short texts discussing the veracity of a claim - evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases. 0.617 Our findings show a complex landscape, where, for example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases; larger models excel in verdict faithfulness, while smaller models provide better context adherence, with human evaluations favouring zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.

+

We study the problem of guarding the boundary of a simple polygon with a minimum number of guards such that each guard covers a contiguous portion of the boundary. First, we present a simple greedy algorithm for this problem that returns a guard set of size at most OPT + 1, where OPT is the number of guards in an optimal solution. Then, we present a polynomial-time exact algorithm. While the algorithm is not complicated, its correctness proof is rather involved. 0.604 This result is interesting in the sense that guarding problems are typically NP-hard and, in particular, it is NP-hard to minimize the number of guards to see the boundary of a simple polygon, without the contiguous boundary guarding constraint. From the combinatorial point of view, we show that any $n$-vertex polygon can be guarded by at most $\lfloor \frac{n-2}{2}\rfloor$ guards. This bound is tight because there are polygons that require this many guards.

- + link

@@ -1387,15 +1475,15 @@

Benchmarks

- MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark + Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search

-

Multiple-choice question (MCQ) datasets like Massive Multitask Language Understanding (MMLU) are widely used to evaluate the commonsense, understanding, and problem-solving abilities of large language models (LLMs). However, the open-source nature of these benchmarks and the broad sources of training data for LLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation results. To alleviate this issue, we propose a contamination-free and more challenging MCQ benchmark called MMLU-CF. This benchmark reassesses LLMs' understanding of world knowledge by averting both unintentional and malicious data leakage. To avoid unintentional data leakage, we source data from a broader domain and design three decontamination rules. To prevent malicious data leakage, we divide the benchmark into validation and test sets with similar difficulty and subject distributions. 0.605 The test set remains closed-source to ensure reliable results, while the validation set is publicly available to promote transparency and facilitate independent verification. Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves merely a 5-shot score of 73.4% and a 0-shot score of 71.9% on the test set, which indicates the effectiveness of our approach in creating a more rigorous and contamination-free evaluation standard. The GitHub repository is available at https://github.com/microsoft/MMLU-CF and the dataset refers to https://huggingface.co/datasets/microsoft/MMLU-CF.

+

In the realm of Text-Based Person Search (TBPS), mainstream methods aim to explore more efficient interaction frameworks between text descriptions and visual data. However, recent approaches encounter two principal challenges. Firstly, the widely used random-based Masked Language Modeling (MLM) considers all the words in the text equally during training. However, masking the many semantically vacuous words ('with', 'the', etc.) fails to contribute efficient interaction in the cross-modal MLM and hampers the representation alignment. Secondly, manual descriptions in TBPS datasets are tedious and inevitably contain several inaccuracies. To address these issues, we introduce an Attention-Guided Alignment (AGA) framework featuring two innovative components: Attention-Guided Mask (AGM) Modeling and Text Enrichment Module (TEM). AGM dynamically masks semantically meaningful words by aggregating the attention weight derived from the text encoding process, so that cross-modal MLM can capture information related to the masked word from text context and images and align their representations. Meanwhile, TEM alleviates low-quality representations caused by repetitive and erroneous text descriptions by replacing those semantically meaningful words with MLM's prediction. It not only enriches text descriptions but also prevents overfitting. Extensive experiments across three challenging benchmarks demonstrate the effectiveness of our AGA, achieving new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively. 0.679

- + link

@@ -1409,15 +1497,15 @@

Benchmarks

- LiDAR-RT: Gaussian-based Ray Tracing for Dynamic LiDAR Re-simulation + Solving the all pairs shortest path problem after minor update of a large dense graph

-

This paper targets the challenge of real-time LiDAR re-simulation in dynamic driving scenarios. Recent approaches utilize neural radiance fields combined with the physical modeling of LiDAR sensors to achieve high-fidelity re-simulation results. Unfortunately, these methods face limitations due to high computational demands in large-scale scenes and cannot perform real-time LiDAR rendering. To overcome these constraints, we propose LiDAR-RT, a novel framework that supports real-time, physically accurate LiDAR re-simulation for driving scenes. Our primary contribution is the development of an efficient and effective rendering pipeline, which integrates Gaussian primitives and hardware-accelerated ray tracing technology. Specifically, we model the physical properties of LiDAR sensors using Gaussian primitives with learnable parameters and incorporate scene graphs to handle scene dynamics. Building upon this scene representation, our framework first constructs a bounding volume hierarchy (BVH), then casts rays for each pixel and generates novel LiDAR views through a differentiable rendering algorithm. Importantly, our framework supports realistic rendering with flexible scene editing operations and various sensor configurations. Extensive experiments across multiple public benchmarks demonstrate that our method outperforms state-of-the-art methods in terms of rendering quality and efficiency. 0.708 Our project page is at https://zju3dv.github.io/lidar-rt.

+

The all pairs shortest path problem is a fundamental optimization problem in graph theory. We deal with re-calculating the all-pairs shortest path (APSP) matrix after a minor modification of a weighted dense graph, e.g., adding a node, removing a node, or updating an edge. We assume the APSP matrix for the original graph is already known. The graph can be directed or undirected. A cold-start calculation of the new APSP matrix by traditional algorithms, like the Floyd-Warshall algorithm or Dijkstra's algorithm, needs $ O(n^3) $ time. We propose two algorithms for warm-start calculation of the new APSP matrix. The best case complexity for a warm-start calculation is $ O(n^2) $, the worst case complexity is $ O(n^3) $. 0.64 We implemented the algorithms and tested their performance with experiments. 0.725 The result shows a warm-start calculation can save a great portion of calculation time, compared with cold-start calculation. 0.685
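To make the cold-start versus warm-start distinction concrete, here is a minimal sketch of one textbook case: an edge weight is lowered (or a new edge is added) in a directed graph whose APSP matrix is already known. This is the standard $ O(n^2) $ update rule, not necessarily either of the two algorithms proposed in the paper; the function names `floyd_warshall` and `decrease_edge` and the sample graph are assumptions made for illustration.

```python
import math

INF = math.inf


def floyd_warshall(weights):
    """Cold start: compute the full APSP matrix in O(n^3)."""
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def decrease_edge(dist, u, v, w):
    """Warm start: edge (u, v) now has weight w, no larger than before.

    Any path that improves must use the new edge exactly once, so each
    entry is updated as min(old, dist[i][u] + w + dist[v][j]) in O(n^2).
    """
    n = len(dist)
    new = [row[:] for row in dist]
    if w >= dist[u][v]:
        return new  # the cheaper edge cannot shorten any path
    for i in range(n):
        via_u = dist[i][u] + w
        for j in range(n):
            candidate = via_u + dist[v][j]
            if candidate < new[i][j]:
                new[i][j] = candidate
    return new


# Illustrative directed 4-cycle: 0 -> 1 -> 2 -> 3 -> 0.
weights = [
    [0, 4, INF, INF],
    [INF, 0, 3, INF],
    [INF, INF, 0, 2],
    [1, INF, INF, 0],
]
dist = floyd_warshall(weights)

# Add a shortcut edge 0 -> 2 with weight 1 and update from the known matrix.
warm = decrease_edge(dist, 0, 2, 1)

# Reference result: recompute everything from scratch on the modified graph.
modified = [row[:] for row in weights]
modified[0][2] = 1
cold = floyd_warshall(modified)
```

Increasing an edge weight or removing a node is harder, because existing shortest paths may be invalidated; that is one plausible reason a warm-start method can degrade toward $ O(n^3) $ in the worst case.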

- + link

@@ -1431,15 +1519,15 @@

Benchmarks

- AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving + Efficient Ranking, Order Statistics, and Sorting under CKKS

-

Recent advancements in large vision language models (VLMs) tailored for autonomous driving (AD) have shown strong scene understanding and reasoning capabilities, making them undeniable candidates for end-to-end driving systems. However, limited work exists on studying the trustworthiness of DriveVLMs -- a critical factor that directly impacts public transportation safety. In this paper, we introduce AutoTrust, a comprehensive trustworthiness benchmark for large vision-language models in autonomous driving (DriveVLMs), considering diverse perspectives -- including trustfulness, safety, robustness, privacy, and fairness. We constructed the largest visual question-answering dataset for investigating trustworthiness issues in driving scenarios, comprising over 10k unique scenes and 18k queries. We evaluated six publicly available VLMs, spanning from generalist to specialist, from open-source to commercial models. Our exhaustive evaluations have unveiled previously undiscovered vulnerabilities of DriveVLMs to trustworthiness threats. Specifically, we found that the general VLMs like LLaVA-v1.6 and GPT-4o-mini surprisingly outperform specialized models fine-tuned for driving in terms of overall trustworthiness. DriveVLMs like DriveLM-Agent are particularly vulnerable to disclosing sensitive information. Additionally, both generalist and specialist VLMs remain susceptible to adversarial attacks and struggle to ensure unbiased decision-making across diverse environments and populations. Our findings call for immediate and decisive action to address the trustworthiness of DriveVLMs -- an issue of critical importance to public safety and the welfare of all citizens relying on autonomous transportation systems. Our benchmark is publicly available at https://github.com/taco-group/AutoTrust, and the leaderboard is released at https://taco-group.github.io/AutoTrust/. 0.613

+

Fully Homomorphic Encryption (FHE) enables operations on encrypted data, making it extremely useful for privacy-preserving applications, especially in cloud computing environments. In such contexts, operations like ranking, order statistics, and sorting are fundamental functionalities often required for database queries or as building blocks of larger protocols. However, the high computational overhead and limited native operations of FHE pose significant challenges for an efficient implementation of these tasks. These challenges are exacerbated by the fact that all these functionalities are based on comparing elements, which is a severely expensive operation under encryption. Previous solutions have typically based their designs on swap-based techniques, where two elements are conditionally swapped based on the results of their comparison. 0.611 These methods aim to reduce the primary computational bottleneck: the comparison depth, which is the number of non-parallelizable homomorphic comparisons. The current state of the art solution for sorting by Lu et al. (IEEE S&P'21), for instance, achieves a comparison depth of O(log^2(N)). 0.671 In this paper, we address the challenge of reducing the comparison depth by shifting away from the swap-based paradigm. We present solutions for ranking, order statistics, and sorting, that all achieve a comparison depth of O(1), making our approach highly parallelizable. 0.65 Leveraging the SIMD capabilities of the CKKS FHE scheme, our approach re-encodes the input vector under encryption to allow for simultaneous comparisons of all elements with each other. The homomorphic re-encoding incurs a minimal computational overhead of O(log(N)) rotations. Experimental results show that our approach ranks a 128-element vector in approximately 2.64s, computes its argmin/argmax in 14.18s, and sorts it in 21.10s.
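The idea of replacing conditional swaps with simultaneous all-pairs comparisons can be sketched on plaintext data. The code below is only an unencrypted illustration, assuming distinct input values; it uses no FHE library, and the function names `ranks_by_all_pairs` and `sort_by_rank` are hypothetical. Every comparison is independent of the others, which is why the comparison depth is a single layer; under CKKS those comparisons would be approximated homomorphically and packed into SIMD slots.

```python
# Plaintext illustration only: no encryption is performed here.

def ranks_by_all_pairs(values):
    """Rank of values[i] = number of elements strictly smaller than it.

    All n * n comparisons are mutually independent, so they form one
    parallel layer (comparison depth 1). Assumes distinct values.
    """
    n = len(values)
    smaller = [
        [1 if values[j] < values[i] else 0 for j in range(n)]
        for i in range(n)
    ]
    return [sum(row) for row in smaller]


def sort_by_rank(values):
    """Place each element at the position given by its rank (no swaps)."""
    ranks = ranks_by_all_pairs(values)
    out = [None] * len(values)
    for value, rank in zip(values, ranks):
        out[rank] = value
    return out


data = [7, 2, 9, 4]
ranks = ranks_by_all_pairs(data)     # [2, 0, 3, 1]
ordered = sort_by_rank(data)         # [2, 4, 7, 9]
argmin_index = ranks.index(0)        # position of the smallest element
```

The trade-off is quadratic work for constant depth, which suits a setting where sequential comparisons are the dominant cost.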

- + link

@@ -1453,15 +1541,15 @@

Benchmarks

- PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation + Adaptive Pruning for Large Language Models with Structural Importance Awareness

-

Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate on single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by $25.3\%$. To support training and evaluation, we curate $M^4Seg$, a new reasoning segmentation benchmark consisting of $\sim$224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate PRIMA outperforms state-of-the-art baselines. 0.759

+

The recent advancements in large language models (LLMs) have significantly improved language understanding and generation capabilities. However, it is difficult to deploy LLMs on resource-constrained edge devices due to their high computational and storage resource demands. To address this issue, we propose a novel LLM model pruning method, namely structurally-aware adaptive pruning (SAAP), to significantly reduce the computational and memory costs while maintaining model performance. We first define an adaptive importance fusion metric to evaluate the importance of all coupled structures in LLMs by considering their homoscedastic uncertainty. Then, we rank the importance of all modules to determine the specific layers that should be pruned to meet particular performance requirements. Furthermore, we develop a new group fine-tuning strategy to improve the inference efficiency of LLMs. Finally, we evaluate the proposed SAAP method on multiple LLMs across two common tasks, i.e., zero-shot classification and text generation. Experimental results show that our SAAP method outperforms several state-of-the-art baseline methods, achieving 2.17%, 2.37%, and 2.39% accuracy gains on LLaMA-7B, Vicuna-7B, and LLaMA-13B. 0.738 Additionally, SAAP improves the token generation speed by 5%, showcasing its practical advantages in resource-constrained scenarios.

- + link

@@ -1475,81 +1563,15 @@

Benchmarks

- UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency - -
-
-

-

We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training. Existing supervised methods depend on datasets containing triplets of input image, edited image, and edit instruction. These are generated by either existing editing methods or human annotations, which introduce biases and limit their generalization ability. Our method addresses these challenges by introducing a novel editing mechanism called Cycle Edit Consistency (CEC), which applies forward and backward edits in one training step and enforces consistency in image and attention spaces. This allows us to bypass the need for ground-truth edited images and unlock training for the first time on datasets comprising either real image-caption pairs or image-caption-edit triplets. We empirically show that our unsupervised technique performs better across a broader range of edits with high fidelity and precision. 0.624 By eliminating the need for pre-existing datasets of triplets, reducing biases associated with supervised methods, and proposing CEC, our work represents a significant advancement in unblocking scaling of instruction-based image editing.

-

-

- - link - -

-
-
- - - -

2024-12-18

- - -
- - Modality-Independent Graph Neural Networks with Global Transformers for Multimodal Recommendation - -
-
-

-

Multimodal recommendation systems can learn users' preferences from existing user-item interactions as well as the semantics of multimodal data associated with items. Many existing methods model this through a multimodal user-item graph, approaching multimodal recommendation as a graph learning task. Graph Neural Networks (GNNs) have shown promising performance in this domain. Prior research has capitalized on GNNs' capability to capture neighborhood information within certain receptive fields (typically denoted by the number of hops, $K$) to enrich user and item semantics. We observe that the optimal receptive fields for GNNs can vary across different modalities. In this paper, we propose GNNs with Modality-Independent Receptive Fields, which employ separate GNNs with independent receptive fields for different modalities to enhance performance. Our results indicate that the optimal $K$ for certain modalities on specific datasets can be as low as 1 or 2, which may restrict the GNNs' capacity to capture global information. To address this, we introduce a Sampling-based Global Transformer, which utilizes uniform global sampling to effectively integrate global information for GNNs. We conduct comprehensive experiments that demonstrate the superiority of our approach over existing methods. 0.631 Our code is publicly available at https://github.com/CrawlScript/MIG-GT.

-

-

- - link - -

-
-
- - - -

2024-12-18

- - -
- - Discovering maximally consistent distribution of causal tournaments with Large Language Models - -
-
-

-

Causal discovery is essential for understanding complex systems, yet traditional methods often depend on strong, untestable assumptions, making the process challenging. Large Language Models (LLMs) present a promising alternative for extracting causal insights from text-based metadata, which consolidates domain expertise. However, LLMs are prone to unreliability and hallucinations, necessitating strategies that account for their limitations. One such strategy involves leveraging a consistency measure to evaluate reliability. 0.6 Additionally, most text metadata does not clearly distinguish direct causal relationships from indirect ones, further complicating the inference of causal graphs. As a result, focusing on causal orderings, rather than causal graphs, emerges as a more practical and robust approach. We propose a novel method to derive a distribution of acyclic tournaments (representing plausible causal orders) that maximizes a consistency score. Our approach begins by computing pairwise consistency scores between variables, yielding a cyclic tournament that aggregates these scores. From this structure, we identify optimal acyclic tournaments compatible with the original tournament, prioritizing those that maximize consistency across all configurations. We tested our method on both classical and well-established benchmarks, as well as real-world datasets from epidemiology and public health. Our results demonstrate the effectiveness of our approach in recovering distributions of causal orders with minimal error.

-

-

- - link - -

-
-
- - - -

2024-12-18

- - -
- - A Computationally Grounded Framework for Cognitive Attitudes (extended version) + Rethinking Uncertainty Estimation in Natural Language Generation

-

We introduce a novel language for reasoning about agents' cognitive attitudes of both epistemic and motivational type. We interpret it by means of a computationally grounded semantics using belief bases. Our language includes five types of modal operators for implicit belief, complete attraction, complete repulsion, realistic attraction and realistic repulsion. We give an axiomatization and show that our operators are not mutually expressible and that they can be combined to represent a large variety of psychological concepts including ambivalence, indifference, being motivated, being demotivated and preference. We present a dynamic extension of the language that supports reasoning about the effects of belief change operations. Finally, we provide a succinct formulation of model checking for our languages and a PSPACE model checking algorithm relying on a reduction into TQBF. We present some experimental results for the implemented algorithm on computation time in a concrete example. 0.631

+

Large Language Models (LLMs) are increasingly employed in real-world applications, driving the need to evaluate the trustworthiness of their generated text. To this end, reliable uncertainty estimation is essential. Since current LLMs generate text autoregressively through a stochastic process, the same prompt can lead to varying outputs. Consequently, leading uncertainty estimation methods generate and analyze multiple output sequences to determine the LLM's uncertainty. However, generating output sequences is computationally expensive, making these methods impractical at scale. In this work, we inspect the theoretical foundations of the leading methods and explore new directions to enhance their computational efficiency. 0.633 Building on the framework of proper scoring rules, we find that the negative log-likelihood of the most likely output sequence constitutes a theoretically grounded uncertainty measure. To approximate this alternative measure, we propose G-NLL, which has the advantage of being obtained using only a single output sequence generated by greedy decoding. This makes uncertainty estimation more efficient and straightforward, while preserving theoretical rigor. Empirical results demonstrate that G-NLL achieves state-of-the-art performance across various LLMs and tasks. Our work lays the foundation for efficient and reliable uncertainty estimation in natural language generation, challenging the necessity of more computationally involved methods currently leading the field.

- + link

@@ -1558,20 +1580,20 @@

Benchmarks

-

2024-12-18

+

2024-12-19

- Online MDP with Transition Prototypes: A Robust Adaptive Approach + Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings

-

In this work, we consider an online robust Markov Decision Process (MDP) where we have the information of finitely many prototypes of the underlying transition kernel. We consider an adaptively updated ambiguity set of the prototypes and propose an algorithm that efficiently identifies the true underlying transition kernel while guaranteeing the performance of the corresponding robust policy. To be more specific, we provide a sublinear regret of the subsequent optimal robust policy. We also provide an early stopping mechanism and a worst-case performance bound of the value function. In numerical experiments, we demonstrate that our method outperforms existing approaches, particularly in the early stage with limited data. 0.681 This work contributes to robust MDPs by considering possible prior information about the underlying transition probability and online learning, offering both theoretical insights and practical algorithms for improved decision-making under uncertainty.

+

Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, under more realistic scenarios, RAG-based methods for the generation of verdicts - i.e., short texts discussing the veracity of a claim - evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases. 0.617 Our findings show a complex landscape, where, for example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases; larger models excel in verdict faithfulness, while smaller models provide better context adherence, with human evaluations favouring zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.

- + link

@@ -1580,20 +1602,20 @@

Benchmarks

-

2024-12-18

+

2024-12-19

- On the Use of Abundant Road Speed Data for Travel Demand Calibration of Urban Traffic Simulators + MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark

-

This work develops a compute-efficient algorithm to tackle a fundamental problem in transportation: that of urban travel demand estimation. It focuses on the calibration of origin-destination travel demand input parameters for high-resolution traffic simulation models. It considers the use of abundant traffic road speed data. The travel demand calibration problem is formulated as a continuous, high-dimensional, simulation-based optimization (SO) problem with bound constraints. There is a lack of compute efficient algorithms to tackle this problem. We propose the use of an SO algorithm that relies on an efficient, analytical, differentiable, physics-based traffic model, known as a metamodel or surrogate model. We formulate a metamodel that enables the use of road speed data. Tests are performed on a Salt Lake City network. We study how the amount of data, as well as the congestion levels, impact both in-sample and out-of-sample performance. The proposed method outperforms the benchmark for both in-sample and out-of-sample performance by 84.4% and 72.2% in terms of speeds and counts, respectively. 0.729 Most importantly, the proposed method yields the highest compute efficiency, identifying solutions with good performance within few simulation function evaluations (i.e., with small samples). 0.693

+

Multiple-choice question (MCQ) datasets like Massive Multitask Language Understanding (MMLU) are widely used to evaluate the commonsense, understanding, and problem-solving abilities of large language models (LLMs). However, the open-source nature of these benchmarks and the broad sources of training data for LLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation results. To alleviate this issue, we propose a contamination-free and more challenging MCQ benchmark called MMLU-CF. This benchmark reassesses LLMs' understanding of world knowledge by averting both unintentional and malicious data leakage. To avoid unintentional data leakage, we source data from a broader domain and design three decontamination rules. To prevent malicious data leakage, we divide the benchmark into validation and test sets with similar difficulty and subject distributions. 0.605 The test set remains closed-source to ensure reliable results, while the validation set is publicly available to promote transparency and facilitate independent verification. Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves merely a 5-shot score of 73.4% and a 0-shot score of 71.9% on the test set, which indicates the effectiveness of our approach in creating a more rigorous and contamination-free evaluation standard. The GitHub repository is available at https://github.com/microsoft/MMLU-CF and the dataset is available at https://huggingface.co/datasets/microsoft/MMLU-CF.

- + link

@@ -1602,20 +1624,20 @@

Benchmarks

-

2024-12-18

+

2024-12-19

- On Calibration in Multi-Distribution Learning + LiDAR-RT: Gaussian-based Ray Tracing for Dynamic LiDAR Re-simulation

-

Modern challenges of robustness, fairness, and decision-making in machine learning have led to the formulation of multi-distribution learning (MDL) frameworks in which a predictor is optimized across multiple distributions. We study the calibration properties of MDL to better understand how the predictor performs uniformly across the multiple distributions. Through classical results on decomposing proper scoring losses, we first derive the Bayes optimal rule for MDL, demonstrating that it maximizes the generalized entropy of the associated loss function. Our analysis reveals that while this approach ensures minimal worst-case loss, it can lead to non-uniform calibration errors across the multiple distributions and there is an inherent calibration-refinement trade-off, even at Bayes optimality. 0.615 Our results highlight a critical limitation: despite the promise of MDL, one must use caution when designing predictors tailored to multiple distributions so as to minimize disparity.

+

This paper targets the challenge of real-time LiDAR re-simulation in dynamic driving scenarios. Recent approaches utilize neural radiance fields combined with the physical modeling of LiDAR sensors to achieve high-fidelity re-simulation results. Unfortunately, these methods face limitations due to high computational demands in large-scale scenes and cannot perform real-time LiDAR rendering. To overcome these constraints, we propose LiDAR-RT, a novel framework that supports real-time, physically accurate LiDAR re-simulation for driving scenes. Our primary contribution is the development of an efficient and effective rendering pipeline, which integrates Gaussian primitives and hardware-accelerated ray tracing technology. Specifically, we model the physical properties of LiDAR sensors using Gaussian primitives with learnable parameters and incorporate scene graphs to handle scene dynamics. Building upon this scene representation, our framework first constructs a bounding volume hierarchy (BVH), then casts rays for each pixel and generates novel LiDAR views through a differentiable rendering algorithm. Importantly, our framework supports realistic rendering with flexible scene editing operations and various sensor configurations. Extensive experiments across multiple public benchmarks demonstrate that our method outperforms state-of-the-art methods in terms of rendering quality and efficiency. 0.708 Our project page is at https://zju3dv.github.io/lidar-rt.

- + link

@@ -1624,20 +1646,20 @@

Benchmarks

-

2024-12-17

+

2024-12-19

- OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain + AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving

-

As a typical and practical application of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. 0.657 Our benchmark is characterized by its multi-dimensional evaluation framework, including (1) a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; (2) a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations on generated instances; (3) a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation on the RAG pipeline; and (4) robust evaluation metrics derived from rule-based and LLM-based ones, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets and highlights the performance variations of RAG systems across diverse topics and tasks, revealing significant opportunities for RAG models to improve their capabilities in vertical domains. We open source the code of our benchmark at https://github.com/RUC-NLPIR/OmniEval. 0.715

+

Recent advancements in large vision language models (VLMs) tailored for autonomous driving (AD) have shown strong scene understanding and reasoning capabilities, making them undeniable candidates for end-to-end driving systems. However, limited work exists on studying the trustworthiness of DriveVLMs -- a critical factor that directly impacts public transportation safety. In this paper, we introduce AutoTrust, a comprehensive trustworthiness benchmark for large vision-language models in autonomous driving (DriveVLMs), considering diverse perspectives -- including trustfulness, safety, robustness, privacy, and fairness. We constructed the largest visual question-answering dataset for investigating trustworthiness issues in driving scenarios, comprising over 10k unique scenes and 18k queries. We evaluated six publicly available VLMs, spanning from generalist to specialist, from open-source to commercial models. Our exhaustive evaluations have unveiled previously undiscovered vulnerabilities of DriveVLMs to trustworthiness threats. Specifically, we found that the general VLMs like LLaVA-v1.6 and GPT-4o-mini surprisingly outperform specialized models fine-tuned for driving in terms of overall trustworthiness. DriveVLMs like DriveLM-Agent are particularly vulnerable to disclosing sensitive information. Additionally, both generalist and specialist VLMs remain susceptible to adversarial attacks and struggle to ensure unbiased decision-making across diverse environments and populations. Our findings call for immediate and decisive action to address the trustworthiness of DriveVLMs -- an issue of critical importance to public safety and the welfare of all citizens relying on autonomous transportation systems. Our benchmark is publicly available at https://github.com/taco-group/AutoTrust, and the leaderboard is released at https://taco-group.github.io/AutoTrust/. 0.613

- + link

@@ -1646,20 +1668,20 @@

Benchmarks

-

2024-12-17

+

2024-12-19

- Queries, Representation & Detection: The Next 100 Model Fingerprinting Schemes + PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

-

The deployment of machine learning models in operational contexts represents a significant investment for any organisation. Consequently, the risk of these models being misappropriated by competitors needs to be addressed. In recent years, numerous proposals have been put forth to detect instances of model stealing. However, these proposals operate under implicit and disparate data and model access assumptions; as a consequence, it remains unclear how they can be effectively compared to one another. Our evaluation shows that a simple baseline that we introduce performs on par with existing state-of-the-art fingerprints, which, on the other hand, are much more complex. 0.601 To uncover the reasons behind this intriguing result, this paper introduces a systematic approach to both the creation of model fingerprinting schemes and their evaluation benchmarks. By dividing model fingerprinting into three core components -- Query, Representation and Detection (QuRD) -- we are able to identify $\sim100$ previously unexplored QuRD combinations and gain insights into their performance. Finally, we introduce a set of metrics to compare and guide the creation of more representative model stealing detection benchmarks. Our approach reveals the need for more challenging benchmarks and a sound comparison with baselines. 0.709 To foster the creation of new fingerprinting schemes and benchmarks, we open-source our fingerprinting toolbox.

+

Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate on single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by $25.3\%$. To support training and evaluation, we curate $M^4Seg$, a new reasoning segmentation benchmark consisting of $\sim$224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate PRIMA outperforms state-of-the-art baselines. 0.759

- + link

@@ -1668,20 +1690,20 @@

Benchmarks

-

2024-12-17

+

2024-12-19

- TAME: Temporal Audio-based Mamba for Enhanced Drone Trajectory Estimation and Classification + UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency

-

The increasing prevalence of compact UAVs has introduced significant risks to public safety, while traditional drone detection systems are often bulky and costly. To address these challenges, we present TAME, the Temporal Audio-based Mamba for Enhanced Drone Trajectory Estimation and Classification. This innovative anti-UAV detection model leverages a parallel selective state-space model to simultaneously capture and learn both the temporal and spectral features of audio, effectively analyzing propagation of sound. To further enhance temporal features, we introduce a Temporal Feature Enhancement Module, which integrates spectral features into temporal data using residual cross-attention. This enhanced temporal information is then employed for precise 3D trajectory estimation and classification. Our model sets a new standard of performance on the MMUAD benchmarks, demonstrating superior accuracy and effectiveness. 0.657 The code and trained models are publicly available on GitHub at https://github.com/AmazingDay1/TAME.

+

We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training. Existing supervised methods depend on datasets containing triplets of input image, edited image, and edit instruction. These are generated by either existing editing methods or human annotations, which introduce biases and limit their generalization ability. Our method addresses these challenges by introducing a novel editing mechanism called Cycle Edit Consistency (CEC), which applies forward and backward edits in one training step and enforces consistency in image and attention spaces. This allows us to bypass the need for ground-truth edited images and unlock training for the first time on datasets comprising either real image-caption pairs or image-caption-edit triplets. We empirically show that our unsupervised technique performs better across a broader range of edits with high fidelity and precision. 0.624 By eliminating the need for pre-existing datasets of triplets, reducing biases associated with supervised methods, and proposing CEC, our work represents a significant advancement in unblocking scaling of instruction-based image editing.

- + link

@@ -1690,20 +1712,20 @@

Benchmarks

-

2024-12-17

+

2024-12-18

- Identifying Bias in Deep Neural Networks Using Image Transforms + Modality-Independent Graph Neural Networks with Global Transformers for Multimodal Recommendation

-

CNNs have become one of the most commonly used computational tools in the past two decades. One of the primary downsides of CNNs is that they work as a "black box", where the user cannot necessarily know how the image data are analyzed, and therefore needs to rely on empirical evaluation to test the efficacy of a trained CNN. This can lead to hidden biases that affect the performance evaluation of neural networks, but are difficult to identify. Here we discuss examples of such hidden biases in common and widely used benchmark datasets, and propose techniques for identifying dataset biases that can affect the standard performance evaluation metrics. 0.667 One effective approach to identify dataset bias is to perform image classification by using merely blank background parts of the original images. However, in some situations a blank background in the images is not available, making it more difficult to separate foreground or contextual information from the bias. To overcome this, we propose a method to identify dataset bias without the need to crop background information from the images. That method is based on applying several image transforms to the original images, including Fourier transform, wavelet transforms, median filter, and their combinations. These transforms were applied to recover background bias information that CNNs use to classify images. These transformations affect the contextual visual information in a different manner than they affect the systemic background bias. Therefore, the method can distinguish between contextual information and the bias, and alert on the presence of background bias even without the need to separate sub-image parts from the blank background of the original images. Code used in the experiments is publicly available.

+

Multimodal recommendation systems can learn users' preferences from existing user-item interactions as well as the semantics of multimodal data associated with items. Many existing methods model this through a multimodal user-item graph, approaching multimodal recommendation as a graph learning task. Graph Neural Networks (GNNs) have shown promising performance in this domain. Prior research has capitalized on GNNs' capability to capture neighborhood information within certain receptive fields (typically denoted by the number of hops, $K$) to enrich user and item semantics. We observe that the optimal receptive fields for GNNs can vary across different modalities. In this paper, we propose GNNs with Modality-Independent Receptive Fields, which employ separate GNNs with independent receptive fields for different modalities to enhance performance. Our results indicate that the optimal $K$ for certain modalities on specific datasets can be as low as 1 or 2, which may restrict the GNNs' capacity to capture global information. To address this, we introduce a Sampling-based Global Transformer, which utilizes uniform global sampling to effectively integrate global information for GNNs. We conduct comprehensive experiments that demonstrate the superiority of our approach over existing methods. 0.631 Our code is publicly available at https://github.com/CrawlScript/MIG-GT.

- + link

@@ -1712,20 +1734,20 @@

Benchmarks

-

2024-12-17

+

2024-12-18

- Accuracy Limits as a Barrier to Biometric System Security + Discovering maximally consistent distribution of causal tournaments with Large Language Models

-

Biometric systems are widely used for identity verification and identification, including authentication (i.e., one-to-one matching to verify a claimed identity) and identification (i.e., one-to-many matching to find a subject in a database). The matching process relies on measuring similarities or dissimilarities between a fresh biometric template and enrolled templates. The False Match Rate (FMR) is a key metric for assessing the accuracy and reliability of such systems. 0.616 This paper analyzes biometric systems based on their FMR, with two main contributions. First, we explore untargeted attacks, where an adversary aims to impersonate any user within a database. We determine the number of trials required for an attacker to successfully impersonate a user and derive the critical population size (i.e., the maximum number of users in the database) required to maintain a given level of security. Furthermore, we compute the critical FMR value needed to ensure resistance against untargeted attacks as the database size increases. Second, we revisit the biometric birthday problem to evaluate the approximate and exact probabilities that two users in a database collide (i.e., can impersonate each other). Based on this analysis, we derive both the approximate critical population size and the critical FMR value needed to bound the likelihood of such collisions occurring with a given probability. These thresholds offer insights for designing systems that mitigate the risk of impersonation and collisions, particularly in large-scale biometric databases. Our findings indicate that current biometric systems fail to deliver sufficient accuracy to achieve an adequate security level against untargeted attacks, even in small-scale databases. Moreover, state-of-the-art systems face significant challenges in addressing the biometric birthday problem, especially as database sizes grow.

+

Causal discovery is essential for understanding complex systems, yet traditional methods often depend on strong, untestable assumptions, making the process challenging. Large Language Models (LLMs) present a promising alternative for extracting causal insights from text-based metadata, which consolidates domain expertise. However, LLMs are prone to unreliability and hallucinations, necessitating strategies that account for their limitations. One such strategy involves leveraging a consistency measure to evaluate reliability. 0.6 Additionally, most text metadata does not clearly distinguish direct causal relationships from indirect ones, further complicating the inference of causal graphs. As a result, focusing on causal orderings, rather than causal graphs, emerges as a more practical and robust approach. We propose a novel method to derive a distribution of acyclic tournaments (representing plausible causal orders) that maximizes a consistency score. Our approach begins by computing pairwise consistency scores between variables, yielding a cyclic tournament that aggregates these scores. From this structure, we identify optimal acyclic tournaments compatible with the original tournament, prioritizing those that maximize consistency across all configurations. We tested our method on both classical and well-established benchmarks, as well as real-world datasets from epidemiology and public health. Our results demonstrate the effectiveness of our approach in recovering distributions of causal orders with minimal error.

- + link

@@ -1734,20 +1756,20 @@

Benchmarks

-

2024-12-17

+

2024-12-18

- AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark + A Computationally Grounded Framework for Cognitive Attitudes (extended version)

-

Evaluation plays a crucial role in the advancement of information retrieval (IR) models.However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently.To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench).AIR-Bench is distinguished by three key features: 1) Automated.The testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention.2) Heterogeneous.The testing data in AIR-Bench is generated with respect to diverse tasks, domains and languages.3) Dynamic.The domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. 0.617We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora.Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models.The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.

+

We introduce a novel language for reasoning about agents' cognitive attitudes of both epistemic and motivational type.We interpret it by means of a computationally grounded semantics using belief bases.Our language includes five types of modal operators for implicit belief, complete attraction, complete repulsion, realistic attraction and realistic repulsion.We give an axiomatization and show that our operators are not mutually expressible and that they can be combined to represent a large variety of psychological concepts including ambivalence, indifference, being motivated, being demotivated and preference.We present a dynamic extension of the language that supports reasoning about the effects of belief change operations.Finally, we provide a succinct formulation of model checking for our languages and a PSPACE model checking algorithm relying on a reduction into TQBF.We present some experimental results for the implemented algorithm on computation time in a concrete example. 0.631

- + link

@@ -1756,20 +1778,20 @@

Benchmarks

-

2024-12-17

+

2024-12-18

- Improving Explainability of Sentence-level Metrics via Edit-level Attribution for Grammatical Error Correction + Online MDP with Transition Prototypes: A Robust Adaptive Approach

-

Various evaluation metrics have been proposed for Grammatical Error Correction (GEC), but many, particularly reference-free metrics, lack explainability. 0.603 This lack of explainability hinders researchers from analyzing the strengths and weaknesses of GEC models and limits the ability to provide detailed feedback for users. To address this issue, we propose attributing sentence-level scores to individual edits, providing insight into how specific corrections contribute to the overall performance. For the attribution method, we use Shapley values, from cooperative game theory, to compute the contribution of each edit. Experiments with existing sentence-level metrics demonstrate high consistency across different edit granularities and show approximately 70% alignment with human evaluations. 0.612 In addition, we analyze biases in the metrics based on the attribution results, revealing trends such as the tendency to ignore orthographic edits. Our implementation is available at https://github.com/naist-nlp/gec-attribute.

+

In this work, we consider an online robust Markov Decision Process (MDP) where we have the information of finitely many prototypes of the underlying transition kernel.We consider an adaptively updated ambiguity set of the prototypes and propose an algorithm that efficiently identifies the true underlying transition kernel while guaranteeing the performance of the corresponding robust policy.To be more specific, we provide a sublinear regret of the subsequent optimal robust policy.We also provide an early stopping mechanism and a worst-case performance bound of the value function.In numerical experiments, we demonstrate that our method outperforms existing approaches, particularly in the early stage with limited data. 0.681This work contributes to robust MDPs by considering possible prior information about the underlying transition probability and online learning, offering both theoretical insights and practical algorithms for improved decision-making under uncertainty.

- + link

@@ -1778,20 +1800,20 @@

Benchmarks

-

2024-12-17

+

2024-12-18

- Previous Knowledge Utilization In Online Anytime Belief Space Planning + On the Use of Abundant Road Speed Data for Travel Demand Calibration of Urban Traffic Simulators

-

Online planning under uncertainty remains a critical challenge in robotics and autonomous systems.While tree search techniques are commonly employed to construct partial future trajectories within computational constraints, most existing methods discard information from previous planning sessions considering continuous spaces.This study presents a novel, computationally efficient approach that leverages historical planning data in current decision-making processes.We provide theoretical foundations for our information reuse strategy and introduce an algorithm based on Monte Carlo Tree Search (MCTS) that implements this approach.Experimental results demonstrate that our method significantly reduces computation time while maintaining high performance levels. 0.695Our findings suggest that integrating historical planning information can substantially improve the efficiency of online decision-making in uncertain environments, paving the way for more responsive and adaptive autonomous systems.

+

This work develops a compute-efficient algorithm to tackle a fundamental problem in transportation: that of urban travel demand estimation.It focuses on the calibration of origin-destination travel demand input parameters for high-resolution traffic simulation models.It considers the use of abundant traffic road speed data.The travel demand calibration problem is formulated as a continuous, high-dimensional, simulation-based optimization (SO) problem with bound constraints.There is a lack of compute efficient algorithms to tackle this problem.We propose the use of an SO algorithm that relies on an efficient, analytical, differentiable, physics-based traffic model, known as a metamodel or surrogate model.We formulate a metamodel that enables the use of road speed data.Tests are performed on a Salt Lake City network.We study how the amount of data, as well as the congestion levels, impact both in-sample and out-of-sample performance.The proposed method outperforms the benchmark for both in-sample and out-of-sample performance by 84.4% and 72.2% in terms of speeds and counts, respectively. 0.729Most importantly, the proposed method yields the highest compute efficiency, identifying solutions with good performance within few simulation function evaluations (i.e., with small samples). 0.693

- + link

@@ -1800,20 +1822,20 @@

Benchmarks

-

2024-12-17

+

2024-12-18

- Analyzing Toxicity in Open Source Software Communications Using Psycholinguistics and Moral Foundations Theory + On Calibration in Multi-Distribution Learning

-

Studies have shown that toxic behavior can cause contributors to leave, and hinder newcomers' (especially from underrepresented communities) participation in Open Source Software (OSS) projects. Thus, detection of toxic language plays a crucial role in OSS collaboration and inclusivity. Off-the-shelf toxicity detectors are ineffective when applied to OSS communications, due to the distinct nature of toxicity observed in these channels (e.g., entitlement and arrogance are more frequently observed on GitHub than on Reddit or Twitter). In this paper, we investigate a machine learning-based approach for the automatic detection of toxic communications in OSS. We leverage psycholinguistic lexicons and Moral Foundations Theory to analyze toxicity in two types of OSS communication channels: issue comments and code reviews. Our evaluation indicates that our approach can achieve a significant performance improvement (up to 7% increase in F1 score) over the existing domain-specific toxicity detector. 0.603 We found that using moral values as features is more effective than linguistic cues, resulting in 67.50% F1-measure in identifying toxic instances in code review data and 64.83% in issue comments. While the detection accuracy is far from perfect, this improvement demonstrates the potential of integrating moral and psycholinguistic features in toxicity detection models. These findings highlight the importance of context-specific models that consider the unique communication styles within OSS, where interpersonal and value-driven language dynamics differ markedly from general social media platforms. Future work could focus on refining these models to further enhance detection accuracy, possibly by incorporating community-specific norms and conversational context to better capture the nuanced expressions of toxicity in OSS environments.

+

Modern challenges of robustness, fairness, and decision-making in machine learning have led to the formulation of multi-distribution learning (MDL) frameworks in which a predictor is optimized across multiple distributions.We study the calibration properties of MDL to better understand how the predictor performs uniformly across the multiple distributions.Through classical results on decomposing proper scoring losses, we first derive the Bayes optimal rule for MDL, demonstrating that it maximizes the generalized entropy of the associated loss function.Our analysis reveals that while this approach ensures minimal worst-case loss, it can lead to non-uniform calibration errors across the multiple distributions and there is an inherent calibration-refinement trade-off, even at Bayes optimality. 0.615Our results highlight a critical limitation: despite the promise of MDL, one must use caution when designing predictors tailored to multiple distributions so as to minimize disparity.

- + link

@@ -1937,20 +1959,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- Emerging Security Challenges of Large Language Models + Muse: A Multimodal Conversational Recommendation Dataset with Scenario-Grounded User Profiles

-

Large language models (LLMs) have achieved record adoption in a short period of time across many different sectors, including high-importance areas such as education [4] and healthcare [23]. 0.688 LLMs are open-ended models trained on diverse data without being tailored for specific downstream tasks, enabling broad applicability across various domains. 0.673 They are commonly used for text generation, but also widely used to assist with code generation [3], and even analysis of security information, as Microsoft Security Copilot demonstrates [18]. Traditional Machine Learning (ML) models are vulnerable to adversarial attacks [9]. Concerns about the potential security implications of such wide-scale adoption of LLMs have therefore led to the creation of this working group on the security of LLMs. 0.775 During the Dagstuhl seminar on "Network Attack Detection and Defense - AI-Powered Threats and Responses", the working group discussions focused on the vulnerability of LLMs to adversarial attacks, rather than their potential use in generating malware or enabling cyberattacks. 0.699 Although we note the potential threat represented by the latter, the role of the LLMs in such uses is mostly as an accelerator for development, similar to what it is in benign use. 0.78 To make the analysis more specific, the working group employed ChatGPT as a concrete example of an LLM and addressed the following points, which also form the structure of this report: 1. How do LLMs differ in vulnerabilities from traditional ML models? 0.726 2. What are the attack objectives in LLMs? 0.727 3. How complex is it to assess the risks posed by the vulnerabilities of LLMs? 0.743 4. What is the supply chain in LLMs, how does data flow in and out of systems, and what are the security implications? 0.752 We conclude with an overview of open challenges and outlook.

+

Current conversational recommendation systems focus predominantly on text. However, real-world recommendation settings are generally multimodal, causing a significant gap between existing research and practical applications. To address this issue, we propose Muse, the first multimodal conversational recommendation dataset. Muse comprises 83,148 utterances from 7,000 conversations centered around the Clothing domain. Each conversation contains comprehensive multimodal interactions, rich elements, and natural dialogues. Data in Muse are automatically synthesized by a multi-agent framework powered by multimodal large language models (MLLMs). It innovatively derives user profiles from real-world scenarios rather than depending on manual design and history data for better scalability, and then it fulfills conversation simulation and optimization. Both human and LLM evaluations demonstrate the high quality of conversations in Muse. 0.697 Additionally, fine-tuning experiments on three MLLMs demonstrate Muse's learnable patterns for recommendations and responses, confirming its value for multimodal conversational recommendation. Our dataset and code are available at https://anonymous.4open.science/r/Muse-0086.

- + link

@@ -1959,20 +1981,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- Dynamic safety cases for frontier AI + Research on the Proximity Relationships of Psychosomatic Disease Knowledge Graph Modules Extracted by Large Language Models

-

Frontier artificial intelligence (AI) systems present both benefits and risks to society.Safety cases - structured arguments supported by evidence - are one way to help ensure the safe development and deployment of these systems.Yet the evolving nature of AI capabilities, as well as changes in the operational environment and understanding of risk, necessitates mechanisms for continuously updating these safety cases.Typically, in other sectors, safety cases are produced pre-deployment and do not require frequent updates post-deployment, which can be a manual, costly process.This paper proposes a Dynamic Safety Case Management System (DSCMS) to support both the initial creation of a safety case and its systematic, semi-automated revision over time. 0.637Drawing on methods developed in the autonomous vehicles (AV) sector - state-of-the-art Checkable Safety Arguments (CSA) combined with Safety Performance Indicators (SPIs) recommended by UL 4600, a DSCMS helps developers maintain alignment between system safety claims and the latest system state.We demonstrate this approach on a safety case template for offensive cyber capabilities and suggest ways it can be integrated into governance structures for safety-critical decision-making.While the correctness of the initial safety argument remains paramount - particularly for high-severity risks - a DSCMS provides a framework for adapting to new insights and strengthening incident response. 0.613We outline challenges and further work towards development and implementation of this approach as part of continuous safety assurance of frontier AI systems.

+

As social changes accelerate, the incidence of psychosomatic disorders has significantly increased, becoming a major challenge in global health issues.This necessitates an innovative knowledge system and analytical methods to aid in diagnosis and treatment.Here, we establish the ontology model and entity types, using the BERT model and LoRA-tuned LLM for named entity recognition, constructing the knowledge graph with 9668 triples. 0.601Next, by analyzing the network distances between disease, symptom, and drug modules, it was found that closer network distances among diseases can predict greater similarities in their clinical manifestations, treatment approaches, and psychological mechanisms, and closer distances between symptoms indicate that they are more likely to co-occur.Lastly, by comparing the proximity d and proximity z score, it was shown that symptom-disease pairs in primary diagnostic relationships have a stronger association and are of higher referential value than those in diagnostic relationships.The research results revealed the potential connections between diseases, co-occurring symptoms, and similarities in treatment strategies, providing new perspectives for the diagnosis and treatment of psychosomatic disorders and valuable information for future mental health research and practice.

- + link

@@ -1981,20 +2003,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- Tracking the Feature Dynamics in LLM Training: A Mechanistic Study + Explainable Multi-Modal Data Exploration in Natural Language via LLM Agent

-

Understanding training dynamics and feature evolution is crucial for the mechanistic interpretability of large language models (LLMs).Although sparse autoencoders (SAEs) have been used to identify features within LLMs, a clear picture of how these features evolve during training remains elusive.In this study, we: (1) introduce SAE-Track, a method to efficiently obtain a continual series of SAEs; (2) formulate the process of feature formation and conduct a mechanistic analysis; and (3) analyze and visualize feature drift during training.Our work provides new insights into the dynamics of features in LLMs, enhancing our understanding of training mechanisms and feature evolution. 0.693

+

International enterprises, organizations, or hospitals collect large amounts of multi-modal data stored in databases, text documents, images, and videos. While there has been recent progress in the separate fields of multi-modal data exploration as well as in database systems that automatically translate natural language questions to database query languages, the research challenge of querying database systems combined with other unstructured modalities such as images in natural language is largely unexplored. In this paper, we propose XMODE - a system that enables explainable, multi-modal data exploration in natural language. Our approach is based on the following research contributions: (1) Our system is inspired by a real-world use case that enables users to explore multi-modal information systems. (2) XMODE leverages an LLM-based agentic AI framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis. 0.618 (3) Experimental results on multi-modal datasets over relational data and images demonstrate that our system outperforms state-of-the-art multi-modal exploration systems, excelling not only in accuracy but also in various performance metrics such as query latency, API costs, planning efficiency, and explanation quality, thanks to the more effective utilization of the reasoning capabilities of LLMs. 0.616

- + link

@@ -2003,20 +2025,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- SCBench: A Sports Commentary Benchmark for Video LLMs + Is Large Language Model Good at Triple Set Prediction? An Empirical Study

-

Recently, significant advances have been made in Video Large Language Models (Video LLMs) in both academia and industry. 0.629 However, methods to evaluate and benchmark the performance of different Video LLMs, especially their fine-grained, temporal visual capabilities, remain very limited. 0.683 On one hand, current benchmarks use relatively simple videos (e.g., subtitled movie clips) where the model can understand the entire video by processing just a few frames. On the other hand, their datasets lack diversity in task format, comprising only QA or multi-choice QA, which overlooks the models' capacity for generating in-depth and precise texts. Sports videos, which feature intricate visual information, sequential events, and emotionally charged commentary, present a critical challenge for Video LLMs, making sports commentary an ideal benchmarking task. 0.657 Inspired by these challenges, we propose a novel task, sports video commentary generation, and develop SCBench for Video LLMs. 0.611 To construct such a benchmark, we introduce (1) SCORES, a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method, and (2) CommentarySet, a dataset consisting of 5,775 annotated video clips and ground-truth labels tailored to our metric. Based on SCBench, we conduct comprehensive evaluations on multiple Video LLMs (e.g., VILA and Video-LLaVA) and chain-of-thought baseline methods. 0.632 Our results show that InternVL-Chat-2 achieves the best performance with 5.44, surpassing the second-best by 1.04. Our work provides a fresh perspective for future research, aiming to enhance models' overall capabilities in complex visual understanding tasks. Our dataset will be released soon.

+

The core of the Knowledge Graph Completion (KGC) task is to predict and complete the missing relations or nodes in a KG. Common KGC tasks are mostly about inferring unknown elements with one or two elements being known in a triple. In comparison, the Triple Set Prediction (TSP) task is a more realistic knowledge graph completion task. It aims to predict all elements of unknown triples based on the information from known triples. In recent years, large language models (LLMs) have exhibited significant advancements in language comprehension, demonstrating considerable potential for KGC tasks. 0.675 However, the potential of LLMs on the TSP task has yet to be investigated. 0.719 Thus, in this paper we propose a new framework to explore the strengths and limitations of LLMs in the TSP task. 0.696 Specifically, the framework consists of LLM-based rule mining and LLM-based triple set prediction. The relation list of the KG, embedded with rich semantic information, is first leveraged to prompt the LLM to generate rules. 0.658 This process is both efficient and independent of statistical information, making it easier to mine effective and realistic rules. For each subgraph, the specified rule is applied in conjunction with the relevant triples within that subgraph to guide the LLM in predicting the missing triples. Subsequently, the predictions from all subgraphs are consolidated to derive the complete set of predicted triples on the KG. Finally, the method is evaluated on the relatively complete CFamily dataset. The experimental results indicate that when LLMs are required to adhere to a large amount of factual knowledge to predict missing triples, significant hallucinations occur, leading to a noticeable decline in performance. 0.718 To further explore the causes of this phenomenon, this paper presents a comprehensive analysis supported by a detailed case study.

- + link

@@ -2025,20 +2047,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- Detecting anxiety and depression in dialogues: a multi-label and explainable approach + 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

-

Anxiety and depression are the most common mental health issues worldwide, affecting a non-negligible part of the population. Accordingly, stakeholders, including governments' health systems, are developing new strategies to promote early detection and prevention from a holistic perspective (i.e., addressing several disorders simultaneously). In this work, an entirely novel system for the multi-label classification of anxiety and depression is proposed. The input data consists of dialogues from user interactions with an assistant chatbot. Another relevant contribution lies in using Large Language Models (LLMs) for feature extraction, given the complexity and variability of language. The combination of LLMs, given their high capability for language understanding, and Machine Learning (ML) models, given their contextual knowledge about the classification problem thanks to the labeled data, constitutes a promising approach towards mental health assessment. 0.629 To promote the solution's trustworthiness, reliability, and accountability, explainability descriptions of the model's decision are provided in a graphical dashboard. Experimental results on a real dataset attain 90% accuracy, improving on those in the prior literature. The ultimate objective is to contribute in an accessible and scalable way before formal treatment occurs in the healthcare systems.

+

A 3D scene graph represents a compact scene model, storing information about the objects and the semantic relationships between them, making its use promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be capable of responding to various queries about the scene formulated in natural language. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. 0.689 Recent methods for creating learnable representations of 3D scenes have demonstrated the potential to improve the quality of LLM responses by adapting to the 3D world. 0.68 However, the existing methods do not explicitly utilize information about the semantic relationships between objects, limiting themselves to information about their coordinates. In this work, we propose a method, 3DGraphLLM, for constructing a learnable representation of a 3D scene graph. The learnable representation is used as input for LLMs to perform 3D vision-language tasks. 0.684 In our experiments on the popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.

- + link

@@ -2047,20 +2069,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- Generating Completions for Fragmented Broca's Aphasic Sentences Using Large Language Models + Think or Remember? Detecting and Directing LLMs Towards Memorization or Generalization

-

Broca's aphasia is a type of aphasia characterized by non-fluent, effortful and fragmented speech production with relatively good comprehension.Since traditional aphasia treatment methods are often time-consuming, labour-intensive, and do not reflect real-world conversations, applying natural language processing based approaches such as Large Language Models (LLMs) could potentially contribute to improving existing treatment approaches. 0.65To address this issue, we explore the use of sequence-to-sequence LLMs for completing fragmented Broca's aphasic sentences.We first generate synthetic Broca's aphasic data using a rule-based system designed to mirror the linguistic characteristics of Broca's aphasic speech.Using this synthetic data, we then fine-tune four pre-trained LLMs on the task of completing fragmented sentences. 0.629We evaluate our fine-tuned models on both synthetic and authentic Broca's aphasic data.We demonstrate LLMs' capability for reconstructing fragmented sentences, with the models showing improved performance with longer input utterances. 0.661Our result highlights the LLMs' potential in advancing communication aids for individuals with Broca's aphasia and possibly other clinical populations. 0.678

+

In this paper, we explore the foundational mechanisms of memorization and generalization in Large Language Models (LLMs), inspired by the functional specialization observed in the human brain.Our investigation serves as a case study leveraging specially designed datasets and experimental-scale LLMs to lay the groundwork for understanding these behaviors. 0.678Specifically, we aim to first enable LLMs to exhibit both memorization and generalization by training with the designed dataset, then (a) examine whether LLMs exhibit neuron-level spatial differentiation for memorization and generalization, (b) predict these behaviors using model internal representations, and (c) steer the behaviors through inference-time interventions. 0.642Our findings reveal that neuron-wise differentiation of memorization and generalization is observable in LLMs, and targeted interventions can successfully direct their behavior. 0.685

- + link

@@ -2069,20 +2091,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- Large Language Model Safety: A Holistic Survey + Large Language Model guided Deep Reinforcement Learning for Decision Making in Autonomous Driving

-

The rapid development and deployment of large language models (LLMs) have introduced a new frontier in artificial intelligence, marked by unprecedented capabilities in natural language understanding and generation. 0.687 However, the increasing integration of these models into critical applications raises substantial safety concerns, necessitating a thorough examination of their potential risks and associated mitigation strategies. This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks. 0.69 In addition to the comprehensive review of the mitigation methodologies and evaluation resources on these four aspects, we further explore four topics related to LLM safety: the safety implications of LLM agents, the role of interpretability in enhancing LLM safety, the technology roadmaps proposed and abided by a list of AI companies and institutes for LLM safety, and AI governance aimed at LLM safety with discussions on international cooperation, policy proposals, and prospective regulatory directions. 0.71 Our findings underscore the necessity for a proactive, multifaceted approach to LLM safety, emphasizing the integration of technical solutions, ethical considerations, and robust governance frameworks. 0.719 This survey is intended to serve as a foundational resource for academic researchers, industry practitioners, and policymakers, offering insights into the challenges and opportunities associated with the safe integration of LLMs into society. 0.778 Ultimately, it seeks to contribute to the safe and beneficial development of LLMs, aligning with the overarching goal of harnessing AI for societal advancement and well-being. 0.722 A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLM-Safety-Papers. 0.698

+

Deep reinforcement learning (DRL) shows promising potential for autonomous driving decision-making. However, DRL demands extensive computational resources to achieve a qualified policy in complex driving scenarios due to its low learning efficiency. Moreover, leveraging expert guidance from humans to enhance DRL performance incurs prohibitively high labor costs, which limits its practical application. In this study, we propose a novel large language model (LLM) guided deep reinforcement learning (LGDRL) framework for addressing the decision-making problem of autonomous vehicles. Within this framework, an LLM-based driving expert is integrated into the DRL to provide intelligent guidance for the learning process of DRL. 0.65 Subsequently, in order to efficiently utilize the guidance of the LLM expert to enhance the performance of DRL decision-making policies, the learning and interaction process of DRL is enhanced through an innovative expert policy constrained algorithm and a novel LLM-intervened interaction mechanism. 0.669 Experimental results demonstrate that our method not only achieves superior driving performance with a 90% task success rate but also significantly improves the learning efficiency and expert guidance utilization efficiency compared to state-of-the-art baseline algorithms. Moreover, the proposed method enables the DRL agent to maintain consistent and reliable performance in the absence of LLM expert guidance. 0.649 The code and supplementary videos are available at https://bitmobility.github.io/LGDRL/.

- + link

@@ -2091,20 +2113,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- RAGONITE: Iterative Retrieval on Induced Databases and Verbalized RDF for Conversational QA over KGs with RAG + Automated Code Review In Practice

-

Conversational question answering (ConvQA) is a convenient means of searching over RDF knowledge graphs (KGs), where a prevalent approach is to translate natural language questions to SPARQL queries. However, SPARQL has certain shortcomings: (i) it is brittle for complex intents and conversational questions, and (ii) it is not suitable for more abstract needs. 0.604 Instead, we propose a novel two-pronged system where we fuse: (i) SQL-query results over a database automatically derived from the KG, and (ii) text-search results over verbalizations of KG facts. Our pipeline supports iterative retrieval: when the results of any branch are found to be unsatisfactory, the system can automatically opt for further rounds. We put everything together in a retrieval augmented generation (RAG) setup, where an LLM generates a coherent response from accumulated search results. 0.677 We demonstrate the superiority of our proposed system over several baselines on a knowledge graph of BMW automobiles.

+

Code review is a widespread practice to improve software quality and transfer knowledge. It is often seen as time-consuming due to the need for manual effort and potential delays. Several AI-assisted tools, such as Qodo, GitHub Copilot, and Coderabbit, provide automated reviews using large language models (LLMs). The effects of such tools in the industry are yet to be examined. This study examines the impact of LLM-based automated code review tools in an industrial setting. 0.724 The study was conducted within a software development environment that adopted an AI-assisted review tool (based on open-source Qodo PR Agent). Around 238 practitioners across ten projects had access to the tool. We focused on three projects with 4,335 pull requests, 1,568 of which underwent automated reviews. Data collection comprised three sources: (1) a quantitative analysis of pull request data, including comment labels indicating whether developers acted on the automated comments, (2) surveys sent to developers regarding their experience with reviews on individual pull requests, and (3) a broader survey of 22 practitioners capturing their general opinions on automated reviews. 73.8% of automated comments were resolved. However, the average pull request closure duration increased from five hours 52 minutes to eight hours 20 minutes, with varying trends across projects. Most practitioners reported a minor improvement in code quality due to automated reviews. The LLM-based tool proved useful in software development, enhancing bug detection, increasing awareness of code quality, and promoting best practices. 0.769 However, it also led to longer pull request closure times and introduced drawbacks like faulty reviews, unnecessary corrections, and irrelevant comments.

- + link

@@ -2113,20 +2135,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- Knowledge Editing through Chain-of-Thought + Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation

-

Large Language Models (LLMs) have demonstrated exceptional capabilities across a wide range of natural language processing (NLP) tasks. 0.668 However, keeping these models up-to-date with evolving world knowledge remains a significant challenge due to the high costs of frequent retraining. To address this challenge, knowledge editing techniques have emerged to update LLMs with new information without rebuilding the model from scratch. 0.711 Among these, the in-context editing paradigm stands out for its effectiveness in integrating new knowledge while preserving the model's original capabilities. Despite its potential, existing in-context knowledge editing methods are often task-specific, focusing primarily on multi-hop QA tasks using structured knowledge triples. Moreover, their reliance on few-shot prompting for task decomposition makes them unstable and less effective in generalizing across diverse tasks. In response to these limitations, we propose EditCoT, a novel knowledge editing framework that flexibly and efficiently updates LLMs across various tasks without retraining. 0.681 EditCoT works by generating a chain-of-thought (CoT) for a given input and then iteratively refining this CoT process using a CoT editor based on updated knowledge. We evaluate EditCoT across a diverse range of benchmarks, covering multiple languages and tasks. The results demonstrate that our approach achieves state-of-the-art performance while offering superior generalization, effectiveness, and stability compared to existing methods, marking a significant advancement in the field of knowledge updating. Code and data are available at: https://github.com/bebr2/EditCoT.

+

Large Language Models (LLMs) demonstrate remarkable capabilities, yet struggle with hallucination and outdated knowledge when tasked with complex knowledge reasoning, resulting in factually incorrect outputs. 0.724 Previous studies have attempted to mitigate this by retrieving factual knowledge from large-scale knowledge graphs (KGs) to assist LLMs in logical reasoning and prediction of answers. 0.645 However, this kind of approach often introduces noise and irrelevant data, especially in situations with extensive context from multiple knowledge aspects. In this way, LLM attention can potentially be misled away from the question and relevant information. 0.75 In our study, we introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework. This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings. The Amar framework comprises two key sub-components: 1) a self-alignment module that aligns commonalities among entities, relations, and subgraphs to enhance retrieved text, thereby reducing noise interference; 2) a relevance gating module that employs a soft gate to learn the relevance score between question and multi-aspect retrieved data, to determine which information should be used to enhance LLMs' output, or even filtered altogether. Our method has achieved state-of-the-art performance on two common datasets, WebQSP and CWQ, showing a 1.9% improvement in accuracy over its best competitor and a 6.6% improvement in logical form generation over a method that directly uses retrieved text as context prompts. These results demonstrate the effectiveness of Amar in improving the reasoning of LLMs. 0.723

- + link

@@ -2135,20 +2157,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- Chumor 2.0: Towards Benchmarking Chinese Humor Understanding + Token-Budget-Aware LLM Reasoning

-

Existing humor datasets and evaluations predominantly focus on English, leaving limited resources for culturally nuanced humor in non-English languages like Chinese. To address this gap, we construct Chumor, the first Chinese humor explanation dataset that exceeds the size of existing humor datasets. Chumor is sourced from Ruo Zhi Ba, a Chinese Reddit-like platform known for sharing intellectually challenging and culturally specific jokes. We test ten LLMs through direct and chain-of-thought prompting, revealing that Chumor poses significant challenges to existing LLMs, with their accuracy slightly above random and far below human. 0.744 In addition, our analysis highlights that human-annotated humor explanations are significantly better than those generated by GPT-4o and ERNIE-4-turbo. We release Chumor at https://huggingface.co/datasets/dnaihao/Chumor, our project page is at https://dnaihao.github.io/Chumor-dataset/, our leaderboard is at https://huggingface.co/spaces/dnaihao/Chumor, and our codebase is at https://github.com/dnaihao/Chumor-dataset.

+

Reasoning is critical for large language models (LLMs) to excel in a wide range of tasks. 0.654 While methods like Chain-of-Thought (CoT) reasoning enhance LLM performance by decomposing problems into intermediate steps, they also incur significant overhead in token usage, leading to increased costs. 0.694 We find that the reasoning process of current LLMs is unnecessarily lengthy and it can be compressed by including a reasonable token budget in the prompt, but the choice of token budget plays a crucial role in the actual compression effectiveness. 0.739 We then propose a token-budget-aware LLM reasoning framework, which dynamically estimates token budgets for different problems based on reasoning complexity and uses the estimated token budgets to guide the reasoning process. 0.654 Experiments show that our method effectively reduces token costs in CoT reasoning with only a slight performance reduction, offering a practical solution to balance efficiency and accuracy in LLM reasoning. 0.675 Code: https://github.com/GeniusHTX/TALE.

- + link

@@ -2157,20 +2179,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- Reasoning to Attend: Try to Understand How Token Works + Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability

-

Current Large Multimodal Models (LMMs) empowered visual grounding typically rely on $\texttt{}$ token as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specified model (e.g., SAM). However, we observe that little research has looked into how it works. In this work, we first visualize the similarity maps, which are obtained by computing the semantic similarity between the $\texttt{}$ token and the image token embeddings derived from the last hidden layer in both the LLaVA encoder and SAM decoder. Intriguingly, we have found that a striking consistency holds in terms of activation responses in the similarity map, which reveals that what $\texttt{}$ token contributes to is the semantic similarity within image-text pairs. Specifically, $\texttt{}$ token, a placeholder expanded in text vocabulary, extensively queries among individual tokenized image patches to match the semantics of an object from text to the paired image while the Large Language Models (LLMs) are being fine-tuned. 0.602 Upon the above findings, we present READ, which facilitates LMMs' resilient $\textbf{REA}$soning capability of where to atten$\textbf{D}$ under the guidance of highly activated points borrowed from similarity maps. Remarkably, READ features an intuitive design, Similarity as Points module (SasP), which can be seamlessly applied to $\texttt{}$-like paradigms in a plug-and-play fashion. Also, extensive experiments have been conducted on the ReasonSeg and RefCOCO(+/g) datasets. To validate whether READ suffers from catastrophic forgetting of previous skills after fine-tuning, we further assess its generation ability on an augmented FP-RefCOCO(+/g) dataset. All codes and models are publicly available at https://github.com/rui-qian/READ.

+

To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. 0.68 Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. 0.655 Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate the overall rankings. This approach incentivizes models to achieve a balance rather than excelling in one dimension at the expense of some other ones. In the first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models. 0.712

- + link

@@ -2179,20 +2201,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- YuLan-Mini: An Open Data-efficient Language Model + Distilling Fine-grained Sentiment Understanding from Large Language Models

-

Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. 0.67 This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data schedule strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.

+

Fine-grained sentiment analysis (FSA) aims to extract and summarize user opinions from vast opinionated text. Recent studies demonstrate that large language models (LLMs) possess exceptional sentiment understanding capabilities. 0.668 However, directly deploying LLMs for FSA applications incurs high inference costs. 0.712 Therefore, this paper investigates the distillation of fine-grained sentiment understanding from LLMs into small language models (SLMs). 0.639 We prompt LLMs to examine and interpret the sentiments of given reviews and then utilize the generated content to pretrain SLMs. 0.711 Additionally, we develop a comprehensive FSA benchmark to evaluate both SLMs and LLMs. 0.622 Extensive experiments on this benchmark reveal that: (1) distillation significantly enhances the performance of SLMs in FSA tasks, achieving a 6.00% improvement in $F_1$-score, and the distilled model can outperform Llama-2-7b with only 220M parameters; (2) distillation equips SLMs with excellent zero-shot sentiment classification capabilities, enabling them to match or even exceed their teacher models. 0.6 These results suggest that distillation from LLMs is a highly promising direction for FSA. 0.711 We will release our code, data, and pretrained model weights at https://github.com/HITSZ-HLT/FSA-Distillation.

- + link

@@ -2201,20 +2223,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation + Zero-resource Speech Translation and Recognition with LLMs

-

Repository-level code translation refers to translating an entire code repository from one programming language to another while preserving the functionality of the source repository. Many benchmarks have been proposed to evaluate the performance of such code translators. However, previous benchmarks mostly provide fine-grained samples, focusing on snippet-, function-, or file-level code translation. Such benchmarks do not accurately reflect real-world demands, where entire repositories often need to be translated, involving longer code length and more complex functionalities. To address this gap, we propose a new benchmark, named RepoTransBench, which is a real-world repository-level code translation benchmark with an automatically executable test suite. We conduct experiments on RepoTransBench to evaluate the translation performance of 11 advanced LLMs. 0.734 We find that the Success@1 score (test success in one attempt) of the best-performing LLM is only 7.33%. 0.705 To further explore the potential of LLMs for repository-level code translation, we provide LLMs with error-related feedback to perform iterative debugging and observe an average 7.09% improvement on Success@1. 0.731 However, even with this improvement, the Success@1 score of the best-performing LLM is only 21%, which may not meet the need for reliable automatic repository-level code translation. 0.715 Finally, we conduct a detailed error analysis and highlight current LLMs' deficiencies in repository-level code translation, which could provide a reference for further improvements. 0.734

+

Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments both in ST and ASR to understand how to best train the model and what data has the most impact on performance in previously unseen languages. In ST, our best model is capable of achieving BLEU scores over 23 in CoVoST2 for two previously unseen languages, while in ASR, we achieve WERs of up to 28.2%. We finally show that the performance of our system is bounded by the ability of the LLM to output text in the desired language. 0.654

- + link

@@ -2223,20 +2245,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- Deliberation in Latent Space via Differentiable Cache Augmentation + How Well Do LLMs Generate Code for Different Application Domains? Benchmark and Evaluation

-

Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. 0.672 However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. 0.714 This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache. Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation. We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently reduces perplexity and improves performance across a range of reasoning-intensive tasks.

+

Recently, an increasing number of AI-driven programming assistants powered by code LLMs have been integrated into various real-world software development environments, significantly boosting developer productivity. 0.657 However, existing code generation benchmarks primarily focus on general-purpose scenarios, leaving the code generation performance of LLMs for specific application domains largely unknown. 0.694 In this paper, we introduce a new benchmark, MultiCodeBench, to fill this gap. MultiCodeBench comprises 2,400 programming tasks, covering 12 popular software development domains and 15 programming languages. Specifically, we perform in-depth research to identify these 12 application domains. Given that each domain may involve multiple technical frameworks, and that different frameworks present distinct challenges in the coding process, we categorize the commonly used frameworks and platforms within each domain. We then sample programming problems from GitHub repositories related to these subdomains. To ensure the quality of the tasks and mitigate data leakage issues, we invite annotators to rewrite the docstrings for each task in MultiCodeBench. Additionally, we build a static analysis-based dependency parsing tool to extract the dependencies in the ground truth for each task, enabling deeper performance analysis. Through extensive experiments on MultiCodeBench with eleven representative mainstream LLMs, we reveal the code generation performance of the LLMs across different application domains, providing practical insights for developers in downstream fields when selecting LLMs. 0.708 Furthermore, we analyze the reasons behind the models' failures in completing software application development tasks, offering guidance for model developers to enhance domain-specific code generation capabilities.

- + link

@@ -2245,20 +2267,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- ADC: Enhancing Function Calling Via Adversarial Datasets and Code Line-Level Feedback + A Paragraph is All It Takes: Rich Robot Behaviors from Interacting, Trusted LLMs

-

Large Language Models (LLMs) have made significant strides in Natural Language Processing and coding, yet they struggle with robustness and accuracy in complex function calls. 0.631 To tackle these challenges, this paper introduces ADC, an innovative approach that enhances LLMs' ability to follow function formats and match complex parameters. 0.647 ADC utilizes a high-quality code fine-tuning dataset with line-level execution feedback, providing granular process supervision that fosters strong logical reasoning and adherence to function formats. It also employs an adversarial dataset generation process to improve parameter matching. The staged training methodology capitalizes on both enriched code datasets and refined adversarial datasets, leading to marked improvements in function calling capabilities on the Berkeley Function-Calling Leaderboard (BFCL) Benchmark. The innovation of ADC lies in its strategic combination of process supervision, adversarial refinement, and incremental learning, setting a new standard for LLM proficiency in complex function calling. 0.608

+

Large Language Models (LLMs) are compact representations of all public knowledge of our physical environment and animal and human behaviors. 0.693 The application of LLMs to robotics may offer a path to highly capable robots that perform well across most human tasks with limited or even zero tuning. 0.736 Aside from increasingly sophisticated reasoning and task planning, networks of (suitably designed) LLMs offer ease of upgrading capabilities and allow humans to directly observe the robot's thinking. 0.75 Here we explore the advantages, limitations, and particularities of using LLMs to control physical robots. 0.734 The basic system consists of four LLMs communicating via a human language data bus implemented via web sockets and ROS2 message passing. 0.716 Surprisingly, rich robot behaviors and good performance across different tasks could be achieved despite the robot's data fusion cycle running at only 1Hz and the central data bus running at the extremely limited rates of the human brain, of around 40 bits/s. The use of natural language for inter-LLM communication allowed the robot's reasoning and decision making to be directly observed by humans and made it trivial to bias the system's behavior with sets of rules written in plain English. 0.641 These rules were immutably written into Ethereum, a global, public, and censorship-resistant Turing-complete computer. We suggest that by using natural language as the data bus among interacting AIs, and immutable public ledgers to store behavior constraints, it is possible to build robots that combine unexpectedly rich performance, upgradability, and durable alignment with humans.

- + link

@@ -2267,20 +2289,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- In Case You Missed It: ARC 'Challenge' Is Not That Challenging + Decentralized Intelligence in GameFi: Embodied AI Agents and the Convergence of DeFi and Virtual Ecosystems

-

ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. 0.671 Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.

+

In the rapidly evolving landscape of GameFi, a fusion of gaming and decentralized finance (DeFi), there exists a critical need to enhance player engagement and economic interaction within gaming ecosystems. Our GameFi ecosystem aims to fundamentally transform this landscape by integrating advanced embodied AI agents into GameFi platforms. These AI agents, developed using cutting-edge large language models (LLMs), such as GPT-4 and Claude AI, are capable of proactive, adaptive, and contextually rich interactions with players. 0.617 By going beyond traditional scripted responses, these agents become integral participants in the game's narrative and economic systems, directly influencing player strategies and in-game economies. We address the limitations of current GameFi platforms, which often lack immersive AI interactions and mechanisms for community engagement or creator monetization. Through the deep integration of AI agents with blockchain technology, we establish a consensus-driven, decentralized GameFi ecosystem. This ecosystem empowers creators to monetize their contributions and fosters democratic collaboration among players and creators. Furthermore, by embedding DeFi mechanisms into the gaming experience, we enhance economic participation and provide new opportunities for financial interactions within the game. Our approach enhances player immersion and retention and advances the GameFi ecosystem by bridging traditional gaming with Web3 technologies. By integrating sophisticated AI and DeFi elements, we contribute to the development of more engaging, economically robust, and community-centric gaming environments. This project represents a significant advancement in the state-of-the-art in GameFi, offering insights and methodologies that can be applied throughout the gaming industry.

- + link

@@ -2288,21 +2310,26 @@

LLMs

+ + +

Developer Research

+ + -

2024-12-23

+

2024-12-24

- ResearchTown: Simulator of Human Research Community + Automated Code Review In Practice

-

Large Language Models (LLMs) have demonstrated remarkable potential in scientific domains, yet a fundamental question remains unanswered: Can we simulate human research communities with LLMs? 0.673 Addressing this question can deepen our understanding of the processes behind idea brainstorming and inspire the automatic discovery of novel scientific insights. In this work, we propose ResearchTown, a multi-agent framework for research community simulation. Within this framework, the human research community is simplified and modeled as an agent-data graph, where researchers and papers are represented as agent-type and data-type nodes, respectively, and connected based on their collaboration relationships. We also introduce TextGNN, a text-based inference framework that models various research activities (e.g., paper reading, paper writing, and review writing) as special forms of a unified message-passing process on the agent-data graph. To evaluate the quality of the research simulation, we present ResearchBench, a benchmark that uses a node-masking prediction task for scalable and objective assessment based on similarity. Our experiments reveal three key findings: (1) ResearchTown can provide a realistic simulation of collaborative research activities, including paper writing and review writing; (2) ResearchTown can maintain robust simulation with multiple researchers and diverse papers; (3) ResearchTown can generate interdisciplinary research ideas that potentially inspire novel research directions.

+

Code review is a widespread practice to improve software quality and transfer knowledge. 0.63 It is often seen as time-consuming due to the need for manual effort and potential delays. Several AI-assisted tools, such as Qodo, GitHub Copilot, and Coderabbit, provide automated reviews using large language models (LLMs). The effects of such tools in the industry are yet to be examined. This study examines the impact of LLM-based automated code review tools in an industrial setting. The study was conducted within a software development environment that adopted an AI-assisted review tool (based on open-source Qodo PR Agent). Around 238 practitioners across ten projects had access to the tool. We focused on three projects with 4,335 pull requests, 1,568 of which underwent automated reviews. Data collection comprised three sources: (1) a quantitative analysis of pull request data, including comment labels indicating whether developers acted on the automated comments, (2) surveys sent to developers regarding their experience with reviews on individual pull requests, and (3) a broader survey of 22 practitioners capturing their general opinions on automated reviews. 73.8% of automated comments were resolved. However, the average pull request closure duration increased from five hours 52 minutes to eight hours 20 minutes, with varying trends across projects. Most practitioners reported a minor improvement in code quality due to automated reviews. 0.6 The LLM-based tool proved useful in software development, enhancing bug detection, increasing awareness of code quality, and promoting best practices. However, it also led to longer pull request closure times and introduced drawbacks like faulty reviews, unnecessary corrections, and irrelevant comments.

- + link

@@ -2311,20 +2338,20 @@

LLMs

-

2024-12-23

+

2024-12-24

- Memory makes computation universal, remember? + How Well Do LLMs Generate Code for Different Application Domains? Benchmark and Evaluation

-

Recent breakthroughs in AI capability have been attributed to increasingly sophisticated architectures and alignment techniques, but a simpler principle may explain these advances: memory makes computation universal. Memory enables universal computation through two fundamental capabilities: recursive state maintenance and reliable history access. 0.601 We formally prove these requirements are both necessary and sufficient for universal computation. This principle manifests across scales, from cellular computation to neural networks to language models. Complex behavior emerges not from sophisticated processing units but from maintaining and accessing state across time. We demonstrate how parallel systems like neural networks achieve universal computation despite limitations in their basic units by maintaining state across iterations. This theoretical framework reveals a universal pattern: computational advances consistently emerge from enhanced abilities to maintain and access state rather than from more complex basic operations. Our analysis unifies understanding of computation across biological systems, artificial intelligence, and human cognition, reminding us that humanity's own computational capabilities have evolved in step with our technical ability to remember through oral traditions, writing, and now computing.

+

Recently, an increasing number of AI-driven programming assistants powered by code LLMs have been integrated into various real-world software development environments, significantly boosting developer productivity. However, existing code generation benchmarks primarily focus on general-purpose scenarios, leaving the code generation performance of LLMs for specific application domains largely unknown. In this paper, we introduce a new benchmark, MultiCodeBench, to fill this gap. MultiCodeBench comprises 2,400 programming tasks, covering 12 popular software development domains and 15 programming languages. Specifically, we perform in-depth research to identify these 12 application domains. Given that each domain may involve multiple technical frameworks, and that different frameworks present distinct challenges in the coding process, we categorize the commonly used frameworks and platforms within each domain. We then sample programming problems from GitHub repositories related to these subdomains. To ensure the quality of the tasks and mitigate data leakage issues, we invite annotators to rewrite the docstrings for each task in MultiCodeBench. Additionally, we build a static analysis-based dependency parsing tool to extract the dependencies in the ground truth for each task, enabling deeper performance analysis. Through extensive experiments on MultiCodeBench with eleven representative mainstream LLMs, we reveal the code generation performance of the LLMs across different application domains, providing practical insights for developers in downstream fields when selecting LLMs. Furthermore, we analyze the reasons behind the models' failures in completing software application development tasks, offering guidance for model developers to enhance domain-specific code generation capabilities. 0.603

- + link

@@ -2332,11 +2359,6 @@

LLMs

- - -

Developer Research

- -

2024-12-17