Skip to content

Latest commit

 

History

History
201 lines (199 loc) · 306 KB

Image-Classification.md

File metadata and controls

201 lines (199 loc) · 306 KB

Image Classification

Title Date Abstract Comment CodeRepository
Learning of deep convolutional network image classifiers via stochastic gradient descent and over-parametrization 2025-03-05
Show

Image classification from independent and identically distributed random variables is considered. Image classifiers are defined which are based on a linear combination of deep convolutional networks with max-pooling layer. Here all the weights are learned by stochastic gradient descent. A general result is presented which shows that the image classifiers are able to approximate the best possible deep convolutional network. In case that the a posteriori probability satisfies a suitable hierarchical composition model it is shown that the corresponding deep convolutional neural network image classifier achieves a rate of convergence which is independent of the dimension of the images.

arXiv...

arXiv admin note: text overlap with arXiv:2312.17007

None
Inference-Scale Complexity in ANN-SNN Conversion for High-Performance and Low-Power Applications 2025-03-05
Show

Spiking Neural Networks (SNNs) have emerged as a promising substitute for Artificial Neural Networks (ANNs) due to their advantages of fast inference and low power consumption. However, the lack of efficient training algorithms has hindered their widespread adoption. Even efficient ANN-SNN conversion methods necessitate quantized training of ANNs to enhance the effectiveness of the conversion, incurring additional training costs. To address these challenges, we propose an efficient ANN-SNN conversion framework with only inference scale complexity. The conversion framework includes a local threshold balancing algorithm, which enables efficient calculation of the optimal thresholds and fine-grained adjustment of the threshold value by channel-wise scaling. We also introduce an effective delayed evaluation strategy to mitigate the influence of the spike propagation delays. We demonstrate the scalability of our framework in typical computer vision tasks: image classification, semantic segmentation, object detection, and video classification. Our algorithm outperforms existing methods, highlighting its practical applicability and efficiency. Moreover, we have evaluated the energy consumption of the converted SNNs, demonstrating their superior low-power advantage compared to conventional ANNs. This approach simplifies the deployment of SNNs by leveraging open-source pre-trained ANN models, enabling fast, low-power inference with negligible performance reduction. Code is available at https://github.com/putshua/Inference-scale-ANN-SNN.

Accep...

Accepted by CVPR 2025 (Poster)

Code Link
Lie Algebra Canonicalization: Equivariant Neural Operators under arbitrary Lie Groups 2025-03-04
Show

The quest for robust and generalizable machine learning models has driven recent interest in exploiting symmetries through equivariant neural networks. In the context of PDE solvers, recent works have shown that Lie point symmetries can be a useful inductive bias for Physics-Informed Neural Networks (PINNs) through data and loss augmentation. Despite this, directly enforcing equivariance within the model architecture for these problems remains elusive. This is because many PDEs admit non-compact symmetry groups, oftentimes not studied beyond their infinitesimal generators, making them incompatible with most existing equivariant architectures. In this work, we propose Lie aLgebrA Canonicalization (LieLAC), a novel approach that exploits only the action of infinitesimal generators of the symmetry group, circumventing the need for knowledge of the full group structure. To achieve this, we address existing theoretical issues in the canonicalization literature, establishing connections with frame averaging in the case of continuous non-compact groups. Operating within the framework of canonicalization, LieLAC can easily be integrated with unconstrained pre-trained models, transforming inputs to a canonical form before feeding them into the existing model, effectively aligning the input for model inference according to allowed symmetries. LieLAC utilizes standard Lie group descent schemes, achieving equivariance in pre-trained models. Finally, we showcase LieLAC's efficacy on tasks of invariant image classification and Lie point symmetry equivariant neural PDE solvers using pre-trained models.

44 pa...

44 pages; accepted at ICLR 2025

None
Measurement noise scaling laws for cellular representation learning 2025-03-04
Show

Deep learning scaling laws predict how performance improves with increased model and dataset size. Here we identify measurement noise in data as another performance scaling axis, governed by a distinct logarithmic law. We focus on representation learning models of biological single cell genomic data, where a dominant source of measurement noise is due to molecular undersampling. We introduce an information-theoretic metric for cellular representation model quality, and find that it scales with sampling depth. A single quantitative relationship holds across several model types and across several datasets. We show that the analytical form of this relationship can be derived from a simple Gaussian noise model, which in turn provides an intuitive interpretation for the scaling law. Finally, we show that the same relationship emerges in image classification models with respect to two types of imaging noise, suggesting that measurement noise scaling may be a general phenomenon. Scaling with noise can serve as a guide in generating and curating data for deep learning models, particularly in fields where measurement quality can vary dramatically between datasets.

None
R2Det: Exploring Relaxed Rotation Equivariance in 2D object detection 2025-03-04
Show

Group Equivariant Convolution (GConv) empowers models to explore underlying symmetry in data, improving performance. However, real-world scenarios often deviate from ideal symmetric systems caused by physical permutation, characterized by non-trivial actions of a symmetry group, resulting in asymmetries that affect the outputs, a phenomenon known as Symmetry Breaking. Traditional GConv-based methods are constrained by rigid operational rules within group space, assuming data remains strictly symmetry after limited group transformations. This limitation makes it difficult to adapt to Symmetry-Breaking and non-rigid transformations. Motivated by this, we mainly focus on a common scenario: Rotational Symmetry-Breaking. By relaxing strict group transformations within Strict Rotation-Equivariant group $\mathbf{C}_n$, we redefine a Relaxed Rotation-Equivariant group $\mathbf{R}_n$ and introduce a novel Relaxed Rotation-Equivariant GConv (R2GConv) with only a minimal increase of $4n$ parameters compared to GConv. Based on R2GConv, we propose a Relaxed Rotation-Equivariant Network (R2Net) as the backbone and develop a Relaxed Rotation-Equivariant Object Detector (R2Det) for 2D object detection. Experimental results demonstrate the effectiveness of the proposed R2GConv in natural image classification, and R2Det achieves excellent performance in 2D object detection with improved generalization capabilities and robustness. The code is available in \texttt{https://github.com/wuer5/r2det}.

Code Link
XFMamba: Cross-Fusion Mamba for Multi-View Medical Image Classification 2025-03-04
Show

Compared to single view medical image classification, using multiple views can significantly enhance predictive accuracy as it can account for the complementarity of each view while leveraging correlations between views. Existing multi-view approaches typically employ separate convolutional or transformer branches combined with simplistic feature fusion strategies. However, these approaches inadvertently disregard essential cross-view correlations, leading to suboptimal classification performance, and suffer from challenges with limited receptive field (CNNs) or quadratic computational complexity (transformers). Inspired by state space sequence models, we propose XFMamba, a pure Mamba-based cross-fusion architecture to address the challenge of multi-view medical image classification. XFMamba introduces a novel two-stage fusion strategy, facilitating the learning of single-view features and their cross-view disparity. This mechanism captures spatially long-range dependencies in each view while enhancing seamless information transfer between views. Results on three public datasets, MURA, CheXpert and DDSM, illustrate the effectiveness of our approach across diverse multi-view medical image classification tasks, showing that it outperforms existing convolution-based and transformer-based multi-view methods. Code is available at https://github.com/XZheng0427/XFMamba.

Code Link
Remote Sensing Image Classification Using Convolutional Neural Network (CNN) and Transfer Learning Techniques 2025-03-04
Show

This study investigates the classification of aerial images depicting transmission towers, forests, farmland, and mountains. To complete the classification job, features are extracted from input photos using a Convolutional Neural Network (CNN) architecture. Then, the images are classified using Softmax. To test the model, we ran it for ten epochs using a batch size of 90, the Adam optimizer, and a learning rate of 0.001. Both training and assessment are conducted using a dataset that blends self-collected pictures from Google satellite imagery with the MLRNet dataset. The comprehensive dataset comprises 10,400 images. Our study shows that transfer learning models and MobileNetV2 in particular, work well for landscape categorization. These models are good options for practical use because they strike a good mix between precision and efficiency; our approach achieves results with an overall accuracy of 87% on the built CNN model. Furthermore, we reach even higher accuracies by utilizing the pretrained VGG16 and MobileNetV2 models as a starting point for transfer learning. Specifically, VGG16 achieves an accuracy of 90% and a test loss of 0.298, while MobileNetV2 outperforms both models with an accuracy of 96% and a test loss of 0.119; the results demonstrate the effectiveness of employing transfer learning with MobileNetV2 for classifying transmission towers, forests, farmland, and mountains.

This ...

This paper is published in Journal of Computer Science, Volume 21 No. 3, 2025. It contains 635-645 pages

None
Replay Consolidation with Label Propagation for Continual Object Detection 2025-03-04
Show

Continual Learning (CL) aims to learn new data while remembering previously acquired knowledge. In contrast to CL for image classification, CL for Object Detection faces additional challenges such as the missing annotations problem. In this scenario, images from previous tasks may contain instances of unknown classes that could reappear as labeled in future tasks, leading to task interference in replay-based approaches. Consequently, most approaches in the literature have focused on distillation-based techniques, which are effective when there is a significant class overlap between tasks. In our work, we propose an alternative to distillation-based approaches with a novel approach called Replay Consolidation with Label Propagation for Object Detection (RCLPOD). RCLPOD enhances the replay memory by improving the quality of the stored samples through a technique that promotes class balance while also improving the quality of the ground truth associated with these samples through a technique called label propagation. RCLPOD outperforms existing techniques on well-established benchmarks such as VOC and COC. Moreover, our approach is developed to work with modern architectures like YOLOv8, making it suitable for dynamic, real-world applications such as autonomous driving and robotics, where continuous learning and resource efficiency are essential.

None
Satellite image classification with neural quantum kernels 2025-03-04
Show

Achieving practical applications of quantum machine learning for real-world scenarios remains challenging despite significant theoretical progress. This paper proposes a novel approach for classifying satellite images, a task of particular relevance to the earth observation (EO) industry, using quantum machine learning techniques. Specifically, we focus on classifying images that contain solar panels, addressing a complex real-world classification problem. Our approach begins with classical pre-processing to reduce the dimensionality of the satellite image dataset. We then apply neural quantum kernels (NQKs)-quantum kernels derived from trained quantum neural networks (QNNs)-for classification. We evaluate several strategies within this framework, demonstrating results that are competitive with the best classical methods. Key findings include the robustness of or results and their scalability, with successful performance achieved up to 8 qubits.

None
Reflective-Net: Learning from Explanations 2025-03-04
Show

We examine whether data generated by explanation techniques, which promote a process of self-reflection, can improve classifier performance. Our work is based on the idea that humans have the ability to make quick, intuitive decisions as well as to reflect on their own thinking and learn from explanations. To the best of our knowledge, this is the first time that the potential of mimicking this process by using explanations generated by explainability methods has been explored. We found that combining explanations with traditional labeled data leads to significant improvements in classification accuracy and training efficiency across multiple image classification datasets and convolutional neural network architectures. It is worth noting that during training, we not only used explanations for the correct or predicted class, but also for other classes. This serves multiple purposes, including allowing for reflection on potential outcomes and enriching the data through augmentation.

None
SAR-W-MixMAE: SAR Foundation Model Training Using Backscatter Power Weighting 2025-03-04
Show

Foundation model approaches such as masked auto-encoders (MAE) or its variations are now being successfully applied to satellite imagery. Most of the ongoing technical validation of foundation models have been applied to optical images like RGB or multi-spectral images. Due to difficulty in semantic labeling to create datasets and higher noise content with respect to optical images, Synthetic Aperture Radar (SAR) data has not been explored a lot in the field for foundation models. Therefore, in this work as a pre-training approach, we explored masked auto-encoder, specifically MixMAE on Sentinel-1 SAR images and its impact on SAR image classification tasks. Moreover, we proposed to use the physical characteristic of SAR data for applying weighting parameter on the auto-encoder training loss (MSE) to reduce the effect of speckle noise and very high values on the SAR images. Proposed SAR intensity-based weighting of the reconstruction loss demonstrates promising results both on SAR pre-training and downstream tasks specifically on flood detection compared with the baseline model.

5 pages, 1 figure None
Making Better Mistakes in CLIP-Based Zero-Shot Classification with Hierarchy-Aware Language Prompts 2025-03-04
Show

Recent studies are leveraging advancements in large language models (LLMs) trained on extensive internet-crawled text data to generate textual descriptions of downstream classes in CLIP-based zero-shot image classification. While most of these approaches aim at improving accuracy, our work focuses on ``making better mistakes", of which the mistakes' severities are derived from the given label hierarchy of downstream tasks. Since CLIP's image encoder is trained with language supervising signals, it implicitly captures the hierarchical semantic relationships between different classes. This motivates our goal of making better mistakes in zero-shot classification, a task for which CLIP is naturally well-suited. Our approach (HAPrompts) queries the language model to produce textual representations for given classes as zero-shot classifiers of CLIP to perform image classification on downstream tasks. To our knowledge, this is the first work to introduce making better mistakes in CLIP-based zero-shot classification. Our approach outperforms the related methods in a holistic comparison across five datasets of varying scales with label hierarchies of different heights in our experiments. Our code and LLM-generated image prompts: \href{https://github.com/ltong1130ztr/HAPrompts}{https://github.com/ltong1130ztr/HAPrompts}.

20 pages Code Link
Sharpness-Aware Minimization: General Analysis and Improved Rates 2025-03-04
Show

Sharpness-Aware Minimization (SAM) has emerged as a powerful method for improving generalization in machine learning models by minimizing the sharpness of the loss landscape. However, despite its success, several important questions regarding the convergence properties of SAM in non-convex settings are still open, including the benefits of using normalization in the update rule, the dependence of the analysis on the restrictive bounded variance assumption, and the convergence guarantees under different sampling strategies. To address these questions, in this paper, we provide a unified analysis of SAM and its unnormalized variant (USAM) under one single flexible update rule (Unified SAM), and we present convergence results of the new algorithm under a relaxed and more natural assumption on the stochastic noise. Our analysis provides convergence guarantees for SAM under different step size selections for non-convex problems and functions that satisfy the Polyak-Lojasiewicz (PL) condition (a non-convex generalization of strongly convex functions). The proposed theory holds under the arbitrary sampling paradigm, which includes importance sampling as special case, allowing us to analyze variants of SAM that were never explicitly considered in the literature. Experiments validate the theoretical findings and further demonstrate the practical effectiveness of Unified SAM in training deep neural networks for image classification tasks.

13th ...

13th International Conference on Learning Representations (ICLR 2025)

None
Language-Informed Hyperspectral Image Synthesis for Imbalanced-Small Sample Classification via Semi-Supervised Conditional Diffusion Model 2025-03-04
Show

Data augmentation effectively addresses the imbalanced-small sample data (ISSD) problem in hyperspectral image classification (HSIC). While most methodologies extend features in the latent space, few leverage text-driven generation to create realistic and diverse samples. Recently, text-guided diffusion models have gained significant attention due to their ability to generate highly diverse and high-quality images based on text prompts in natural image synthesis. Motivated by this, this paper proposes Txt2HSI-LDM(VAE), a novel language-informed hyperspectral image synthesis method to address the ISSD in HSIC. The proposed approach uses a denoising diffusion model, which iteratively removes Gaussian noise to generate hyperspectral samples conditioned on textual descriptions. First, to address the high-dimensionality of hyperspectral data, a universal variational autoencoder (VAE) is designed to map the data into a low-dimensional latent space, which provides stable features and reduces the inference complexity of diffusion model. Second, a semi-supervised diffusion model is designed to fully take advantage of unlabeled data. Random polygon spatial clipping (RPSC) and uncertainty estimation of latent feature (LF-UE) are used to simulate the varying degrees of mixing. Third, the VAE decodes HSI from latent space generated by the diffusion model with the language conditions as input. In our experiments, we fully evaluate synthetic samples' effectiveness from statistical characteristics and data distribution in 2D-PCA space. Additionally, visual-linguistic cross-attention is visualized on the pixel level to prove that our proposed model can capture the spatial layout and geometry of the generated data. Experiments demonstrate that the performance of the proposed Txt2HSI-LDM(VAE) surpasses the classical backbone models, state-of-the-art CNNs, and semi-supervised methods.

None
ProAPO: Progressively Automatic Prompt Optimization for Visual Classification 2025-03-04
Show

Vision-language models (VLMs) have made significant progress in image classification by training with large-scale paired image-text data. Their performances largely depend on the prompt quality. While recent methods show that visual descriptions generated by large language models (LLMs) enhance the generalization of VLMs, class-specific prompts may be inaccurate or lack discrimination due to the hallucination in LLMs. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human-in-the-loop. An evolution-based algorithm is proposed to progressively optimize language prompts from task-specific templates to class-specific descriptions. Unlike optimizing templates, the search space shows an explosion in class-specific candidate prompts. This increases prompt generation costs, iterative times, and the overfitting problem. To this end, we first introduce several simple yet effective edit-based and evolution-based operations to generate diverse candidate prompts by one-time query of LLMs. Then, two sampling strategies are proposed to find a better initial search point and reduce traversed categories, saving iteration costs. Moreover, we apply a novel fitness score with entropy constraints to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets. Meanwhile, we demonstrate that our optimal prompts improve adapter-based methods and transfer effectively across different backbones.

Accep...

Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

None
Gradient-Guided Annealing for Domain Generalization 2025-03-03
Show

Domain Generalization (DG) research has gained considerable traction as of late, since the ability to generalize to unseen data distributions is a requirement that eludes even state-of-the-art training algorithms. In this paper we observe that the initial iterations of model training play a key role in domain generalization effectiveness, since the loss landscape may be significantly different across the training and test distributions, contrary to the case of i.i.d. data. Conflicts between gradients of the loss components of each domain lead the optimization procedure to undesirable local minima that do not capture the domain-invariant features of the target classes. We propose alleviating domain conflicts in model optimization, by iteratively annealing the parameters of a model in the early stages of training and searching for points where gradients align between domains. By discovering a set of parameter values where gradients are updated towards the same direction for each data distribution present in the training set, the proposed Gradient-Guided Annealing (GGA) algorithm encourages models to seek out minima that exhibit improved robustness against domain shifts. The efficacy of GGA is evaluated on five widely accepted and challenging image classification domain generalization benchmarks, where its use alone is able to establish highly competitive or even state-of-the-art performance. Moreover, when combined with previously proposed domain-generalization algorithms it is able to consistently improve their effectiveness by significant margins.

Paper...

Paper accepted in CVPR2025

None
Visual-RFT: Visual Reinforcement Fine-Tuning 2025-03-03
Show

Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via the policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by $24.3%$ over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by $21.9$ on COCO's two-shot setting and $15.4$ on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.

proje...

project page: https://github.com/Liuziyu77/Visual-RFT

Code Link
Mamba base PKD for efficient knowledge compression 2025-03-03
Show

Deep neural networks (DNNs) have remarkably succeeded in various image processing tasks. However, their large size and computational complexity present significant challenges for deploying them in resource-constrained environments. This paper presents an innovative approach for integrating Mamba Architecture within a Progressive Knowledge Distillation (PKD) process to address the challenge of reducing model complexity while maintaining accuracy in image classification tasks. The proposed framework distills a large teacher model into progressively smaller student models, designed using Mamba blocks. Each student model is trained using Selective-State-Space Models (S-SSM) within the Mamba blocks, focusing on important input aspects while reducing computational complexity. The work's preliminary experiments use MNIST and CIFAR-10 as datasets to demonstrate the effectiveness of this approach. For MNIST, the teacher model achieves 98% accuracy. A set of seven student models as a group retained 63% of the teacher's FLOPs, approximating the teacher's performance with 98% accuracy. The weak student used only 1% of the teacher's FLOPs and maintained 72% accuracy. Similarly, for CIFAR-10, the students achieved 1% less accuracy compared to the teacher, with the small student retaining 5% of the teacher's FLOPs to achieve 50% accuracy. These results confirm the flexibility and scalability of Mamba Architecture, which can be integrated into PKD, succeeding in the process of finding students as weak learners. The framework provides a solution for deploying complex neural networks in real-time applications with a reduction in computational cost.

A pre...

A preliminary version of this work was presented as a short poster titled "Mamba-PKD: A Framework for Efficient and Scalable Model Compression in Image Classification" at The 40th ACM/SIGAPP Symposium on Applied Computing https://doi.org/10.1145/3672608.3707887

None
Mathematical Foundation of Interpretable Equivariant Surrogate Models 2025-03-03
Show

This paper introduces a rigorous mathematical framework for neural network explainability, and more broadly for the explainability of equivariant operators called Group Equivariant Operators (GEOs) based on Group Equivariant Non-Expansive Operators (GENEOs) transformations. The central concept involves quantifying the distance between GEOs by measuring the non-commutativity of specific diagrams. Additionally, the paper proposes a definition of interpretability of GEOs according to a complexity measure that can be defined according to each user preferences. Moreover, we explore the formal properties of this framework and show how it can be applied in classical machine learning scenarios, like image classification with convolutional neural networks.

None
Saliency-Bench: A Comprehensive Benchmark for Evaluating Visual Explanations 2025-03-03
Show

Explainable AI (XAI) has gained significant attention for providing insights into the decision-making processes of deep learning models, particularly for image classification tasks through visual explanations visualized by saliency maps. Despite their success, challenges remain due to the lack of annotated datasets and standardized evaluation pipelines. In this paper, we introduce Saliency-Bench, a novel benchmark suite designed to evaluate visual explanations generated by saliency methods across multiple datasets. We curated, constructed, and annotated eight datasets, each covering diverse tasks such as scene classification, cancer diagnosis, object classification, and action classification, with corresponding ground-truth explanations. The benchmark includes a standardized and unified evaluation pipeline for assessing faithfulness and alignment of the visual explanation, providing a holistic visual explanation performance assessment. We benchmark these eight datasets with widely used saliency methods on different image classifier architectures to evaluate explanation quality. Additionally, we developed an easy-to-use API for automating the evaluation pipeline, from data accessing, and data loading, to result evaluation. The benchmark is available via our website: https://xaidataset.github.io.

None
HiBug2: Efficient and Interpretable Error Slice Discovery for Comprehensive Model Debugging 2025-03-03
Show

Despite the significant success of deep learning models in computer vision, they often exhibit systematic failures on specific data subsets, known as error slices. Identifying and mitigating these error slices is crucial to enhancing model robustness and reliability in real-world scenarios. In this paper, we introduce HiBug2, an automated framework for error slice discovery and model repair. HiBug2 first generates task-specific visual attributes to highlight instances prone to errors through an interpretable and structured process. It then employs an efficient slice enumeration algorithm to systematically identify error slices, overcoming the combinatorial challenges that arise during slice exploration. Additionally, HiBug2 extends its capabilities by predicting error slices beyond the validation set, addressing a key limitation of prior approaches. Extensive experiments across multiple domains, including image classification, pose estimation, and object detection - show that HiBug2 not only improves the coherence and precision of identified error slices but also significantly enhances the model repair capabilities.

None
Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025-03-03
Show

Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces the challenge of high inference costs due to processing numerous patches from gigapixel whole slide images (WSIs). To address this, we propose HDMIL, a hierarchical distillation multi-instance learning framework that achieves fast and accurate classification by eliminating irrelevant patches. HDMIL consists of two key components: the dynamic multi-instance network (DMIN) and the lightweight instance pre-screening network (LIPN). DMIN operates on high-resolution WSIs, while LIPN operates on the corresponding low-resolution counterparts. During training, DMIN are trained for WSI classification while generating attention-score-based masks that indicate irrelevant patches. These masks then guide the training of LIPN to predict the relevance of each low-resolution patch. During testing, LIPN first determines the useful regions within low-resolution WSIs, which indirectly enables us to eliminate irrelevant regions in high-resolution WSIs, thereby reducing inference time without causing performance degradation. In addition, we further design the first Chebyshev-polynomials-based Kolmogorov-Arnold classifier in computational pathology, which enhances the performance of HDMIL through learnable activation layers. Extensive experiments on three public datasets demonstrate that HDMIL outperforms previous state-of-the-art methods, e.g., achieving improvements of 3.13% in AUC while reducing inference time by 28.6% on the Camelyon16 dataset.

11 pa...

11 pages, 4 figures, accepted by CVPR2025

None
ViKANformer: Embedding Kolmogorov Arnold Networks in Vision Transformers for Pattern-Based Learning 2025-03-03
Show

Vision Transformers (ViTs) have significantly advanced image classification by applying self-attention on patch embeddings. However, the standard MLP blocks in each Transformer layer may not capture complex nonlinear dependencies optimally. In this paper, we propose ViKANformer, a Vision Transformer where we replace the MLP sub-layers with Kolmogorov-Arnold Network (KAN) expansions, including Vanilla KAN, Efficient-KAN, Fast-KAN, SineKAN, and FourierKAN, while also examining a Flash Attention variant. By leveraging the Kolmogorov-Arnold theorem, which guarantees that multivariate continuous functions can be expressed via sums of univariate continuous functions, we aim to boost representational power. Experimental results on MNIST demonstrate that SineKAN, Fast-KAN, and a well-tuned Vanilla KAN can achieve over 97% accuracy, albeit with increased training overhead. This trade-off highlights that KAN expansions may be beneficial if computational cost is acceptable. We detail the expansions, present training/test accuracy and F1/ROC metrics, and provide pseudocode and hyperparameters for reproducibility. Finally, we compare ViKANformer to a simple MLP and a small CNN baseline on MNIST, illustrating the efficiency of Transformer-based methods even on a small-scale dataset.

This ...

This paper represents ongoing research and may be subject to revisions, refinements, and additional experiments in future updates

None
Delving into Out-of-Distribution Detection with Medical Vision-Language Models 2025-03-02
Show

Recent advances in medical vision-language models (VLMs) demonstrate impressive performance in image classification tasks, driven by their strong zero-shot generalization capabilities. However, given the high variability and complexity inherent in medical imaging data, the ability of these models to detect out-of-distribution (OOD) data in this domain remains underexplored. In this work, we conduct the first systematic investigation into the OOD detection potential of medical VLMs. We evaluate state-of-the-art VLM-based OOD detection methods across a diverse set of medical VLMs, including both general and domain-specific purposes. To accurately reflect real-world challenges, we introduce a cross-modality evaluation pipeline for benchmarking full-spectrum OOD detection, rigorously assessing model robustness against both semantic shifts and covariate shifts. Furthermore, we propose a novel hierarchical prompt-based method that significantly enhances OOD detection performance. Extensive experiments are conducted to validate the effectiveness of our approach. The codes are available at https://github.com/PyJulie/Medical-VLMs-OOD-Detection.

Code Link
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations 2025-03-02
Show

Despite significant progress in Vision-Language Pre-training (VLP), current approaches predominantly emphasize feature extraction and cross-modal comprehension, with limited attention to generating or transforming visual content. This gap hinders the model's ability to synthesize coherent and novel visual representations from textual prompts, thereby reducing the effectiveness of multi-modal learning. In this work, we propose MedUnifier, a unified VLP framework tailored for medical data. MedUnifier seamlessly integrates text-grounded image generation capabilities with multi-modal learning strategies, including image-text contrastive alignment, image-text matching and image-grounded text generation. Unlike traditional methods that reply on continuous visual representations, our approach employs visual vector quantization, which not only facilitates a more cohesive learning strategy for cross-modal understanding but also enhances multi-modal generation quality by effectively leveraging discrete representations. Our framework's effectiveness is evidenced by the experiments on established benchmarks, including uni-modal tasks (supervised fine-tuning), cross-modal tasks (image-text retrieval and zero-shot image classification), and multi-modal tasks (medical report generation, image synthesis), where it achieves state-of-the-art performance across various tasks. MedUnifier also offers a highly adaptable tool for a wide range of language and vision tasks in healthcare, marking advancement toward the development of a generalizable AI model for medical applications.

To be...

To be pubilshed in CVPR 2025

None
AMUN: Adversarial Machine UNlearning 2025-03-02
Show

Machine unlearning, where users can request the deletion of a forget dataset, is becoming increasingly important because of numerous privacy regulations. Initial works on exact'' unlearning (e.g., retraining) incur large computational overheads. However, while computationally inexpensive, approximate'' methods have fallen short of reaching the effectiveness of exact unlearning: models produced fail to obtain comparable accuracy and prediction confidence on both the forget and test (i.e., unseen) dataset. Exploiting this observation, we propose a new unlearning method, Adversarial Machine UNlearning (AMUN), that outperforms prior state-of-the-art (SOTA) methods for image classification. AMUN lowers the confidence of the model on the forget samples by fine-tuning the model on their corresponding adversarial examples. Adversarial examples naturally belong to the distribution imposed by the model on the input space; fine-tuning the model on the adversarial examples closest to the corresponding forget samples (a) localizes the changes to the decision boundary of the model around each forget sample and (b) avoids drastic changes to the global behavior of the model, thereby preserving the model's accuracy on test samples. Using AMUN for unlearning a random $10%$ of CIFAR-10 samples, we observe that even SOTA membership inference attacks cannot do better than random guessing.

None
Few-Class Arena: A Benchmark for Efficient Selection of Vision Models and Dataset Difficulty Measurement 2025-03-02
Show

We propose Few-Class Arena (FCA), as a unified benchmark with focus on testing efficient image classification models for few classes. A wide variety of benchmark datasets with many classes (80-1000) have been created to assist Computer Vision architectural evolution. An increasing number of vision models are evaluated with these many-class datasets. However, real-world applications often involve substantially fewer classes of interest (2-10). This gap between many and few classes makes it difficult to predict performance of the few-class applications using models trained on the available many-class datasets. To date, little has been offered to evaluate models in this Few-Class Regime. We conduct a systematic evaluation of the ResNet family trained on ImageNet subsets from 2 to 1000 classes, and test a wide spectrum of Convolutional Neural Networks and Transformer architectures over ten datasets by using our newly proposed FCA tool. Furthermore, to aid an up-front assessment of dataset difficulty and a more efficient selection of models, we incorporate a difficulty measure as a function of class similarity. FCA offers a new tool for efficient machine learning in the Few-Class Regime, with goals ranging from a new efficient class similarity proposal, to lightweight model architecture design, to a new scaling law. FCA is user-friendly and can be easily extended to new models and datasets, facilitating future research work. Our benchmark is available at https://github.com/bryanbocao/fca.

10 pa...

10 pages, 32 pages including References and Appendix, 19 figures, 8 tables

Code Link
Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization 2025-03-02
Show

Deep neural networks (DNNs) have brought significant advancements in various applications in recent years, such as image recognition, speech recognition, and natural language processing. In particular, Vision Transformers (ViTs) have emerged as a powerful class of models in the field of deep learning for image classification. In this work, we propose a novel Random Matrix Theory (RMT)-based method for pruning pre-trained DNNs, based on the sparsification of weights and singular vectors, and apply it to ViTs. RMT provides a robust framework to analyze the statistical properties of large matrices, which has been shown to be crucial for understanding and optimizing the performance of DNNs. We demonstrate that our RMT-based pruning can be used to reduce the number of parameters of ViT models (trained on ImageNet) by 30-50% with less than 1% loss in accuracy. To our knowledge, this represents the state-of-the-art in pruning for these ViT models. Furthermore, we provide a rigorous mathematical underpinning of the above numerical studies, namely we proved a theorem for fully connected DNNs, and other more general DNN structures, describing how the randomness in the weight matrices of a DNN decreases as the weights approach a local or global minimum (during training). We verify this theorem through numerical experiments on fully connected DNNs, providing empirical support for our theoretical findings. Moreover, we prove a theorem that describes how DNN loss decreases as we remove randomness in the weight layers, and show a monotone dependence of the decrease in loss with the amount of randomness that we remove. Our results also provide significant RMT-based insights into the role of regularization during training and pruning.

None
Parameter Expanded Stochastic Gradient Markov Chain Monte Carlo 2025-03-02
Show

Bayesian Neural Networks (BNNs) provide a promising framework for modeling predictive uncertainty and enhancing out-of-distribution robustness (OOD) by estimating the posterior distribution of network parameters. Stochastic Gradient Markov Chain Monte Carlo (SGMCMC) is one of the most powerful methods for scalable posterior sampling in BNNs, achieving efficiency by combining stochastic gradient descent with second-order Langevin dynamics. However, SGMCMC often suffers from limited sample diversity in practice, which affects uncertainty estimation and model performance. We propose a simple yet effective approach to enhance sample diversity in SGMCMC without the need for tempering or running multiple chains. Our approach reparameterizes the neural network by decomposing each of its weight matrices into a product of matrices, resulting in a sampling trajectory that better explores the target parameter space. This approach produces a more diverse set of samples, allowing faster mixing within the same computational budget. Notably, our sampler achieves these improvements without increasing the inference cost compared to the standard SGMCMC. Extensive experiments on image classification tasks, including OOD robustness, diversity, loss surface analyses, and a comparative study with Hamiltonian Monte Carlo, demonstrate the superiority of the proposed approach.

None
Transformer Meets Twicing: Harnessing Unattended Residual Information 2025-03-02
Show

Transformer-based deep learning models have achieved state-of-the-art performance across numerous language and vision tasks. While the self-attention mechanism, a core component of transformers, has proven capable of handling complex data patterns, it has been observed that the representational capacity of the attention matrix degrades significantly across transformer layers, thereby hurting its overall performance. In this work, we leverage the connection between self-attention computations and low-pass non-local means (NLM) smoothing filters and propose the Twicing Attention, a novel attention mechanism that uses kernel twicing procedure in nonparametric regression to alleviate the low-pass behavior of associated NLM smoothing with compelling theoretical guarantees and enhanced adversarial robustness. This approach enables the extraction and reuse of meaningful information retained in the residuals following the imperfect smoothing operation at each layer. Our proposed method offers two key advantages over standard self-attention: 1) a provably slower decay of representational capacity and 2) improved robustness and accuracy across various data modalities and tasks. We empirically demonstrate the performance gains of our model over baseline transformers on multiple tasks and benchmarks, including image classification and language modeling, on both clean and corrupted data.

None
HDKD: Hybrid Data-Efficient Knowledge Distillation Network for Medical Image Classification 2025-03-01
Show

Vision Transformers (ViTs) have achieved significant advancement in computer vision tasks due to their powerful modeling capacity. However, their performance notably degrades when trained with insufficient data due to lack of inherent inductive biases. Distilling knowledge and inductive biases from a Convolutional Neural Network (CNN) teacher has emerged as an effective strategy for enhancing the generalization of ViTs on limited datasets. Previous approaches to Knowledge Distillation (KD) have pursued two primary paths: some focused solely on distilling the logit distribution from CNN teacher to ViT student, neglecting the rich semantic information present in intermediate features due to the structural differences between them. Others integrated feature distillation along with logit distillation, yet this introduced alignment operations that limits the amount of knowledge transferred due to mismatched architectures and increased the computational overhead. To this end, this paper presents Hybrid Data-efficient Knowledge Distillation (HDKD) paradigm which employs a CNN teacher and a hybrid student. The choice of hybrid student serves two main aspects. First, it leverages the strengths of both convolutions and transformers while sharing the convolutional structure with the teacher model. Second, this shared structure enables the direct application of feature distillation without any information loss or additional computational overhead. Additionally, we propose an efficient light-weight convolutional block named Mobile Channel-Spatial Attention (MBCSA), which serves as the primary convolutional block in both teacher and student models. Extensive experiments on two medical public datasets showcase the superiority of HDKD over other state-of-the-art models and its computational efficiency. Source code at: https://github.com/omarsherif200/HDKD

Code Link
Provably Reliable Conformal Prediction Sets in the Presence of Data Poisoning 2025-03-01
Show

Conformal prediction provides model-agnostic and distribution-free uncertainty quantification through prediction sets that are guaranteed to include the ground truth with any user-specified probability. Yet, conformal prediction is not reliable under poisoning attacks where adversaries manipulate both training and calibration data, which can significantly alter prediction sets in practice. As a solution, we propose reliable prediction sets (RPS): the first efficient method for constructing conformal prediction sets with provable reliability guarantees under poisoning. To ensure reliability under training poisoning, we introduce smoothed score functions that reliably aggregate predictions of classifiers trained on distinct partitions of the training data. To ensure reliability under calibration poisoning, we construct multiple prediction sets, each calibrated on distinct subsets of the calibration data. We then aggregate them into a majority prediction set, which includes a class only if it appears in a majority of the individual sets. Both proposed aggregations mitigate the influence of datapoints in the training and calibration data on the final prediction set. We experimentally validate our approach on image classification tasks, achieving strong reliability while maintaining utility and preserving coverage on clean data. Overall, our approach represents an important step towards more trustworthy uncertainty quantification in the presence of data poisoning.

Accep...

Accepted at ICLR 2025 (Spotlight)

None
Model-Agnostic Meta-Policy Optimization via Zeroth-Order Estimation: A Linear Quadratic Regulator Perspective 2025-03-01
Show

Meta-learning has been proposed as a promising machine learning topic in recent years, with important applications to image classification, robotics, computer games, and control systems. In this paper, we study the problem of using meta-learning to deal with uncertainty and heterogeneity in ergodic linear quadratic regulators. We integrate the zeroth-order optimization technique with a typical meta-learning method, proposing an algorithm that omits the estimation of policy Hessian, which applies to tasks of learning a set of heterogeneous but similar linear dynamic systems. The induced meta-objective function inherits important properties of the original cost function when the set of linear dynamic systems are meta-learnable, allowing the algorithm to optimize over a learnable landscape without projection onto the feasible set. We provide stability and convergence guarantees for the exact gradient descent process by analyzing the boundedness and local smoothness of the gradient for the meta-objective, which justify the proposed algorithm with gradient estimation error being small. We provide the sample complexity conditions for these theoretical guarantees, as well as a numerical example at the end to corroborate this perspective.

None
A Survey of Adversarial Defenses in Vision-based Systems: Categorization, Methods and Challenges 2025-03-01
Show

Adversarial attacks have emerged as a major challenge to the trustworthy deployment of machine learning models, particularly in computer vision applications. These attacks have a varied level of potency and can be implemented in both white box and black box approaches. Practical attacks include methods to manipulate the physical world and enforce adversarial behaviour by the corresponding target neural network models. Multiple different approaches to mitigate different kinds of such attacks are available in the literature, each with their own advantages and limitations. In this survey, we present a comprehensive systematization of knowledge on adversarial defenses, focusing on two key computer vision tasks: image classification and object detection. We review the state-of-the-art adversarial defense techniques and categorize them for easier comparison. In addition, we provide a schematic representation of these categories within the context of the overall machine learning pipeline, facilitating clearer understanding and benchmarking of defenses. Furthermore, we map these defenses to the types of adversarial attacks and datasets where they are most effective, offering practical insights for researchers and practitioners. This study is necessary for understanding the scope of how the available defenses are able to address the adversarial threats, and their shortcomings as well, which is necessary for driving the research in this area in the most appropriate direction, with the aim of building trustworthy AI systems for regular practical use-cases.

None
Few-shot crack image classification using clip based on bayesian optimization 2025-03-01
Show

This study proposes a novel few-shot crack image classification model based on CLIP and Bayesian optimization. By combining multimodal information and Bayesian approach, the model achieves efficient classification of crack images in a small number of training samples. The CLIP model employs its robust feature extraction capabilities to facilitate precise classification with a limited number of samples. In contrast, Bayesian optimisation enhances the robustness and generalization of the model, while reducing the reliance on extensive labelled data. The results demonstrate that the model exhibits robust performance across a diverse range of dataset scales, particularly in the context of small sample sets. The study validates the potential of the method in civil engineering crack classification.

5 pag...

5 pages, 5 figures, 3 tables, submit to the 1st International Workshop on Bayesian Approach in Civil Engineering (IWOBA 2025)

None
AI-Augmented Thyroid Scintigraphy for Robust Classification 2025-03-01
Show

Thyroid scintigraphy is a key imaging modality for diagnosing thyroid disorders. Deep learning models for thyroid scintigraphy classification often face challenges due to limited and imbalanced datasets, leading to suboptimal generalization. In this study, we investigate the effectiveness of different data augmentation techniques including Stable Diffusion (SD), Flow Matching (FM), and Conventional Augmentation (CA) to enhance the performance of a ResNet18 classifier for thyroid condition classification. Our results showed that FM-based augmentation consistently outperforms SD-based approaches, particularly when combined with original (O) data and CA (O+FM+CA), achieving both high accuracy and fair classification across Diffuse Goiter (DG), Nodular Goiter (NG), Normal (NL), and Thyroiditis (TI) cases. The Wilcoxon statistical analysis further validated the superiority of O+FM and its variants (O+FM+CA) over SD-based augmentations in most scenarios. These findings highlight the potential of FM-based augmentation as a superior approach for generating high-quality synthetic thyroid scintigraphy images and improving model generalization in medical image classification.

None
UAV-Assisted Real-Time Disaster Detection Using Optimized Transformer Model 2025-02-28
Show

Dangerous surroundings and difficult-to-reach landscapes introduce significant complications for adequate disaster management and recuperation. These problems can be solved by engaging unmanned aerial vehicles (UAVs) provided with embedded platforms and optical sensors. In this work, we focus on enabling onboard aerial image processing to ensure proper and real-time disaster detection. Such a setting usually causes challenges due to the limited hardware resources of UAVs. However, privacy, connectivity, and latency issues can be avoided. We suggest a UAV-assisted edge framework for disaster detection, leveraging our proposed model optimized for onboard real-time aerial image classification. The optimization of the model is achieved using post-training quantization techniques. To address the limited number of disaster cases in existing benchmark datasets and therefore ensure real-world adoption of our model, we construct a novel dataset, DisasterEye, featuring disaster scenes captured by UAVs and individuals on-site. Experimental results reveal the efficacy of our model, reaching high accuracy with lowered inference latency and memory use on both traditional machines and resource-limited devices. This shows that the scalability and adaptability of our method make it a powerful solution for real-time disaster management on resource-constrained UAV platforms.

None
Exploring the Impact of Temperature Scaling in Softmax for Classification and Adversarial Robustness 2025-02-28
Show

The softmax function is a fundamental component in deep learning. This study delves into the often-overlooked parameter within the softmax function, known as "temperature," providing novel insights into the practical and theoretical aspects of temperature scaling for image classification. Our empirical studies, adopting convolutional neural networks and transformers on multiple benchmark datasets, reveal that moderate temperatures generally introduce better overall performance. Through extensive experiments and rigorous theoretical analysis, we explore the role of temperature scaling in model training and unveil that temperature not only influences learning step size but also shapes the model's optimization direction. Moreover, for the first time, we discover a surprising benefit of elevated temperatures: enhanced model robustness against common corruption, natural perturbation, and non-targeted adversarial attacks like Projected Gradient Descent. We extend our discoveries to adversarial training, demonstrating that, compared to the standard softmax function with the default temperature value, higher temperatures have the potential to enhance adversarial training. The insights of this work open new avenues for improving model performance and security in deep learning applications.

None
In-Model Merging for Enhancing the Robustness of Medical Imaging Classification Models 2025-02-27
Show

Model merging is an effective strategy to merge multiple models for enhancing model performances, and more efficient than ensemble learning as it will not introduce extra computation into inference. However, limited research explores if the merging process can occur within one model and enhance the model's robustness, which is particularly critical in the medical image domain. In the paper, we are the first to propose in-model merging (InMerge), a novel approach that enhances the model's robustness by selectively merging similar convolutional kernels in the deep layers of a single convolutional neural network (CNN) during the training process for classification. We also analytically reveal important characteristics that affect how in-model merging should be performed, serving as an insightful reference for the community. We demonstrate the feasibility and effectiveness of this technique for different CNN architectures on 4 prevalent datasets. The proposed InMerge-trained model surpasses the typically-trained model by a substantial margin. The code will be made public.

None
DECO: Unleashing the Potential of ConvNets for Query-based Detection and Segmentation 2025-02-27
Show

Transformer and its variants have shown great potential for various vision tasks in recent years, including image classification, object detection and segmentation. Meanwhile, recent studies also reveal that with proper architecture design, convolutional networks (ConvNets) also achieve competitive performance with transformers. However, no prior methods have explored to utilize pure convolution to build a Transformer-style Decoder module, which is essential for Encoder-Decoder architecture like Detection Transformer (DETR). To this end, in this paper we explore whether we could build query-based detection and segmentation framework with ConvNets instead of sophisticated transformer architecture. We propose a novel mechanism dubbed InterConv to perform interaction between object queries and image features via convolutional layers. Equipped with the proposed InterConv, we build Detection ConvNet (DECO), which is composed of a backbone and convolutional encoder-decoder architecture. We compare the proposed DECO against prior detectors on the challenging COCO benchmark. Despite its simplicity, our DECO achieves competitive performance in terms of detection accuracy and running speed. Specifically, with the ResNet-18 and ResNet-50 backbone, our DECO achieves $40.5%$ and $47.8%$ AP with $66$ and $34$ FPS, respectively. The proposed method is also evaluated on the segment anything task, demonstrating similar performance and higher efficiency. We hope the proposed method brings another perspective for designing architectures for vision tasks. Codes are available at https://github.com/xinghaochen/DECO and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/DECO.

ICLR 2025 Code Link
QPM: Discrete Optimization for Globally Interpretable Image Classification 2025-02-27
Show

Understanding the classifications of deep neural networks, e.g. used in safety-critical situations, is becoming increasingly important. While recent models can locally explain a single decision, to provide a faithful global explanation about an accurate model's general behavior is a more challenging open task. Towards that goal, we introduce the Quadratic Programming Enhanced Model (QPM), which learns globally interpretable class representations. QPM represents every class with a binary assignment of very few, typically 5, features, that are also assigned to other classes, ensuring easily comparable contrastive class representations. This compact binary assignment is found using discrete optimization based on predefined similarity measures and interpretability constraints. The resulting optimal assignment is used to fine-tune the diverse features, so that each of them becomes the shared general concept between the assigned classes. Extensive evaluations show that QPM delivers unprecedented global interpretability across small and large-scale datasets while setting the state of the art for the accuracy of interpretable models.

None
GPT4Image: Large Pre-trained Models Help Vision Models Learn Better on Perception Task 2025-02-27
Show

The upsurge in pre-trained large models started by ChatGPT has swept across the entire deep learning community. Such powerful models demonstrate advanced generative ability and multimodal understanding capability, which quickly set new state of the arts on a variety of benchmarks. The pre-trained LLM usually plays the role as a universal AI model that can conduct various tasks like article analysis and image comprehension. However, due to the prohibitively high memory and computational cost of implementing such a large model, the conventional models (such as CNN and ViT) are still essential for many visual perception tasks. In this paper, we propose to enhance the representation ability of ordinary vision models on perception tasks (e.g. image classification) by taking advantage of the off-the-shelf large pre-trained models. We present a new learning framework, dubbed GPT4Image, where the knowledge of the large pre-trained models are extracted to help CNNs and ViTs learn better representations and achieve higher performance. Firstly, we curate a high quality description set by prompting a multimodal LLM to generate descriptions for training images. Then, these detailed descriptions are fed into a pre-trained encoder to extract text embeddings that encodes the rich semantics of images. During training, text embeddings will serve as extra supervising signal and be aligned with image representations learned by vision models. The alignment process helps vision models achieve better performance with the aid of pre-trained LLMs. We conduct extensive experiments to verify the effectiveness of the proposed algorithm on various visual perception tasks for heterogeneous model architectures.

GitHu...

GitHub: https://github.com/huawei-noah/Efficient-Computing/tree/master/GPT4Image/

Code Link
InPK: Infusing Prior Knowledge into Prompt for Vision-Language Models 2025-02-27
Show

Prompt tuning has become a popular strategy for adapting Vision-Language Models (VLMs) to zero/few-shot visual recognition tasks. Some prompting techniques introduce prior knowledge due to its richness, but when learnable tokens are randomly initialized and disconnected from prior knowledge, they tend to overfit on seen classes and struggle with domain shifts for unseen ones. To address this issue, we propose the InPK model, which infuses class-specific prior knowledge into the learnable tokens during initialization, thus enabling the model to explicitly focus on class-relevant information. Furthermore, to mitigate the weakening of class information by multi-layer encoders, we continuously reinforce the interaction between learnable tokens and prior knowledge across multiple feature levels. This progressive interaction allows the learnable tokens to better capture the fine-grained differences and universal visual concepts within prior knowledge, enabling the model to extract more discriminative and generalized text features. Even for unseen classes, the learned interaction allows the model to capture their common representations and infer their appropriate positions within the existing semantic structure. Moreover, we introduce a learnable text-to-vision projection layer to accommodate the text adjustments, ensuring better alignment of visual-text semantics. Extensive experiments on 11 recognition datasets show that InPK significantly outperforms state-of-the-art methods in multiple zero/few-shot image classification tasks.

None
Learning Mask Invariant Mutual Information for Masked Image Modeling 2025-02-27
Show

Masked autoencoders (MAEs) represent a prominent self-supervised learning paradigm in computer vision. Despite their empirical success, the underlying mechanisms of MAEs remain insufficiently understood. Recent studies have attempted to elucidate the functioning of MAEs through contrastive learning and feature representation analysis, yet these approaches often provide only implicit insights. In this paper, we propose a new perspective for understanding MAEs by leveraging the information bottleneck principle in information theory. Our theoretical analyses reveal that optimizing the latent features to balance relevant and irrelevant information is key to improving MAE performance. Building upon our proofs, we introduce MI-MAE, a novel method that optimizes MAEs through mutual information maximization and minimization. By enhancing latent features to retain maximal relevant information between them and the output, and minimizing irrelevant information between them and the input, our approach achieves better performance. Extensive experiments on standard benchmarks show that MI-MAE significantly outperforms MAE models in tasks such as image classification, object detection, and semantic segmentation. Our findings validate the theoretical framework and highlight the practical advantages of applying the information bottleneck principle to MAEs, offering deeper insights for developing more powerful self-supervised learning models.

ICLR 2025 None
LiD-FL: Towards List-Decodable Federated Learning 2025-02-27
Show

Federated learning is often used in environments with many unverified participants. Therefore, federated learning under adversarial attacks receives significant attention. This paper proposes an algorithmic framework for list-decodable federated learning, where a central server maintains a list of models, with at least one guaranteed to perform well. The framework has no strict restriction on the fraction of honest workers, extending the applicability of Byzantine federated learning to the scenario with more than half adversaries. Under proper assumptions on the loss function, we prove a convergence theorem for our method. Experimental results, including image classification tasks with both convex and non-convex losses, demonstrate that the proposed algorithm can withstand the malicious majority under various attacks.

26 pages, 5 figures None
Spatial-Spectral Diffusion Contrastive Representation Network for Hyperspectral Image Classification 2025-02-27
Show

Although efficient extraction of discriminative spatial-spectral features is critical for hyperspectral images classification (HSIC), it is difficult to achieve these features due to factors such as the spatial-spectral heterogeneity and noise effect. This paper presents a Spatial-Spectral Diffusion Contrastive Representation Network (DiffCRN), based on denoising diffusion probabilistic model (DDPM) combined with contrastive learning (CL) for HSIC, with the following characteristics. First,to improve spatial-spectral feature representation, instead of adopting the UNets-like structure which is widely used for DDPM, we design a novel staged architecture with spatial self-attention denoising module (SSAD) and spectral group self-attention denoising module (SGSAD) in DiffCRN with improved efficiency for spectral-spatial feature learning. Second, to improve unsupervised feature learning efficiency, we design new DDPM model with logarithmic absolute error (LAE) loss and CL that improve the loss function effectiveness and increase the instance-level and inter-class discriminability. Third, to improve feature selection, we design a learnable approach based on pixel-level spectral angle mapping (SAM) for the selection of time steps in the proposed DDPM model in an adaptive and automatic manner. Last, to improve feature integration and classification, we design an Adaptive weighted addition modul (AWAM) and Cross time step Spectral-Spatial Fusion Module (CTSSFM) to fuse time-step-wise features and perform classification. Experiments conducted on widely used four HSI datasets demonstrate the improved performance of the proposed DiffCRN over the classical backbone models and state-of-the-art GAN, transformer models and other pretrained methods. The source code and pre-trained model will be made available publicly.

None
A Fusion Model for Artwork Identification Based on Convolutional Neural Networks and Transformers 2025-02-27
Show

The identification of artwork is crucial in areas like cultural heritage protection, art market analysis, and historical research. With the advancement of deep learning, Convolutional Neural Networks (CNNs) and Transformer models have become key tools for image classification. While CNNs excel in local feature extraction, they struggle with global context, and Transformers are strong in capturing global dependencies but weak in fine-grained local details. To address these challenges, this paper proposes a fusion model combining CNNs and Transformers for artwork identification. The model first extracts local features using CNNs, then captures global context with a Transformer, followed by a feature fusion mechanism to enhance classification accuracy. Experiments on Chinese and oil painting datasets show the fusion model outperforms individual CNN and Transformer models, improving classification accuracy by 9.7% and 7.1%, respectively, and increasing F1 scores by 0.06 and 0.05. The results demonstrate the model's effectiveness and potential for future improvements, such as multimodal integration and architecture optimization.

None
A Residual Multi-task Network for Joint Classification and Regression in Medical Imaging 2025-02-27
Show

Detection and classification of pulmonary nodules is a challenge in medical image analysis due to the variety of shapes and sizes of nodules and their high concealment. Despite the success of traditional deep learning methods in image classification, deep networks still struggle to perfectly capture subtle changes in lung nodule detection. Therefore, we propose a residual multi-task network (Res-MTNet) model, which combines multi-task learning and residual learning, and improves feature representation ability by sharing feature extraction layer and introducing residual connections. Multi-task learning enables the model to handle multiple tasks simultaneously, while the residual module solves the problem of disappearing gradients, ensuring stable training of deeper networks and facilitating information sharing between tasks. Res-MTNet enhances the robustness and accuracy of the model, providing a more reliable lung nodule analysis tool for clinical medicine and telemedicine.

None
Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies 2025-02-26
Show

In recent years, vision transformers (ViTs) have emerged as powerful and promising techniques for computer vision tasks such as image classification, object detection, and segmentation. Unlike convolutional neural networks (CNNs), which rely on hierarchical feature extraction, ViTs treat images as sequences of patches and leverage self-attention mechanisms. However, their high computational complexity and memory demands pose significant challenges for deployment on resource-constrained edge devices. To address these limitations, extensive research has focused on model compression techniques and hardware-aware acceleration strategies. Nonetheless, a comprehensive review that systematically categorizes these techniques and their trade-offs in accuracy, efficiency, and hardware adaptability for edge deployment remains lacking. This survey bridges this gap by providing a structured analysis of model compression techniques, software tools for inference on edge, and hardware acceleration strategies for ViTs. We discuss their impact on accuracy, efficiency, and hardware adaptability, highlighting key challenges and emerging research directions to advance ViT deployment on edge platforms, including graphics processing units (GPUs), tensor processing units (TPUs), and field-programmable gate arrays (FPGAs). The goal is to inspire further research with a contemporary guide on optimizing ViTs for efficient deployment on edge devices.

None
I Know What I Don't Know: Improving Model Cascades Through Confidence Tuning 2025-02-26
Show

Large-scale machine learning models deliver strong performance across a wide range of tasks but come with significant computational and resource constraints. To mitigate these challenges, local smaller models are often deployed alongside larger models, relying on routing and deferral mechanisms to offload complex tasks. However, existing approaches inadequately balance the capabilities of these models, often resulting in unnecessary deferrals or sub-optimal resource usage. In this work we introduce a novel loss function called Gatekeeper for calibrating smaller models in cascade setups. Our approach fine-tunes the smaller model to confidently handle tasks it can perform correctly while deferring complex tasks to the larger model. Moreover, it incorporates a mechanism for managing the trade-off between model performance and deferral accuracy, and is broadly applicable across various tasks and domains without any architectural changes. We evaluate our method on encoder-only, decoder-only, and encoder-decoder architectures. Experiments across image classification, language modeling, and vision-language tasks show that our approach substantially improves deferral performance.

None
Spatial-Mamba: Effective Visual State Space Models via Structure-aware State Fusion 2025-02-26
Show

Selective state space models (SSMs), such as Mamba, highly excel at capturing long-range dependencies in 1D sequential data, while their applications to 2D vision tasks still face challenges. Current visual SSMs often convert images into 1D sequences and employ various scanning patterns to incorporate local spatial dependencies. However, these methods are limited in effectively capturing the complex image spatial structures and the increased computational cost caused by the lengthened scanning paths. To address these limitations, we propose Spatial-Mamba, a novel approach that establishes neighborhood connectivity directly in the state space. Instead of relying solely on sequential state transitions, we introduce a structure-aware state fusion equation, which leverages dilated convolutions to capture image spatial structural dependencies, significantly enhancing the flow of visual contextual information. Spatial-Mamba proceeds in three stages: initial state computation in a unidirectional scan, spatial context acquisition through structure-aware state fusion, and final state computation using the observation equation. Our theoretical analysis shows that Spatial-Mamba unifies the original Mamba and linear attention under the same matrix multiplication framework, providing a deeper understanding of our method. Experimental results demonstrate that Spatial-Mamba, even with a single scan, attains or surpasses the state-of-the-art SSM-based models in image classification, detection and segmentation. Source codes and trained models can be found at https://github.com/EdwardChasel/Spatial-Mamba.

Accep...

Accepted by ICLR 2025

Code Link
Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment 2025-02-26
Show

While Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for Large Language Models (LLMs), its performance often falls short of Full Fine-Tuning (Full FT). Current methods optimize LoRA by initializing with static singular value decomposition (SVD) subsets, leading to suboptimal leveraging of pre-trained knowledge. Another path for improving LoRA is incorporating a Mixture-of-Experts (MoE) architecture. However, weight misalignment and complex gradient dynamics make it challenging to adopt SVD prior to the LoRA MoE architecture. To mitigate these issues, we propose \underline{G}reat L\underline{o}R\underline{A} Mixture-of-Exper\underline{t} (GOAT), a framework that (1) adaptively integrates relevant priors using an SVD-structured MoE, and (2) aligns optimization with full fine-tuned MoE by deriving a theoretical scaling factor. We demonstrate that proper scaling, without modifying the architecture or training algorithms, boosts LoRA MoE's efficiency and performance. Experiments across 25 datasets, including natural language understanding, commonsense reasoning, image classification, and natural language generation, demonstrate GOAT's state-of-the-art performance, closing the gap with Full FT.

None
Dual Selective Fusion Transformer Network for Hyperspectral Image Classification 2025-02-26
Show

Transformer has achieved satisfactory results in the field of hyperspectral image (HSI) classification. However, existing Transformer models face two key challenges when dealing with HSI scenes characterized by diverse land cover types and rich spectral information: (1) A fixed receptive field overlooks the effective contextual scales required by various HSI objects; (2) invalid self-attention features in context fusion affect model performance. To address these limitations, we propose a novel Dual Selective Fusion Transformer Network (DSFormer) for HSI classification. DSFormer achieves joint spatial and spectral contextual modeling by flexibly selecting and fusing features across different receptive fields, effectively reducing unnecessary information interference by focusing on the most relevant spatial-spectral tokens. Specifically, we design a Kernel Selective Fusion Transformer Block (KSFTB) to learn an optimal receptive field by adaptively fusing spatial and spectral features across different scales, enhancing the model's ability to accurately identify diverse HSI objects. Additionally, we introduce a Token Selective Fusion Transformer Block (TSFTB), which strategically selects and combines essential tokens during the spatial-spectral self-attention fusion process to capture the most crucial contexts. Extensive experiments conducted on four benchmark HSI datasets demonstrate that the proposed DSFormer significantly improves land cover classification accuracy, outperforming existing state-of-the-art methods. Specifically, DSFormer achieves overall accuracies of 96.59%, 97.66%, 95.17%, and 94.59% in the Pavia University, Houston, Indian Pines, and Whu-HongHu datasets, respectively, reflecting improvements of 3.19%, 1.14%, 0.91%, and 2.80% over the previous model. The code will be available online at https://github.com/YichuXu/DSFormer.

Accep...

Accepted by Neural Networks 2025

Code Link
Achieving Upper Bound Accuracy of Joint Training in Continual Learning 2025-02-26
Show

Continual learning has been an active research area in machine learning, focusing on incrementally learning a sequence of tasks. A key challenge is catastrophic forgetting (CF), and most research efforts have been directed toward mitigating this issue. However, a significant gap remains between the accuracy achieved by state-of-the-art continual learning algorithms and the ideal or upper-bound accuracy achieved by training all tasks together jointly. This gap has hindered or even prevented the adoption of continual learning in applications, as accuracy is often of paramount importance. Recently, another challenge, termed inter-task class separation (ICS), was also identified, which spurred a theoretical study into principled approaches for solving continual learning. Further research has shown that by leveraging the theory and the power of large foundation models, it is now possible to achieve upper-bound accuracy, which has been empirically validated using both text and image classification datasets. Continual learning is now ready for real-life applications. This paper surveys the main research leading to this achievement, justifies the approach both intuitively and from neuroscience research, and discusses insights gained.

None
Time Traveling to Defend Against Adversarial Example Attacks in Image Classification 2025-02-26
Show

Adversarial example attacks have emerged as a critical threat to machine learning. Adversarial attacks in image classification abuse various, minor modifications to the image that confuse the image classification neural network -- while the image still remains recognizable to humans. One important domain where the attacks have been applied is in the automotive setting with traffic sign classification. Researchers have demonstrated that adding stickers, shining light, or adding shadows are all different means to make machine learning inference algorithms mis-classify the traffic signs. This can cause potentially dangerous situations as a stop sign is recognized as a speed limit sign causing vehicles to ignore it and potentially leading to accidents. To address these attacks, this work focuses on enhancing defenses against such adversarial attacks. This work shifts the advantage to the user by introducing the idea of leveraging historical images and majority voting. While the attacker modifies a traffic sign that is currently being processed by the victim's machine learning inference, the victim can gain advantage by examining past images of the same traffic sign. This work introduces the notion of ''time traveling'' and uses historical Street View images accessible to anybody to perform inference on different, past versions of the same traffic sign. In the evaluation, the proposed defense has 100% effectiveness against latest adversarial example attack on traffic sign classification algorithm.

None
Fall Leaf Adversarial Attack on Traffic Sign Classification 2025-02-26
Show

Adversarial input image perturbation attacks have emerged as a significant threat to machine learning algorithms, particularly in image classification setting. These attacks involve subtle perturbations to input images that cause neural networks to misclassify the input images, even though the images remain easily recognizable to humans. One critical area where adversarial attacks have been demonstrated is in automotive systems where traffic sign classification and recognition is critical, and where misclassified images can cause autonomous systems to take wrong actions. This work presents a new class of adversarial attacks. Unlike existing work that has focused on adversarial perturbations that leverage human-made artifacts to cause the perturbations, such as adding stickers, paint, or shining flashlights at traffic signs, this work leverages nature-made artifacts: tree leaves. By leveraging nature-made artifacts, the new class of attacks has plausible deniability: a fall leaf stuck to a street sign could come from a near-by tree, rather than be placed there by an malicious human attacker. To evaluate the new class of the adversarial input image perturbation attacks, this work analyses how fall leaves can cause misclassification in street signs. The work evaluates various leaves from different species of trees, and considers various parameters such as size, color due to tree leaf type, and rotation. The work demonstrates high success rate for misclassification. The work also explores the correlation between successful attacks and how they affect the edge detection, which is critical in many image classification algorithms.

None
Enhancing Image Classification with Augmentation: Data Augmentation Techniques for Improved Image Classification 2025-02-25
Show

Convolutional Neural Networks (CNNs) serve as the workhorse of deep learning, finding applications in various fields that rely on images. Given sufficient data, they exhibit the capacity to learn a wide range of concepts across diverse settings. However, a notable limitation of CNNs is their susceptibility to overfitting when trained on small datasets. The augmentation of such datasets can significantly enhance CNN performance by introducing additional data points for learning. In this study, we explore the effectiveness of 11 different sets of data augmentation techniques, which include three novel sets proposed in this work. The first set of data augmentation employs pairwise channel transfer, transferring Red, Green, Blue, Hue, and Saturation values from randomly selected images in the database to all images in the dataset. The second set introduces a novel occlusion approach, where objects in the images are occluded by randomly selected objects from the dataset. The third set involves a novel masking approach, using vertical, horizontal, circular, and checkered masks to occlude portions of the images. In addition to these novel techniques, we investigate other existing augmentation methods, including rotation, horizontal and vertical flips, resizing, translation, blur, color jitter, and random erasing, and their effects on accuracy and overfitting. We fine-tune a base EfficientNet-B0 model for each augmentation method and conduct a comparative analysis to showcase their efficacy. For the evaluation and comparison of these augmentation techniques, we utilize the Caltech-101 dataset. The ensemble of image augmentation techniques proposed emerges as the most effective on the Caltech-101 dataset. The results demonstrate that diverse data augmentation techniques present a viable means of enhancing datasets for improved image classification.

None
MedKAN: An Advanced Kolmogorov-Arnold Network for Medical Image Classification 2025-02-25
Show

Recent advancements in deep learning for image classification predominantly rely on convolutional neural networks (CNNs) or Transformer-based architectures. However, these models face notable challenges in medical imaging, particularly in capturing intricate texture details and contextual features. Kolmogorov-Arnold Networks (KANs) represent a novel class of architectures that enhance nonlinear transformation modeling, offering improved representation of complex features. In this work, we present MedKAN, a medical image classification framework built upon KAN and its convolutional extensions. MedKAN features two core modules: the Local Information KAN (LIK) module for fine-grained feature extraction and the Global Information KAN (GIK) module for global context integration. By combining these modules, MedKAN achieves robust feature modeling and fusion. To address diverse computational needs, we introduce three scalable variants--MedKAN-S, MedKAN-B, and MedKAN-L. Experimental results on nine public medical imaging datasets demonstrate that MedKAN achieves superior performance compared to CNN- and Transformer-based models, highlighting its effectiveness and generalizability in medical image analysis.

None
Dual Classification Head Self-training Network for Cross-scene Hyperspectral Image Classification 2025-02-25
Show

Due to the difficulty of obtaining labeled data for hyperspectral images (HSIs), cross-scene classification has emerged as a widely adopted approach in the remote sensing community. It involves training a model using labeled data from a source domain (SD) and unlabeled data from a target domain (TD), followed by inferencing on the TD. However, variations in the reflectance spectrum of the same object between the SD and the TD, as well as differences in the feature distribution of the same land cover class, pose significant challenges to the performance of cross-scene classification. To address this issue, we propose a dual classification head self-training network (DHSNet). This method aligns class-wise features across domains, ensuring that the trained classifier can accurately classify TD data of different classes. We introduce a dual classification head self-training strategy for the first time in the cross-scene HSI classification field. The proposed approach mitigates domain gap while preventing the accumulation of incorrect pseudo-labels in the model. Additionally, we incorporate a novel central feature attention mechanism to enhance the model's capacity to learn scene-invariant features across domains. Experimental results on three cross-scene HSI datasets demonstrate that the proposed DHSNET significantly outperforms other state-of-the-art approaches. The code for DHSNet will be available at https://github.com/liurongwhm.

None
AUKT: Adaptive Uncertainty-Guided Knowledge Transfer with Conformal Prediction 2025-02-25
Show

Knowledge transfer between teacher and student models has proven effective across various machine learning applications. However, challenges arise when the teacher's predictions are noisy, or the data domain during student training shifts from the teacher's pretraining data. In such scenarios, blindly relying on the teacher's predictions can lead to suboptimal knowledge transfer. To address these challenges, we propose a novel and universal framework, Adaptive Uncertainty-guided Knowledge Transfer ($\textbf{AUKT}$), which leverages Conformal Prediction (CP) to dynamically adjust the student's reliance on the teacher's guidance based on the teacher's prediction uncertainty. CP is a distribution-free, model-agnostic approach that provides reliable prediction sets with statistical coverage guarantees and minimal computational overhead. This adaptive mechanism mitigates the risk of learning undesirable or incorrect knowledge. We validate the proposed framework across diverse applications, including image classification, imitation-guided reinforcement learning, and autonomous driving. Experimental results consistently demonstrate that our approach improves performance, robustness and transferability, offering a promising direction for enhanced knowledge transfer in real-world applications.

None
Attribute-based Visual Reprogramming for Vision-Language Models 2025-02-25
Show

Visual reprogramming (VR) reuses pre-trained vision models for downstream image classification tasks by adding trainable noise patterns to inputs. When applied to vision-language models (e.g., CLIP), existing VR approaches follow the same pipeline used in vision models (e.g., ResNet, ViT), where ground-truth class labels are inserted into fixed text templates to guide the optimization of VR patterns. This label-based approach, however, overlooks the rich information and diverse attribute-guided textual representations that CLIP can exploit, which may lead to the misclassification of samples. In this paper, we propose Attribute-based Visual Reprogramming (AttrVR) for CLIP, utilizing descriptive attributes (DesAttrs) and distinctive attributes (DistAttrs), which respectively represent common and unique feature descriptions for different classes. Besides, as images of the same class may reflect different attributes after VR, AttrVR iteratively refines patterns using the $k$-nearest DesAttrs and DistAttrs for each image sample, enabling more dynamic and sample-specific optimization. Theoretically, AttrVR is shown to reduce intra-class variance and increase inter-class separation. Empirically, it achieves superior performance in 12 downstream tasks for both ViT-based and ResNet-based CLIP. The success of AttrVR facilitates more effective integration of VR from unimodal vision models into vision-language models. Our code is available at https://github.com/tmlr-group/AttrVR.

Code Link
Can Score-Based Generative Modeling Effectively Handle Medical Image Classification? 2025-02-24
Show

The remarkable success of deep learning in recent years has prompted applications in medical image classification and diagnosis tasks. While classification models have demonstrated robustness in classifying simpler datasets like MNIST or natural images such as ImageNet, this resilience is not consistently observed in complex medical image datasets where data is more scarce and lacks diversity. Moreover, previous findings on natural image datasets have indicated a potential trade-off between data likelihood and classification accuracy. In this study, we explore the use of score-based generative models as classifiers for medical images, specifically mammographic images. Our findings suggest that our proposed generative classifier model not only achieves superior classification results on CBIS-DDSM, INbreast and Vin-Dr Mammo datasets, but also introduces a novel approach to image classification in a broader context. Our code is publicly available at https://github.com/sushmitasarker/sgc_for_medical_image_classification

Accep...

Accepted at the International Symposium on Biomedical Imaging (ISBI) 2025

Code Link
A Priori Generalizability Estimate for a CNN 2025-02-24
Show

We formulate truncated singular value decompositions of entire convolutional neural networks. We demonstrate the computed left and right singular vectors are useful in identifying which images the convolutional neural network is likely to perform poorly on. To create this diagnostic tool, we define two metrics: the Right Projection Ratio and the Left Projection Ratio. The Right (Left) Projection Ratio evaluates the fidelity of the projection of an image (label) onto the computed right (left) singular vectors. We observe that both ratios are able to identify the presence of class imbalance for an image classification problem. Additionally, the Right Projection Ratio, which only requires unlabeled data, is found to be correlated to the model's performance when applied to image segmentation. This suggests the Right Projection Ratio could be a useful metric to estimate how likely the model is to perform well on a sample.

None
Neural Attention: A Novel Mechanism for Enhanced Expressive Power in Transformer Models 2025-02-24
Show

Transformer models typically calculate attention matrices using dot products, which have limitations when capturing nonlinear relationships between embedding vectors. We propose Neural Attention, a technique that replaces dot products with feed-forward networks, enabling a more expressive representation of relationships between tokens. This approach modifies only the attention matrix calculation while preserving the matrix dimensions, making it easily adaptable to existing transformer-based architectures. We provide a detailed mathematical justification for why Neural Attention increases representational capacity and conduct controlled experiments to validate this claim. When comparing Neural Attention and Dot-Product Attention, NLP experiments on WikiText-103 show a reduction in perplexity of over 5 percent. Similarly, experiments on CIFAR-10 and CIFAR-100 show comparable improvements for image classification tasks. While Neural Attention introduces higher computational demands, we develop techniques to mitigate these challenges, ensuring practical usability without sacrificing the increased expressivity it provides. This work establishes Neural Attention as an effective means of enhancing the predictive capabilities of transformer models across a variety of applications.

None
Modeling Multi-modal Cross-interaction for Multi-label Few-shot Image Classification Based on Local Feature Selection 2025-02-24
Show

The aim of multi-label few-shot image classification (ML-FSIC) is to assign semantic labels to images, in settings where only a small number of training examples are available for each label. A key feature of the multi-label setting is that an image often has several labels, which typically refer to objects appearing in different regions of the image. When estimating label prototypes, in a metric-based setting, it is thus important to determine which regions are relevant for which labels, but the limited amount of training data and the noisy nature of local features make this highly challenging. As a solution, we propose a strategy in which label prototypes are gradually refined. First, we initialize the prototypes using word embeddings, which allows us to leverage prior knowledge about the meaning of the labels. Second, taking advantage of these initial prototypes, we then use a Loss Change Measurement (LCM) strategy to select the local features from the training images (i.e. the support set) that are most likely to be representative of a given label. Third, we construct the final prototype of the label by aggregating these representative local features using a multi-modal cross-interaction mechanism, which again relies on the initial word embedding-based prototypes. Experiments on COCO, PASCAL VOC, NUS-WIDE, and iMaterialist show that our model substantially improves the current state-of-the-art.

In Tr...

In Transactions on Multimedia Computing Communications and Applications. arXiv admin note: text overlap with arXiv:2112.01037

None
Disentangling Visual Transformers: Patch-level Interpretability for Image Classification 2025-02-24
Show

Visual transformers have achieved remarkable performance in image classification tasks, but this performance gain has come at the cost of interpretability. One of the main obstacles to the interpretation of transformers is the self-attention mechanism, which mixes visual information across the whole image in a complex way. In this paper, we propose Hindered Transformer (HiT), a novel interpretable by design architecture inspired by visual transformers. Our proposed architecture rethinks the design of transformers to better disentangle patch influences at the classification stage. Ultimately, HiT can be interpreted as a linear combination of patch-level information. We show that the advantages of our approach in terms of explicability come with a reasonable trade-off in performance, making it an attractive alternative for applications where interpretability is paramount.

None
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers 2025-02-24
Show

Self-attention in Transformers comes with a high computational cost because of their quadratic computational complexity, but their effectiveness in addressing problems in language and vision has sparked extensive research aimed at enhancing their efficiency. However, diverse experimental conditions, spanning multiple input domains, prevent a fair comparison based solely on reported results, posing challenges for model selection. To address this gap in comparability, we perform a large-scale benchmark of more than 45 models for image classification, evaluating key efficiency aspects, including accuracy, speed, and memory usage. Our benchmark provides a standardized baseline for efficiency-oriented transformers. We analyze the results based on the Pareto front -- the boundary of optimal models. Surprisingly, despite claims of other models being more efficient, ViT remains Pareto optimal across multiple metrics. We observe that hybrid attention-CNN models exhibit remarkable inference memory- and parameter-efficiency. Moreover, our benchmark shows that using a larger model in general is more efficient than using higher resolution images. Thanks to our holistic evaluation, we provide a centralized resource for practitioners and researchers, facilitating informed decisions when selecting or developing efficient transformers.

v3: n...

v3: new models, analysis of scaling behaviors; v4: WACV 2025 camera ready version, appendix added

None
A Transformer-in-Transformer Network Utilizing Knowledge Distillation for Image Recognition 2025-02-24
Show

This paper presents a novel knowledge distillation neural architecture leveraging efficient transformer networks for effective image classification. Natural images display intricate arrangements encompassing numerous extraneous elements. Vision transformers utilize localized patches to compute attention. However, exclusive dependence on patch segmentation proves inadequate in sufficiently encompassing the comprehensive nature of the image. To address this issue, we have proposed an inner-outer transformer-based architecture, which gives attention to the global and local aspects of the image. Moreover, The training of transformer models poses significant challenges due to their demanding resource, time, and data requirements. To tackle this, we integrate knowledge distillation into the architecture, enabling efficient learning. Leveraging insights from a larger teacher model, our approach enhances learning efficiency and effectiveness. Significantly, the transformer-in-transformer network acquires lightweight characteristics by means of distillation conducted within the feature extraction layer. Our featured network's robustness is established through substantial experimentation on the MNIST, CIFAR10, and CIFAR100 datasets, demonstrating commendable top-1 and top-5 accuracy. The conducted ablative analysis comprehensively validates the effectiveness of the chosen parameters and settings, showcasing their superiority against contemporary methodologies. Remarkably, the proposed Transformer-in-Transformer Network (TITN) model achieves impressive performance milestones across various datasets: securing the highest top-1 accuracy of 74.71% and a top-5 accuracy of 92.28% for the CIFAR100 dataset, attaining an unparalleled top-1 accuracy of 92.03% and top-5 accuracy of 99.80% for the CIFAR-10 dataset, and registering an exceptional top-1 accuracy of 99.56% for the MNIST dataset.

None
MOB-GCN: A Novel Multiscale Object-Based Graph Neural Network for Hyperspectral Image Classification 2025-02-22
Show

This paper introduces a novel multiscale object-based graph neural network called MOB-GCN for hyperspectral image (HSI) classification. The central aim of this study is to enhance feature extraction and classification performance by utilizing multiscale object-based image analysis (OBIA). Traditional pixel-based methods often suffer from low accuracy and speckle noise, while single-scale OBIA approaches may overlook crucial information of image objects at different levels of detail. MOB-GCN overcomes these challenges by extracting and integrating features from multiple segmentation scales, leveraging the Multiresolution Graph Network (MGN) architecture to capture both fine-grained and global spatial patterns. MOB-GCN addresses this issue by extracting and integrating features from multiple segmentation scales to improve classification results using the Multiresolution Graph Network (MGN) architecture that can model fine-grained and global spatial patterns. By constructing a dynamic multiscale graph hierarchy, MOB-GCN offers a more comprehensive understanding of the intricate details and global context of HSIs. Experimental results demonstrate that MOB-GCN consistently outperforms single-scale graph convolutional networks (GCNs) in terms of classification accuracy, computational efficiency, and noise reduction, particularly when labeled data is limited. The implementation of MOB-GCN is publicly available at https://github.com/HySonLab/MultiscaleHSI

Code Link
Multi-Teacher Knowledge Distillation with Reinforcement Learning for Visual Recognition 2025-02-22
Show

Multi-teacher Knowledge Distillation (KD) transfers diverse knowledge from a teacher pool to a student network. The core problem of multi-teacher KD is how to balance distillation strengths among various teachers. Most existing methods often develop weighting strategies from an individual perspective of teacher performance or teacher-student gaps, lacking comprehensive information for guidance. This paper proposes Multi-Teacher Knowledge Distillation with Reinforcement Learning (MTKD-RL) to optimize multi-teacher weights. In this framework, we construct both teacher performance and teacher-student gaps as state information to an agent. The agent outputs the teacher weight and can be updated by the return reward from the student. MTKD-RL reinforces the interaction between the student and teacher using an agent in an RL-based decision mechanism, achieving better matching capability with more meaningful weights. Experimental results on visual recognition tasks, including image classification, object detection, and semantic segmentation tasks, demonstrate that MTKD-RL achieves state-of-the-art performance compared to the existing multi-teacher KD works.

AAAI-2025 None
A Multi-Scale Isolation Forest Approach for Real-Time Detection and Filtering of FGSM Adversarial Attacks in Video Streams of Autonomous Vehicles 2025-02-22
Show

Deep Neural Networks (DNNs) have demonstrated remarkable success across a wide range of tasks, particularly in fields such as image classification. However, DNNs are highly susceptible to adversarial attacks, where subtle perturbations are introduced to input images, leading to erroneous model outputs. In today's digital era, ensuring the security and integrity of images processed by DNNs is of critical importance. One of the most prominent adversarial attack methods is the Fast Gradient Sign Method (FGSM), which perturbs images in the direction of the loss gradient to deceive the model. This paper presents a novel approach for detecting and filtering FGSM adversarial attacks in image processing tasks. Our proposed method evaluates 10,000 images, each subjected to five different levels of perturbation, characterized by $\epsilon$ values of 0.01, 0.02, 0.05, 0.1, and 0.2. These perturbations are applied in the direction of the loss gradient. We demonstrate that our approach effectively filters adversarially perturbed images, mitigating the impact of FGSM attacks. The method is implemented in Python, and the source code is publicly available on GitHub for reproducibility and further research.

17 pages, 7 figures None
Directional Gradient Projection for Robust Fine-Tuning of Foundation Models 2025-02-21
Show

Robust fine-tuning aims to adapt large foundation models to downstream tasks while preserving their robustness to distribution shifts. Existing methods primarily focus on constraining and projecting current model towards the pre-trained initialization based on the magnitudes between fine-tuned and pre-trained weights, which often require extensive hyper-parameter tuning and can sometimes result in underfitting. In this work, we propose Directional Gradient Projection (DiGraP), a novel layer-wise trainable method that incorporates directional information from gradients to bridge regularization and multi-objective optimization. Besides demonstrating our method on image classification, as another contribution we generalize this area to the multi-modal evaluation settings for robust fine-tuning. Specifically, we first bridge the uni-modal and multi-modal gap by performing analysis on Image Classification reformulated Visual Question Answering (VQA) benchmarks and further categorize ten out-of-distribution (OOD) VQA datasets by distribution shift types and degree (i.e. near versus far OOD). Experimental results show that DiGraP consistently outperforms existing baselines across Image Classfication and VQA tasks with discriminative and generative backbones, improving both in-distribution (ID) generalization and OOD robustness.

Accep...

Accepted to ICLR 2025

None
Optimizing Estimators of Squared Calibration Errors in Classification 2025-02-21
Show

In this work, we propose a mean-squared error-based risk that enables the comparison and optimization of estimators of squared calibration errors in practical settings. Improving the calibration of classifiers is crucial for enhancing the trustworthiness and interpretability of machine learning models, especially in sensitive decision-making scenarios. Although various calibration (error) estimators exist in the current literature, there is a lack of guidance on selecting the appropriate estimator and tuning its hyperparameters. By leveraging the bilinear structure of squared calibration errors, we reformulate calibration estimation as a regression problem with independent and identically distributed (i.i.d.) input pairs. This reformulation allows us to quantify the performance of different estimators even for the most challenging calibration criterion, known as canonical calibration. Our approach advocates for a training-validation-testing pipeline when estimating a calibration error on an evaluation dataset. We demonstrate the effectiveness of our pipeline by optimizing existing calibration estimators and comparing them with novel kernel ridge regression-based estimators on standard image classification tasks.

Publi...

Published at TMLR, see https://openreview.net/forum?id=BPDVZajOW5

None
A Novel Riemannian Sparse Representation Learning Network for Polarimetric SAR Image Classification 2025-02-21
Show

Deep learning is an effective end-to-end method for Polarimetric Synthetic Aperture Radar(PolSAR) image classification, but it lacks the guidance of related mathematical principle and is essentially a black-box model. In addition, existing deep models learn features in Euclidean space, where PolSAR complex matrix is commonly converted into a complex-valued vector as the network input, distorting matrix structure and channel relationship. However, the complex covariance matrix is Hermitian positive definite (HPD), and resides on a Riemannian manifold instead of a Euclidean one. Existing methods cannot measure the geometric distance of HPD matrices and easily cause some misclassifications due to inappropriate Euclidean measures. To address these issues, we propose a novel Riemannian Sparse Representation Learning Network (SRSR CNN) for PolSAR images. Firstly, a superpixel-based Riemannian Sparse Representation (SRSR) model is designed to learn the sparse features with Riemannian metric. Then, the optimization procedure of the SRSR model is inferred and further unfolded into an SRSRnet, which can automatically learn the sparse coefficients and dictionary atoms. Furthermore, to learn contextual high-level features, a CNN-enhanced module is added to improve classification performance. The proposed network is a Sparse Representation (SR) guided deep learning model, which can directly utilize the covariance matrix as the network input, and utilize Riemannian metric to learn geometric structure and sparse features of complex matrices in Riemannian space. Experiments on three real PolSAR datasets demonstrate that the proposed method surpasses state-of-the-art techniques in ensuring accurate edge details and correct region homogeneity for classification.

13 pages, 9 figures None
Quantum autoencoders for image classification 2025-02-21
Show

Classical machine learning often struggles with complex, high-dimensional data. Quantum machine learning offers a potential solution, promising more efficient processing. While the quantum convolutional neural network (QCNN), a hybrid quantum-classical algorithm, is suitable for current noisy intermediate-scale quantum-era hardware, its learning process relies heavily on classical computation. Future large-scale, gate-based quantum computers could unlock the full potential of quantum effects in machine learning. In contrast to QCNNs, quantum autoencoders (QAEs) leverage classical optimization solely for parameter tuning. Data compression and reconstruction are handled entirely within quantum circuits, enabling purely quantum-based feature extraction. This study introduces a novel image-classification approach using QAEs, achieving classification without requiring additional qubits compared with conventional QAE implementations. The quantum circuit structure significantly impacts classification accuracy. Unlike hybrid methods such as QCNN, QAE-based classification emphasizes quantum computation. Our experiments demonstrate high accuracy in a four-class classification task, evaluating various quantum-gate configurations to understand the impact of different parameterized quantum circuit (ansatz) structures on classification performance. Our results reveal that specific ansatz structures achieve superior accuracy, and we provide an analysis of their effectiveness. Moreover, the proposed approach achieves performance comparable to that of conventional machine-learning methods while significantly reducing the number of parameters requiring optimization. These findings indicate that QAEs can serve as efficient classification models with fewer parameters and highlight the potential of utilizing quantum circuits for complete end-to-end learning, a departure from hybrid approaches such as QCNN.

None
Steganographic Embeddings as an Effective Data Augmentation 2025-02-21
Show

Image Steganography is a cryptographic technique that embeds secret information into an image, ensuring the hidden data remains undetectable to the human eye while preserving the image's original visual integrity. Least Significant Bit (LSB) Steganography achieves this by replacing the k least significant bits of an image with the k most significant bits of a secret image, maintaining the appearance of the original image while simultaneously encoding the essential elements of the hidden data. In this work, we shift away from conventional applications of steganography in deep learning and explore its potential from a new angle. We present experimental results on CIFAR-10 showing that LSB Steganography, when used as a data augmentation strategy for downstream computer vision tasks such as image classification, can significantly improve the training efficiency of deep neural networks. It can also act as an implicit, uniformly discretized piecewise linear approximation of color augmentations such as (brightness, contrast, hue, and saturation), without introducing additional training overhead through a new joint image training regime that disregards the need for tuning sensitive augmentation hyperparameters.

10 pa...

10 pages, 4 figures. For associated code and experiments, see this http URL https://github.com/nickd16/steganographic-augmentations

Code Link
SCALES: Boost Binary Neural Network for Image Super-Resolution with Efficient Scalings 2025-02-21
Show

Deep neural networks for image super-resolution (SR) have demonstrated superior performance. However, the large memory and computation consumption hinders their deployment on resource-constrained devices. Binary neural networks (BNNs), which quantize the floating point weights and activations to 1-bit can significantly reduce the cost. Although BNNs for image classification have made great progress these days, existing BNNs for SR still suffer from a large performance gap between the FP SR networks. To this end, we observe the activation distribution in SR networks and find much larger pixel-to-pixel, channel-to-channel, layer-to-layer, and image-to-image variation in the activation distribution than image classification networks. However, existing BNNs for SR fail to capture these variations that contain rich information for image reconstruction, leading to inferior performance. To address this problem, we propose SCALES, a binarization method for SR networks that consists of the layer-wise scaling factor, the spatial re-scaling method, and the channel-wise re-scaling method, capturing the layer-wise, pixel-wise, and channel-wise variations efficiently in an input-dependent manner. We evaluate our method across different network architectures and datasets. For CNN-based SR networks, our binarization method SCALES outperforms the prior art method by 0.2dB with fewer parameters and operations. With SCALES, we achieve the first accurate binary Transformer-based SR network, improving PSNR by more than 1dB compared to the baseline method.

Accpe...

Accpeted by DATE 2025

None
Learning to Collaborate: A Capability Vectors-based Architecture for Adaptive Human-AI Decision Making 2025-02-21
Show

Human-AI collaborative decision making has emerged as a pivotal field in recent years. Existing methods treat human and AI as different entities when designing human-AI systems. However, as the decision capabilities of AI models become closer to human beings, it is necessary to build a uniform framework for capability modeling and integrating. In this study, we propose a general architecture for human-AI collaborative decision making, wherein we employ learnable capability vectors to represent the decision-making capabilities of both human experts and AI models. These capability vectors are utilized to determine the decision weights of multiple decision makers, taking into account the contextual information of each decision task. Our proposed architecture accommodates scenarios involving multiple human-AI decision makers with varying capabilities. Furthermore, we introduce a learning-free approach to establish a baseline using global collaborative weights. Experiments on image classification and hate speech detection demonstrate that our proposed architecture significantly outperforms the current state-of-the-art methods in image classification and sentiment analysis, especially for the case with large non-expertise capability levels. Overall, our method provides an effective and robust collaborative decision-making approach that integrates diverse human/AI capabilities within a unified framework.

None
TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba 2025-02-21
Show

Transformers have been favored in both uni-modal and multi-modal foundation models for their flexible scalability in attention modules. Consequently, a number of pre-trained Transformer models, e.g., LLaVA, CLIP, and DEIT, are publicly available. Recent research has introduced subquadratic architectures like Mamba, which enables global awareness with linear complexity. Nevertheless, training specialized subquadratic architectures from scratch for certain tasks is both resource-intensive and time-consuming. As a motivator, we explore cross-architecture training to transfer the ready knowledge in existing Transformer models to alternative architecture Mamba, termed TransMamba. Our approach employs a two-stage strategy to expedite training new Mamba models, ensuring effectiveness in across uni-modal and cross-modal tasks. Concerning architecture disparities, we project the intermediate features into an aligned latent space before transferring knowledge. On top of that, a Weight Subcloning and Adaptive Bidirectional distillation method (WSAB) is introduced for knowledge transfer without limitations on varying layer counts. For cross-modal learning, we propose a cross-Mamba module that integrates language awareness into Mamba's visual features, enhancing the cross-modal interaction capabilities of Mamba architecture. Despite using less than 75% of the training data typically required for training from scratch, TransMamba boasts substantially stronger performance across various network architectures and downstream tasks, including image classification, visual question answering, and text-video retrieval. The code will be publicly available.

None
Stochastic Resonance Improves the Detection of Low Contrast Images in Deep Learning Models 2025-02-20
Show

Stochastic resonance describes the utility of noise in improving the detectability of weak signals in certain types of systems. It has been observed widely in natural and engineered settings, but its utility in image classification with rate-based neural networks has not been studied extensively. In this analysis a simple LSTM recurrent neural network is trained for digit recognition and classification. During the test phase, image contrast is reduced to a point where the model fails to recognize the presence of a stimulus. Controlled noise is added to partially recover classification performance. The results indicate the presence of stochastic resonance in rate-based recurrent neural networks.

MSc Course Project None
Reliable Explainability of Deep Learning Spatial-Spectral Classifiers for Improved Semantic Segmentation in Autonomous Driving 2025-02-20
Show

Integrating hyperspectral imagery (HSI) with deep neural networks (DNNs) can strengthen the accuracy of intelligent vision systems by combining spectral and spatial information, which is useful for tasks like semantic segmentation in autonomous driving. To advance research in such safety-critical systems, determining the precise contribution of spectral information to complex DNNs' output is needed. To address this, several saliency methods, such as class activation maps (CAM), have been proposed primarily for image classification. However, recent studies have raised concerns regarding their reliability. In this paper, we address their limitations and propose an alternative approach by leveraging the data provided by activations and weights from relevant DNN layers to better capture the relationship between input features and predictions. The study aims to assess the superior performance of HSI compared to 3-channel and single-channel DNNs. We also address the influence of spectral signature normalization for enhancing DNN robustness in real-world driving conditions.

None
Bridging Sensor Gaps via Attention Gated Tuning for Hyperspectral Image Classification 2025-02-19
Show

Data-hungry HSI classification methods require high-quality labeled HSIs, which are often costly to obtain. This characteristic limits the performance potential of data-driven methods when dealing with limited annotated samples. Bridging the domain gap between data acquired from different sensors allows us to utilize abundant labeled data across sensors to break this bottleneck. In this paper, we propose a novel Attention-Gated Tuning (AGT) strategy and a triplet-structured transformer model, Tri-Former, to address this issue. The AGT strategy serves as a bridge, allowing us to leverage existing labeled HSI datasets, even RGB datasets to enhance the performance on new HSI datasets with limited samples. Instead of inserting additional parameters inside the basic model, we train a lightweight auxiliary branch that takes intermediate features as input from the basic model and makes predictions. The proposed AGT resolves conflicts between heterogeneous and even cross-modal data by suppressing the disturbing information and enhances the useful information through a soft gate. Additionally, we introduce Tri-Former, a triplet-structured transformer with a spectral-spatial separation design that enhances parameter utilization and computational efficiency, enabling easier and flexible fine-tuning. Comparison experiments conducted on three representative HSI datasets captured by different sensors demonstrate the proposed Tri-Former achieves better performance compared to several state-of-the-art methods. Homologous, heterologous and cross-modal tuning experiments verified the effectiveness of the proposed AGT. Code has been released at: \href{https://github.com/Cecilia-xue/AGT}{https://github.com/Cecilia-xue/AGT}.

Code Link
Carefully Blending Adversarial Training, Purification, and Aggregation Improves Adversarial Robustness 2025-02-19
Show

In this work, we propose a novel adversarial defence mechanism for image classification - CARSO - blending the paradigms of adversarial training and adversarial purification in a synergistic robustness-enhancing way. The method builds upon an adversarially-trained classifier, and learns to map its internal representation associated with a potentially perturbed input onto a distribution of tentative clean reconstructions. Multiple samples from such distribution are classified by the same adversarially-trained model, and a carefully chosen aggregation of its outputs finally constitutes the robust prediction of interest. Experimental evaluation by a well-established benchmark of strong adaptive attacks, across different image datasets, shows that CARSO is able to defend itself against adaptive end-to-end white-box attacks devised for stochastic defences. Paying a modest clean accuracy toll, our method improves by a significant margin the state-of-the-art for Cifar-10, Cifar-100, and TinyImageNet-200 $\ell_\infty$ robust classification accuracy against AutoAttack. Code, and instructions to obtain pre-trained models are available at: https://github.com/emaballarin/CARSO .

25 pa...

25 pages, 1 figure, 16 tables

Code Link
Medical Image Classification with KAN-Integrated Transformers and Dilated Neighborhood Attention 2025-02-19
Show

Convolutional networks, transformers, hybrid models, and Mamba-based architectures have demonstrated strong performance across various medical image classification tasks. However, these methods were primarily designed to classify clean images using labeled data. In contrast, real-world clinical data often involve image corruptions that are unique to multi-center studies and stem from variations in imaging equipment across manufacturers. In this paper, we introduce the Medical Vision Transformer (MedViTV2), a novel architecture incorporating Kolmogorov-Arnold Network (KAN) layers into the transformer architecture for the first time, aiming for generalized medical image classification. We have developed an efficient KAN block to reduce computational load while enhancing the accuracy of the original MedViT. Additionally, to counteract the fragility of our MedViT when scaled up, we propose an enhanced Dilated Neighborhood Attention (DiNA), an adaptation of the efficient fused dot-product attention kernel capable of capturing global context and expanding receptive fields to scale the model effectively and addressing feature collapse issues. Moreover, a hierarchical hybrid strategy is introduced to stack our Local Feature Perception and Global Feature Perception blocks in an efficient manner, which balances local and global feature perceptions to boost performance. Extensive experiments on 17 medical image classification datasets and 12 corrupted medical image datasets demonstrate that MedViTV2 achieved state-of-the-art results in 27 out of 29 experiments with reduced computational complexity. MedViTV2 is 44% more computationally efficient than the previous version and significantly enhances accuracy, achieving improvements of 4.6% on MedMNIST, 5.8% on NonMNIST, and 13.4% on the MedMNIST-C benchmark.

None
LayerAct: Advanced Activation Mechanism for Robust Inference of CNNs 2025-02-18
Show

In this work, we propose a novel activation mechanism called LayerAct for CNNs. This approach is motivated by our theoretical and experimental analyses, which demonstrate that Layer Normalization (LN) can mitigate a limitation of existing activation functions regarding noise robustness. However, LN is known to be disadvantageous in CNNs due to its tendency to make activation outputs homogeneous. The proposed method is designed to be more robust than existing activation functions by reducing the upper bound of influence caused by input shifts without inheriting LN's limitation. We provide analyses and experiments showing that LayerAct functions exhibit superior robustness compared to ElementAct functions. Experimental results on three clean and noisy benchmark datasets for image classification tasks indicate that LayerAct functions outperform other activation functions in handling noisy datasets while achieving superior performance on clean datasets in most cases.

7 pag...

7 pages, 5 figures, 4 tables except acknowledge, reference, and appendix. Accepted for the main track of AAAI 25

None
MaxSup: Overcoming Representation Collapse in Label Smoothing 2025-02-18
Show

Label Smoothing (LS) is widely adopted to curb overconfidence in neural network predictions and enhance generalization. However, previous research shows that LS can force feature representations into excessively tight clusters, eroding intra-class distinctions. More recent findings suggest that LS also induces overconfidence in misclassifications, yet the precise mechanism remained unclear. In this work, we decompose the loss term introduced by LS, revealing two key components: (i) a regularization term that functions only when the prediction is correct, and (ii) an error-enhancement term that emerges under misclassifications. This latter term compels the model to reinforce incorrect predictions with exaggerated certainty, further collapsing the feature space. To address these issues, we propose Max Suppression (MaxSup), which uniformly applies the intended regularization to both correct and incorrect predictions by penalizing the top-1 logit instead of the ground-truth logit. Through feature analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Extensive experiments on image classification and downstream tasks confirm that MaxSup is a more robust alternative to LS. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization.

19 pa...

19 pages, 9 Tables, preliminary work under review do not distribute

Code Link
Benchmarking MedMNIST dataset on real quantum hardware 2025-02-18
Show

Quantum machine learning (QML) has emerged as a promising domain to leverage the computational capabilities of quantum systems to solve complex classification tasks. In this work, we present first comprehensive QML study by benchmarking the MedMNIST-a diverse collection of medical imaging datasets on a 127-qubit real IBM quantum hardware, to evaluate the feasibility and performance of quantum models (without any classical neural networks) in practical applications. This study explore recent advancements in quantum computing such as device-aware quantum circuits, error suppression and mitigation for medical image classification. Our methodology comprised of three stages: preprocessing, generation of noise-resilient and hardware-efficient quantum circuits, optimizing/training of quantum circuits on classical hardware, and inference on real IBM quantum hardware. Firstly, we process all input images in the preprocessing stage to reduce the spatial dimension due to the quantum hardware limitations. We generate hardware-efficient quantum circuits using backend properties expressible to learn complex patterns for medical image classification. After classical optimization of QML models, we perform the inference on real quantum hardware. We also incorporates advanced error suppression and mitigation techniques in our QML workflow including dynamical decoupling (DD), gate twirling, and matrix-free measurement mitigation (M3) to mitigate the effects of noise and improve classification performance. The experimental results showcase the potential of quantum computing for medical imaging and establishes a benchmark for future advancements in QML applied to healthcare.

None
Likelihood-Ratio Regularized Quantile Regression: Adapting Conformal Prediction to High-Dimensional Covariate Shifts 2025-02-18
Show

We consider the problem of conformal prediction under covariate shift. Given labeled data from a source domain and unlabeled data from a covariate shifted target domain, we seek to construct prediction sets with valid marginal coverage in the target domain. Most existing methods require estimating the unknown likelihood ratio function, which can be prohibitive for high-dimensional data such as images. To address this challenge, we introduce the likelihood ratio regularized quantile regression (LR-QR) algorithm, which combines the pinball loss with a novel choice of regularization in order to construct a threshold function without directly estimating the unknown likelihood ratio. We show that the LR-QR method has coverage at the desired level in the target domain, up to a small error term that we can control. Our proofs draw on a novel analysis of coverage via stability bounds from learning theory. Our experiments demonstrate that the LR-QR algorithm outperforms existing methods on high-dimensional prediction tasks, including a regression task for the Communities and Crime dataset, and an image classification task from the WILDS repository.

None
Effects of Degradations on Deep Neural Network Architectures 2025-02-18
Show

Deep convolutional neural networks (CNN) have massively influenced recent advances in large-scale image classification. More recently, a dynamic routing algorithm with capsules (groups of neurons) has shown state-of-the-art recognition performance. However, the behavior of such networks in the presence of a degrading signal (noise) is mostly unexplored. An analytical study on different network architectures toward noise robustness is essential for selecting the appropriate model in a specific application scenario. This paper presents an extensive performance analysis of six deep architectures for image classification on six most common image degradation models. In this study, we have compared VGG-16, VGG-19, ResNet-50, Inception-v3, MobileNet and CapsuleNet architectures on Gaussian white, Gaussian color, salt-and-pepper, Gaussian blur, motion blur and JPEG compression noise models.

10 pages None
RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals 2025-02-18
Show

Transformers have achieved great success in effectively processing sequential data such as text. Their architecture consisting of several attention and feedforward blocks can model relations between elements of a sequence in parallel manner, which makes them very efficient to train and effective in sequence modeling. Even though they have shown strong performance in processing sequential data, the size of their parameters is considerably larger when compared to other architectures such as RNN and CNN based models. Therefore, several approaches have explored parameter sharing and recurrence in Transformer models to address their computational demands. However, such methods struggle to maintain high performance compared to the original transformer model. To address this challenge, we propose our novel approach, RingFormer, which employs one Transformer layer that processes input repeatedly in a circular, ring-like manner, while utilizing low-rank matrices to generate input-dependent level signals. This allows us to reduce the model parameters substantially while maintaining high performance in a variety of tasks such as translation and image classification, as validated in the experiments.

None
DAMamba: Vision State Space Model with Dynamic Adaptive Scan 2025-02-18
Show

State space models (SSMs) have recently garnered significant attention in computer vision. However, due to the unique characteristics of image data, adapting SSMs from natural language processing to computer vision has not outperformed the state-of-the-art convolutional neural networks (CNNs) and Vision Transformers (ViTs). Existing vision SSMs primarily leverage manually designed scans to flatten image patches into sequences locally or globally. This approach disrupts the original semantic spatial adjacency of the image and lacks flexibility, making it difficult to capture complex image structures. To address this limitation, we propose Dynamic Adaptive Scan (DAS), a data-driven method that adaptively allocates scanning orders and regions. This enables more flexible modeling capabilities while maintaining linear computational complexity and global modeling capacity. Based on DAS, we further propose the vision backbone DAMamba, which significantly outperforms current state-of-the-art vision Mamba models in vision tasks such as image classification, object detection, instance segmentation, and semantic segmentation. Notably, it surpasses some of the latest state-of-the-art CNNs and ViTs. Code will be available at https://github.com/ltzovo/DAMamba.

Code Link
Human and AI Perceptual Differences in Image Classification Errors 2025-02-18
Show

Artificial intelligence (AI) models for computer vision trained with supervised machine learning are assumed to solve classification tasks by imitating human behavior learned from training labels. Most efforts in recent vision research focus on measuring the model task performance using standardized benchmarks such as accuracy. However, limited work has sought to understand the perceptual difference between humans and machines. To fill this gap, this study first analyzes the statistical distributions of mistakes from the two sources and then explores how task difficulty level affects these distributions. We find that even when AI learns an excellent model from the training data, one that outperforms humans in overall accuracy, these AI models have significant and consistent differences from human perception. We demonstrate the importance of studying these differences with a simple human-AI teaming algorithm that outperforms humans alone, AI alone, or AI-AI teaming.

AAAI 25 Oral None
When Segmentation Meets Hyperspectral Image: New Paradigm for Hyperspectral Image Classification 2025-02-18
Show

Hyperspectral image (HSI) classification is a cornerstone of remote sensing, enabling precise material and land-cover identification through rich spectral information. While deep learning has driven significant progress in this task, small patch-based classifiers, which account for over 90% of the progress, face limitations: (1) the small patch (e.g., 7x7, 9x9)-based sampling approach considers a limited receptive field, resulting in insufficient spatial structural information critical for object-level identification and noise-like misclassifications even within uniform regions; (2) undefined optimal patch sizes lead to coarse label predictions, which degrade performance; and (3) a lack of multi-shape awareness around objects. To address these challenges, we draw inspiration from large-scale image segmentation techniques, which excel at handling object boundaries-a capability essential for semantic labeling in HSI classification. However, their application remains under-explored in this task due to (1) the prevailing notion that larger patch sizes degrade performance, (2) the extensive unlabeled regions in HSI groundtruth, and (3) the misalignment of input shapes between HSI data and segmentation models. Thus, in this study, we propose a novel paradigm and baseline, HSIseg, for HSI classification that leverages segmentation techniques combined with a novel Dynamic Shifted Regional Transformer (DSRT) to overcome these challenges. We also introduce an intuitive progressive learning framework with adaptive pseudo-labeling to iteratively incorporate unlabeled regions into the training process, thereby advancing the application of segmentation techniques. Additionally, we incorporate auxiliary data through multi-source data collaboration, promoting better feature interaction. Validated on five public HSI datasets, our proposal outperforms state-of-the-art methods.

None
CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features 2025-02-18
Show

Multimodal encoders like CLIP excel in tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate multimodal encoders using limited data. CSA maps unimodal features into a multimodal space, using a new similarity score to retain only the multimodal information. CSA only involves the inference of unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. Experiments show that CSA outperforms CLIP while requiring $50,000\times$ fewer multimodal data pairs to bridge the modalities given pre-trained unimodal encoders on ImageNet classification and misinformative news caption detection. CSA surpasses the state-of-the-art method to map unimodal features to multimodal features. We also demonstrate the ability of CSA with modalities beyond image and text, paving the way for future modality pairs with limited paired multimodal data but abundant unpaired unimodal data, such as lidar and text.

None
OCT Data is All You Need: How Vision Transformers with and without Pre-training Benefit Imaging 2025-02-17
Show

Optical Coherence Tomography (OCT) provides high-resolution cross-sectional images useful for diagnosing various diseases, but their distinct characteristics from natural images raise questions about whether large-scale pre-training on datasets like ImageNet is always beneficial. In this paper, we investigate the impact of ImageNet-based pre-training on Vision Transformer (ViT) performance for OCT image classification across different dataset sizes. Our experiments cover four-category retinal pathologies (CNV, DME, Drusen, Normal). Results suggest that while pre-training can accelerate convergence and potentially offer better performance in smaller datasets, training from scratch may achieve comparable or even superior accuracy when sufficient OCT data is available. Our findings highlight the importance of matching domain characteristics in pre-training and call for further study on large-scale OCT-specific pre-training.

None
Metalearning Continual Learning Algorithms 2025-02-17
Show

General-purpose learning systems should improve themselves in open-ended fashion in ever-changing environments. Conventional learning algorithms for neural networks, however, suffer from catastrophic forgetting (CF), i.e., previously acquired skills are forgotten when a new task is learned. Instead of hand-crafting new algorithms for avoiding CF, we propose Automated Continual Learning (ACL) to train self-referential neural networks to metalearn their own in-context continual (meta)learning algorithms. ACL encodes continual learning (CL) desiderata -- good performance on both old and new tasks -- into its metalearning objectives. Our experiments demonstrate that ACL effectively resolves "in-context catastrophic forgetting," a problem that naive in-context learning algorithms suffer from; ACL-learned algorithms outperform both hand-crafted learning algorithms and popular meta-continual learning methods on the Split-MNIST benchmark in the replay-free setting, and enables continual learning of diverse tasks consisting of multiple standard image classification datasets. We also discuss the current limitations of in-context CL by comparing ACL with state-of-the-art CL methods that leverage pre-trained models. Overall, we bring several novel perspectives into the long-standing problem of CL.

Accep...

Accepted to TMLR 02/2025. An earlier version of this work titled "Automating Continual Learning" was made available online in 2023

None
Knowledge Swapping via Learning and Unlearning 2025-02-17
Show

We introduce \textbf{Knowledge Swapping}, a novel task designed to selectively regulate knowledge of a pretrained model by enabling the forgetting of user-specified information, retaining essential knowledge, and acquiring new knowledge simultaneously. By delving into the analysis of knock-on feature hierarchy, we find that incremental learning typically progresses from low-level representations to higher-level semantics, whereas forgetting tends to occur in the opposite direction-starting from high-level semantics and moving down to low-level features. Building upon this, we propose to benchmark the knowledge swapping task with the strategy of \textit{Learning Before Forgetting}. Comprehensive experiments on various tasks like image classification, object detection, and semantic segmentation validate the effectiveness of the proposed strategy. The source code is available at \href{https://github.com/xingmingyu123456/KnowledgeSwapping}{https://github.com/xingmingyu123456/KnowledgeSwapping}.

10 pages Code Link
PrototypeFormer: Learning to Explore Prototype Relationships for Few-shot Image Classification 2025-02-17
Show

Few-shot image classification has received considerable attention for overcoming the challenge of limited classification performance with limited samples in novel classes. Most existing works employ sophisticated learning strategies and feature learning modules to alleviate this challenge. In this paper, we propose a novel method called PrototypeFormer, exploring the relationships among category prototypes in the few-shot scenario. Specifically, we utilize a transformer architecture to build a prototype extraction module, aiming to extract class representations that are more discriminative for few-shot classification. Besides, during the model training process, we propose a contrastive learning-based optimization approach to optimize prototype features in few-shot learning scenarios. Despite its simplicity, our method performs remarkably well, with no bells and whistles. We have experimented with our approach on several popular few-shot image classification benchmark datasets, which shows that our method outperforms all current state-of-the-art methods. In particular, our method achieves 97.07% and 90.88% on 5-way 5-shot and 5-way 1-shot tasks of miniImageNet, which surpasses the state-of-the-art results with accuracy of 0.57% and 6.84%, respectively. The code will be released later.

Submi...

Submitted to Neurocomputing

None
REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning 2025-02-17
Show

Recent rehearsal-free methods, guided by prompts, excel in vision-related continual learning (CL) with drifting data but lack resource efficiency, making real-world deployment challenging. In this paper, we introduce Resource-Efficient Prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during new-task learning. Extensive experiments on multiple image classification datasets demonstrates REP's superior resource efficiency over state-of-the-art ViT- and CNN-based methods.

None
Leveraging Conditional Mutual Information to Improve Large Language Model Fine-Tuning For Classification 2025-02-16
Show

Although large language models (LLMs) have demonstrated remarkable capabilities in recent years, the potential of information theory (IT) to enhance LLM development remains underexplored. This paper introduces the information theoretic principle of Conditional Mutual Information (CMI) to LLM fine-tuning for classification tasks, exploring its promise in two main ways: minimizing CMI to improve a model's standalone performance and maximizing CMI to enhance knowledge distillation (KD) for more capable student models. To apply CMI in LLM fine-tuning, we adapt the recently proposed CMI-constrained deep learning framework, which was initially developed for image classification, with some modification. By minimizing CMI during LLM fine-tuning, we achieve superior performance gains on 6 of 8 GLUE classification tasks compared to BERT. Additionally, maximizing CMI during the KD process results in significant performance improvements in 6 of 8 GLUE classification tasks compared to DistilBERT. These findings demonstrate CMI's adaptability for optimizing both standalone LLMs and student models, showcasing its potential as a robust framework for advancing LLM fine-tuning. Our work bridges the gap between information theory and LLM development, offering new insights for building high-performing language models.

6 pages, 2 figures None
An Enhancement of CNN Algorithm for Rice Leaf Disease Image Classification in Mobile Applications 2025-02-16
Show

This study focuses on enhancing rice leaf disease image classification algorithms, which have traditionally relied on Convolutional Neural Network (CNN) models. We employed transfer learning with MobileViTV2_050 using ImageNet-1k weights, a lightweight model that integrates CNN's local feature extraction with Vision Transformers' global context learning through a separable self-attention mechanism. Our approach resulted in a significant 15.66% improvement in classification accuracy for MobileViTV2_050-A, our first enhanced model trained on the baseline dataset, achieving 93.14%. Furthermore, MobileViTV2_050-B, our second enhanced model trained on a broader rice leaf dataset, demonstrated a 22.12% improvement, reaching 99.6% test accuracy. Additionally, MobileViTV2-A attained an F1-score of 93% across four rice labels and a Receiver Operating Characteristic (ROC) curve ranging from 87% to 97%. In terms of resource consumption, our enhanced models reduced the total parameters of the baseline CNN model by up to 92.50%, from 14 million to 1.1 million. These results indicate that MobileViTV2_050 not only improves computational efficiency through its separable self-attention mechanism but also enhances global context learning. Consequently, it offers a lightweight and robust solution suitable for mobile deployment, advancing the interpretability and practicality of models in precision agriculture.

Prese...

Presented at 46th World Conference on Applied Science, Engineering & Technology (WCASET) from Institute for Educational Research and Publication (IFERP)

None
JPEG Inspired Deep Learning 2025-02-16
Show

Although it is traditionally believed that lossy image compression, such as JPEG compression, has a negative impact on the performance of deep neural networks (DNNs), it is shown by recent works that well-crafted JPEG compression can actually improve the performance of deep learning (DL). Inspired by this, we propose JPEG-DL, a novel DL framework that prepends any underlying DNN architecture with a trainable JPEG compression layer. To make the quantization operation in JPEG compression trainable, a new differentiable soft quantizer is employed at the JPEG layer, and then the quantization operation and underlying DNN are jointly trained. Extensive experiments show that in comparison with the standard DL, JPEG-DL delivers significant accuracy improvements across various datasets and model architectures while enhancing robustness against adversarial attacks. Particularly, on some fine-grained image classification datasets, JPEG-DL can increase prediction accuracy by as much as 20.9%. Our code is available on https://github.com/AhmedHussKhalifa/JPEG-Inspired-DL.git.

Code Link
Simulations of Common Unsupervised Domain Adaptation Algorithms for Image Classification 2025-02-15
Show

Traditional machine learning assumes that training and test sets are derived from the same distribution; however, this assumption does not always hold in practical applications. This distribution disparity can lead to severe performance drops when the trained model is used in new data sets. Domain adaptation (DA) is a machine learning technique that aims to address this problem by reducing the differences between domains. This paper presents simulation-based algorithms of recent DA techniques, mainly related to unsupervised domain adaptation (UDA), where labels are available only in the source domain. Our study compares these techniques with public data sets and diverse characteristics, highlighting their respective strengths and drawbacks. For example, Safe Self-Refinement for Transformer-based DA (SSRT) achieved the highest accuracy (91.6%) in the office-31 data set during our simulations, however, the accuracy dropped to 72.4% in the Office-Home data set when using limited batch sizes. In addition to improving the reader's comprehension of recent techniques in DA, our study also highlights challenges and upcoming directions for research in this domain. The codes are available at https://github.com/AIPMLab/Domain_Adaptation.

Accepted in IEEE TIM Code Link
PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction 2025-02-15
Show

Asynchronous pipeline model parallelism with a "1F1B" (one forward, one backward) schedule generates little bubble overhead and always provides quite a high throughput. However, the "1F1B" schedule inevitably leads to weight inconsistency and weight staleness issues due to the cross-training of different mini-batches across GPUs. To simultaneously address these two problems, in this paper, we propose an optimizer-dependent weight prediction strategy (a.k.a PipeOptim) for asynchronous pipeline training. The key insight of our proposal is that we employ a weight prediction strategy in the forward pass to ensure that each mini-batch uses consistent and staleness-free weights to compute the forward pass. To be concrete, we first construct the weight prediction scheme based on the update rule of the used optimizer when training the deep neural network models. Then throughout the "1F1B" pipelined training, each mini-batch is mandated to execute weight prediction ahead of the forward pass, subsequently employing the predicted weights to perform the forward pass. As a result, PipeOptim 1) inherits the advantage of the "1F1B" schedule and generates pretty high throughput, and 2) can ensure effective parameter learning regardless of the type of the used optimizer. To verify the effectiveness of our proposal, we conducted extensive experimental evaluations using eight different deep-learning models spanning three machine-learning tasks including image classification, sentiment analysis, and machine translation. The experiment results demonstrate that PipeOptim outperforms the popular pipelined approaches including GPipe, PipeDream, PipeDream-2BW, and SpecTrain. The code of PipeOptim can be accessible at https://github.com/guanleics/PipeOptim.

15 pages Code Link
REAL: Realism Evaluation of Text-to-Image Generation Models for Effective Data Augmentation 2025-02-15
Show

Recent advancements in text-to-image (T2I) generation models have transformed the field. However, challenges persist in generating images that reflect demanding textual descriptions, especially for fine-grained details and unusual relationships. Existing evaluation metrics focus on text-image alignment but overlook the realism of the generated image, which can be crucial for downstream applications like data augmentation in machine learning. To address this gap, we propose REAL, an automatic evaluation framework that assesses realism of T2I outputs along three dimensions: fine-grained visual attributes, unusual visual relationships, and visual styles. REAL achieves a Spearman's rho score of up to 0.62 in alignment with human judgement and demonstrates utility in ranking and filtering augmented data for tasks like image captioning, classification, and visual relationship detection. Empirical results show that high-scoring images evaluated by our metrics improve F1 scores of image classification by up to 11.3%, while low-scoring ones degrade that by up to 4.95%. We benchmark four major T2I models across the realism dimensions, providing insights for future improvements in T2I output realism.

None
Segment Any Change 2025-02-15
Show

Visual foundation models have achieved remarkable results in zero-shot image classification and segmentation, but zero-shot change detection remains an open problem. In this paper, we propose the segment any change models (AnyChange), a new type of change detection model that supports zero-shot prediction and generalization on unseen change types and data distributions. AnyChange is built on the segment anything model (SAM) via our training-free adaptation method, bitemporal latent matching. By revealing and exploiting intra-image and inter-image semantic similarities in SAM's latent space, bitemporal latent matching endows SAM with zero-shot change detection capabilities in a training-free way. We also propose a point query mechanism to enable AnyChange's zero-shot object-centric change detection capability. We perform extensive experiments to confirm the effectiveness of AnyChange for zero-shot change detection. AnyChange sets a new record on the SECOND benchmark for unsupervised change detection, exceeding the previous SOTA by up to 4.4% F$_1$ score, and achieving comparable accuracy with negligible manual annotations (1 pixel per image) for supervised change detection. Code is available at https://github.com/Z-Zheng/pytorch-change-models.

NeurIPS 2024 Code Link
Mitigating the Impact of Labeling Errors on Training via Rockafellian Relaxation 2025-02-14
Show

Labeling errors in datasets are common, arising in a variety of contexts, such as human labeling, noisy labeling, and weak labeling (i.e., image classification). Although neural networks (NNs) can tolerate modest amounts of these errors, their performance degrades substantially once error levels exceed a certain threshold. We propose a new loss reweighting, architecture-independent methodology, Rockafellian Relaxation Method (RRM) for neural network training. Experiments indicate RRM can enhance neural network methods to achieve robust performance across classification tasks in computer vision and natural language processing (sentiment analysis). We find that RRM can mitigate the effects of dataset contamination stemming from both (heavy) labeling error and/or adversarial perturbation, demonstrating effectiveness across a variety of data domains and machine learning tasks.

None
Simplifying DINO via Coding Rate Regularization 2025-02-14
Show

DINO and DINOv2 are two model families being widely used to learn representations from unlabeled imagery data at large scales. Their learned representations often enable state-of-the-art performance for downstream tasks, such as image classification and segmentation. However, they employ many empirically motivated design choices and their training pipelines are highly complex and unstable -- many hyperparameters need to be carefully tuned to ensure that the representations do not collapse -- which poses considerable difficulty to improving them or adapting them to new domains. In this work, we posit that we can remove most such-motivated idiosyncrasies in the pre-training pipelines, and only need to add an explicit coding rate term in the loss function to avoid collapse of the representations. As a result, we obtain highly simplified variants of the DINO and DINOv2 which we call SimDINO and SimDINOv2, respectively. Remarkably, these simplified models are more robust to different design choices, such as network architecture and hyperparameters, and they learn even higher-quality representations, measured by performance on downstream tasks, offering a Pareto improvement over the corresponding DINO and DINOv2 models. This work highlights the potential of using simplifying design principles to improve the empirical practice of deep learning.

17 pages, 5 figures None
Ocular Disease Classification Using CNN with Deep Convolutional Generative Adversarial Network 2025-02-14
Show

The Convolutional Neural Network (CNN) has shown impressive performance in image classification because of its strong learning capabilities. However, it demands a substantial and balanced dataset for effective training. Otherwise, networks frequently exhibit over fitting and struggle to generalize to new examples. Publicly available dataset of fundus images of ocular disease is insufficient to train any classification model to achieve satisfactory accuracy. So, we propose Generative Adversarial Network(GAN) based data generation technique to synthesize dataset for training CNN based classification model and later use original disease containing ocular images to test the model. During testing the model classification accuracy with the original ocular image, the model achieves an accuracy rate of 78.6% for myopia, 88.6% for glaucoma, and 84.6% for cataract, with an overall classification accuracy of 84.6%.

None
SeWA: Selective Weight Average via Probabilistic Masking 2025-02-14
Show

Weight averaging has become a standard technique for enhancing model performance. However, methods such as Stochastic Weight Averaging (SWA) and Latest Weight Averaging (LAWA) often require manually designed procedures to sample from the training trajectory, and the results depend heavily on hyperparameter tuning. To minimize human effort, this paper proposes a simple yet efficient algorithm called Selective Weight Averaging (SeWA), which adaptively selects checkpoints during the final stages of training for averaging. Based on SeWA, we show that only a few points are needed to achieve better generalization and faster convergence. Theoretically, solving the discrete subset selection problem is inherently challenging. To address this, we transform it into a continuous probabilistic optimization framework and employ the Gumbel-Softmax estimator to learn the non-differentiable mask for each checkpoint. Further, we theoretically derive the SeWA's stability-based generalization bounds, which are sharper than that of SGD under both convex and non-convex assumptions. Finally, solid extended experiments in various domains, including behavior cloning, image classification, and text classification, further validate the effectiveness of our approach.

None
HaSPeR: An Image Repository for Hand Shadow Puppet Recognition 2025-02-14
Show

Hand shadow puppetry, also known as shadowgraphy or ombromanie, is a form of theatrical art and storytelling where hand shadows are projected onto flat surfaces to create illusions of living creatures. The skilled performers create these silhouettes by hand positioning, finger movements, and dexterous gestures to resemble shadows of animals and objects. Due to the lack of practitioners and a seismic shift in people's entertainment standards, this art form is on the verge of extinction. To facilitate its preservation and proliferate it to a wider audience, we introduce ${\rm H{\small A}SP{\small E}R}$, a novel dataset consisting of 15,000 images of hand shadow puppets across 15 classes extracted from both professional and amateur hand shadow puppeteer clips. We provide a detailed statistical analysis of the dataset and employ a range of pretrained image classification models to establish baselines. Our findings show a substantial performance superiority of skip-connected convolutional models over attention-based transformer architectures. We also find that lightweight models, such as MobileNetV2, suited for mobile applications and embedded devices, perform comparatively well. We surmise that such low-latency architectures can be useful in developing ombromanie teaching tools, and we create a prototype application to explore this surmission. Keeping the best-performing model ResNet34 under the limelight, we conduct comprehensive feature-spatial, explainability, and error analyses to gain insights into its decision-making process. To the best of our knowledge, this is the first documented dataset and research endeavor to preserve this dying art for future generations, with computer vision approaches. Our code and data will be publicly available.

Submi...

Submitted to Machine Vision and Applications, 13 pages, 105 figures, 2 tables

None
Self-Training: A Survey 2025-02-14
Show

Semi-supervised algorithms aim to learn prediction functions from a small set of labeled observations and a large set of unlabeled observations. Because this framework is relevant in many applications, they have received a lot of interest in both academia and industry. Among the existing techniques, self-training methods have undoubtedly attracted greater attention in recent years. These models are designed to find the decision boundary on low density regions without making additional assumptions about the data distribution, and use the unsigned output score of a learned classifier, or its margin, as an indicator of confidence. The working principle of self-training algorithms is to learn a classifier iteratively by assigning pseudo-labels to the set of unlabeled training samples with a margin greater than a certain threshold. The pseudo-labeled examples are then used to enrich the labeled training data and to train a new classifier in conjunction with the labeled training set. In this paper, we present self-training methods for binary and multi-class classification; as well as their variants and two related approaches, namely consistency-based approaches and transductive learning. We examine the impact of significant self-training features on various methods, using different general and image classification benchmarks, and we discuss our ideas for future research in self-training. To the best of our knowledge, this is the first thorough and complete survey on this subject.

43 pages, 1 figure None
On Space Folds of ReLU Neural Networks 2025-02-14
Show

Recent findings suggest that the consecutive layers of ReLU neural networks can be understood geometrically as space folding transformations of the input space, revealing patterns of self-similarity. In this paper, we present the first quantitative analysis of this space folding phenomenon in ReLU neural networks. Our approach focuses on examining how straight paths in the Euclidean input space are mapped to their counterparts in the Hamming activation space. In this process, the convexity of straight lines is generally lost, giving rise to non-convex folding behavior. To quantify this effect, we introduce a novel measure based on range metrics, similar to those used in the study of random walks, and provide the proof for the equivalence of convexity notions between the input and activation spaces. Furthermore, we provide empirical analysis on a geometrical analysis benchmark (CantorNet) as well as an image classification benchmark (MNIST). Our work advances the understanding of the activation space in ReLU neural networks by leveraging the phenomena of geometric folding, providing valuable insights on how these models process input information.

Accep...

Accepted at Transactions on Machine Learning Research (TMLR), 2025

None
Low Tensor-Rank Adaptation of Kolmogorov--Arnold Networks 2025-02-14
Show

Kolmogorov--Arnold networks (KANs) have demonstrated their potential as an alternative to multi-layer perceptions (MLPs) in various domains, especially for science-related tasks. However, transfer learning of KANs remains a relatively unexplored area. In this paper, inspired by Tucker decomposition of tensors and evidence on the low tensor-rank structure in KAN parameter updates, we develop low tensor-rank adaptation (LoTRA) for fine-tuning KANs. We study the expressiveness of LoTRA based on Tucker decomposition approximations. Furthermore, we provide a theoretical analysis to select the learning rates for each LoTRA component to enable efficient training. Our analysis also shows that using identical learning rates across all components leads to inefficient training, highlighting the need for an adaptive learning rate strategy. Beyond theoretical insights, we explore the application of LoTRA for efficiently solving various partial differential equations (PDEs) by fine-tuning KANs. Additionally, we propose Slim KANs that incorporate the inherent low-tensor-rank properties of KAN parameter tensors to reduce model size while maintaining superior performance. Experimental results validate the efficacy of the proposed learning rate selection strategy and demonstrate the effectiveness of LoTRA for transfer learning of KANs in solving PDEs. Further evaluations on Slim KANs for function representation and image classification tasks highlight the expressiveness of LoTRA and the potential for parameter reduction through low tensor-rank decomposition.

None
A CNN Approach to Automated Detection and Classification of Brain Tumors 2025-02-13
Show

Brain tumors require an assessment to ensure timely diagnosis and effective patient treatment. Morphological factors such as size, location, texture, and variable appearance complicate tumor inspection. Medical imaging presents challenges, including noise and incomplete images. This research article presents a methodology for processing Magnetic Resonance Imaging (MRI) data, encompassing techniques for image classification and denoising. The effective use of MRI images allows medical professionals to detect brain disorders, including tumors. This research aims to categorize healthy brain tissue and brain tumors by analyzing the provided MRI data. Unlike alternative methods like Computed Tomography (CT), MRI technology offers a more detailed representation of internal anatomical components, making it a suitable option for studying data related to brain tumors. The MRI picture is first subjected to a denoising technique utilizing an Anisotropic diffusion filter. The dataset utilized for the models creation is a publicly accessible and validated Brain Tumour Classification (MRI) database, comprising 3,264 brain MRI scans. SMOTE was employed for data augmentation and dataset balancing. Convolutional Neural Networks(CNN) such as ResNet152V2, VGG, ViT, and EfficientNet were employed for the classification procedure. EfficientNet attained an accuracy of 98%, the highest recorded.

None
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis 2025-02-13
Show

The continuous operation of Earth-orbiting satellites generates vast and ever-growing archives of Remote Sensing (RS) images. Natural language presents an intuitive interface for accessing, querying, and interpreting the data from such archives. However, existing Vision-Language Models (VLMs) are predominantly trained on web-scraped, noisy image-text data, exhibiting limited exposure to the specialized domain of RS. This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead emphasize solely on attributes like date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises of 205,150 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated to different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a diverse range of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset provides a spatially and temporally balanced distribution, spanning across the globe, covering the last 25 years with a balanced temporal distribution of observations. GAIA's construction involved a two-stage process: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions for each image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. Our extensive experiments, including fine-tuning of CLIP and BLIP2 models, demonstrate that GAIA significantly improves performance on RS image classification, cross-modal retrieval and image captioning tasks.

22 pages, 13 figures None
Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery 2025-02-13
Show

Vision Transformers (ViT) have recently brought a new wave of research in the field of computer vision. These models have performed particularly well in image classification and segmentation. Research on semantic and instance segmentation has accelerated with the introduction of the new architecture, with over 80% of the top 20 benchmarks for the iSAID dataset based on either the ViT architecture or the attention mechanism behind its success. This paper focuses on the heuristic comparison of three key factors of using (or not using) ViT for semantic segmentation of remote sensing aerial images on the iSAID dataset. The experimental results observed during this research were analyzed based on three objectives. First, we studied the use of a weighted fused loss function to maximize the mean Intersection over Union (mIoU) score and Dice score while minimizing entropy or class representation loss. Second, we compared transfer learning on Meta's MaskFormer, a ViT-based semantic segmentation model, against a generic UNet Convolutional Neural Network (CNN) based on mIoU, Dice scores, training efficiency, and inference time. Third, we examined the trade-offs between the two models in comparison to current state-of-the-art segmentation models. We show that the novel combined weighted loss function significantly boosts the CNN model's performance compared to transfer learning with ViT. The code for this implementation can be found at: https://github.com/ashimdahal/ViT-vs-CNN-Image-Segmentation.

Code Link
Impact of Batch Normalization on Convolutional Network Representations 2025-02-13
Show

Batch normalization (BatchNorm) is a popular layer normalization technique used when training deep neural networks. It has been shown to enhance the training speed and accuracy of deep learning models. However, the mechanics by which BatchNorm achieves these benefits is an active area of research, and different perspectives have been proposed. In this paper, we investigate the effect of BatchNorm on the resulting hidden representations, that is, the vectors of activation values formed as samples are processed at each hidden layer. Specifically, we consider the sparsity of these representations, as well as their implicit clustering -- the creation of groups of representations that are similar to some extent. We contrast image classification models trained with and without batch normalization and highlight consistent differences observed. These findings highlight that BatchNorm's effect on representational sparsity is not a significant factor affecting generalization, while the representations of models trained with BatchNorm tend to show more advantageous clustering characteristics.

None
Feature-based Graph Attention Networks Improve Online Continual Learning 2025-02-13
Show

Online continual learning for image classification is crucial for models to adapt to new data while retaining knowledge of previously learned tasks. This capability is essential to address real-world challenges involving dynamic environments and evolving data distributions. Traditional approaches predominantly employ Convolutional Neural Networks, which are limited to processing images as grids and primarily capture local patterns rather than relational information. Although the emergence of transformer architectures has improved the ability to capture relationships, these models often require significantly larger resources. In this paper, we present a novel online continual learning framework based on Graph Attention Networks (GATs), which effectively capture contextual relationships and dynamically update the task-specific representation via learned attention weights. Our approach utilizes a pre-trained feature extractor to convert images into graphs using hierarchical feature maps, representing information at varying levels of granularity. These graphs are then processed by a GAT and incorporate an enhanced global pooling strategy to improve classification performance for continual learning. In addition, we propose the rehearsal memory duplication technique that improves the representation of the previous tasks while maintaining the memory budget. Comprehensive evaluations on benchmark datasets, including SVHN, CIFAR10, CIFAR100, and MiniImageNet, demonstrate the superiority of our method compared to the state-of-the-art methods.

16 pages None
Hierarchical Vision Transformer with Prototypes for Interpretable Medical Image Classification 2025-02-13
Show

Explainability is a highly demanded requirement for applications in high-risk areas such as medicine. Vision Transformers have mainly been limited to attention extraction to provide insight into the model's reasoning. Our approach combines the high performance of Vision Transformers with the introduction of new explainability capabilities. We present HierViT, a Vision Transformer that is inherently interpretable and adapts its reasoning to that of humans. A hierarchical structure is used to process domain-specific features for prediction. It is interpretable by design, as it derives the target output with human-defined features that are visualized by exemplary images (prototypes). By incorporating domain knowledge about these decisive features, the reasoning is semantically similar to human reasoning and therefore intuitive. Moreover, attention heatmaps visualize the crucial regions for identifying each feature, thereby providing HierViT with a versatile tool for validating predictions. Evaluated on two medical benchmark datasets, LIDC-IDRI for lung nodule assessment and derm7pt for skin lesion classification, HierViT achieves superior and comparable prediction accuracy, respectively, while offering explanations that align with human reasoning.

None
For Better or For Worse? Learning Minimum Variance Features With Label Augmentation 2025-02-13
Show

Data augmentation has been pivotal in successfully training deep learning models on classification tasks over the past decade. An important subclass of data augmentation techniques - which includes both label smoothing and Mixup - involves modifying not only the input data but also the input label during model training. In this work, we analyze the role played by the label augmentation aspect of such methods. We first prove that linear models on binary classification data trained with label augmentation learn only the minimum variance features in the data, while standard training (which includes weight decay) can learn higher variance features. We then use our techniques to show that even for nonlinear models and general data distributions, the label smoothing and Mixup losses are lower bounded by a function of the model output variance. Lastly, we demonstrate empirically that this aspect of label smoothing and Mixup can be a positive and a negative. On the one hand, we show that the strong performance of label smoothing and Mixup on image classification benchmarks is correlated with learning low variance hidden representations. On the other hand, we show that Mixup and label smoothing can be more susceptible to low variance spurious correlations in the training data.

ICLR ...

ICLR 2025, 25 pages, 8 figures

None
ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification 2025-02-12
Show

Multiple instance learning (MIL)-based framework has become the mainstream for processing the whole slide image (WSI) with giga-pixel size and hierarchical image context in digital pathology. However, these methods heavily depend on a substantial number of bag-level labels and solely learn from the original slides, which are easily affected by variations in data distribution. Recently, vision language model (VLM)-based methods introduced the language prior by pre-training on large-scale pathological image-text pairs. However, the previous text prompt lacks the consideration of pathological prior knowledge, therefore does not substantially boost the model's performance. Moreover, the collection of such pairs and the pre-training process are very time-consuming and source-intensive.To solve the above problems, we propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification. Specifically, we propose a dual-scale visual descriptive text prompt based on the frozen large language model (LLM) to boost the performance of VLM effectively. To transfer the VLM to process WSI efficiently, for the image branch, we propose a prototype-guided patch decoder to aggregate the patch features progressively by grouping similar patches into the same prototype; for the text branch, we introduce a context-guided text decoder to enhance the text features by incorporating the multi-granular image contexts. Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.

CVPR ...

CVPR 2024 (Updated version with corrections for typos and errors.)

None
Robust Visual Representation Learning with Multi-modal Prior Knowledge for Image Classification Under Distribution Shift 2025-02-12
Show

Despite the remarkable success of deep neural networks (DNNs) in computer vision, they fail to remain high-performing when facing distribution shifts between training and testing data. In this paper, we propose Knowledge-Guided Visual representation learning (KGV) - a distribution-based learning approach leveraging multi-modal prior knowledge - to improve generalization under distribution shift. It integrates knowledge from two distinct modalities: 1) a knowledge graph (KG) with hierarchical and association relationships; and 2) generated synthetic images of visual elements semantically represented in the KG. The respective embeddings are generated from the given modalities in a common latent space, i.e., visual embeddings from original and synthetic images as well as knowledge graph embeddings (KGEs). These embeddings are aligned via a novel variant of translation-based KGE methods, where the node and relation embeddings of the KG are modeled as Gaussian distributions and translations, respectively. We claim that incorporating multi-model prior knowledge enables more regularized learning of image representations. Thus, the models are able to better generalize across different data distributions. We evaluate KGV on different image classification tasks with major or minor distribution shifts, namely road sign classification across datasets from Germany, China, and Russia, image classification with the mini-ImageNet dataset and its variants, as well as the DVM-CAR dataset. The results demonstrate that KGV consistently exhibits higher accuracy and data efficiency across all experiments.

None
Keep your distance: learning dispersed embeddings on $\mathbb{S}_d$ 2025-02-12
Show

Learning well-separated features in high-dimensional spaces, such as text or image embeddings, is crucial for many machine learning applications. Achieving such separation can be effectively accomplished through the dispersion of embeddings, where unrelated vectors are pushed apart as much as possible. By constraining features to be on a hypersphere, we can connect dispersion to well-studied problems in mathematics and physics, where optimal solutions are known for limited low-dimensional cases. However, in representation learning we typically deal with a large number of features in high-dimensional space, and moreover, dispersion is usually traded off with some other task-oriented training objective, making existing theoretical and numerical solutions inapplicable. Therefore, it is common to rely on gradient-based methods to encourage dispersion, usually by minimizing some function of the pairwise distances. In this work, we first give an overview of existing methods from disconnected literature, making new connections and highlighting similarities. Next, we introduce some new angles. We propose to reinterpret pairwise dispersion using a maximum mean discrepancy (MMD) motivation. We then propose an online variant of the celebrated Lloyd's algorithm, of K-Means fame, as an effective alternative regularizer for dispersion on generic domains. Finally, we derive a novel dispersion method that directly exploits properties of the hypersphere. Our experiments show the importance of dispersion in image classification and natural language processing tasks, and how algorithms exhibit different trade-offs in different regimes.

None
From Layers to States: A State Space Model Perspective to Deep Neural Network Layer Dynamics 2025-02-12
Show

The depth of neural networks is a critical factor for their capability, with deeper models often demonstrating superior performance. Motivated by this, significant efforts have been made to enhance layer aggregation - reusing information from previous layers to better extract features at the current layer, to improve the representational power of deep neural networks. However, previous works have primarily addressed this problem from a discrete-state perspective which is not suitable as the number of network layers grows. This paper novelly treats the outputs from layers as states of a continuous process and considers leveraging the state space model (SSM) to design the aggregation of layers in very deep neural networks. Moreover, inspired by its advancements in modeling long sequences, the Selective State Space Models (S6) is employed to design a new module called Selective State Space Model Layer Aggregation (S6LA). This module aims to combine traditional CNN or transformer architectures within a sequential framework, enhancing the representational capabilities of state-of-the-art vision networks. Extensive experiments show that S6LA delivers substantial improvements in both image classification and detection tasks, highlighting the potential of integrating SSMs with contemporary deep learning techniques.

None
Riemannian Complex Hermit Positive Definite Convolution Network for Polarimetric SAR Image Classification 2025-02-12
Show

Deep learning can learn high-level semantic features in Euclidean space effectively for PolSAR images, while they need to covert the complex covariance matrix into a feature vector or complex-valued vector as the network input. However, the complex covariance matrices are essentially a complex Hermit positive definite (HPD) matrix endowed in Riemannian manifold rather than Euclidean space. The matrix's real and imagery parts are with the same significance, as the imagery part represents the phase information. The matrix vectorization will destroy the geometric structure and manifold characteristics of complex covariance matrices. To learn complex HPD matrices directly, we propose a Riemannian complex HPD convolution network(HPD_CNN) for PolSAR images. This method consists of a complex HPD unfolding network(HPDnet) and a CV-3DCNN enhanced network. The proposed complex HPDnet defines the HPD mapping, rectifying and the logEig layers to learn geometric features of complex matrices. In addition, a fast eigenvalue decomposition method is designed to reduce computation burden. Finally, a Riemannian-to-Euclidean enhanced network is defined to enhance contextual information for classification. Experimental results on two real PolSSAR datasets demonstrate the proposed method can achieve superior performance than the state-of-the-art methods especially in heterogeneous regions.

9 pages, 4 figures None
Quaternion-Hadamard Network: A Novel Defense Against Adversarial Attacks with a New Dataset 2025-02-12
Show

This paper addresses the vulnerability of deep-learning models designed for rain, snow, and haze removal. Despite enhancing image quality in adverse weather, these models are susceptible to adversarial attacks that compromise their effectiveness. Traditional defenses such as adversarial training and model distillation often require extensive retraining, making them costly and impractical for real-world deployment. While denoising and super-resolution techniques can aid image classification models, they impose high computational demands and introduce visual artifacts that hinder image processing tasks. We propose a model-agnostic defense against first-order white-box adversarial attacks using the Quaternion-Hadamard Network (QHNet) to tackle these challenges. White-box attacks are particularly difficult to defend against since attackers have full access to the model's architecture, weights, and training procedures. Our defense introduces the Quaternion Hadamard Denoising Convolutional Block (QHDCB) and the Quaternion Denoising Residual Block (QDRB), leveraging polynomial thresholding. QHNet incorporates these blocks within an encoder-decoder architecture, enhanced by feature refinement, to effectively neutralize adversarial noise. Additionally, we introduce the Adversarial Weather Conditions Vision Dataset (AWCVD), created by applying first-order gradient attacks on state-of-the-art weather removal techniques in scenarios involving haze, rain streaks, and snow. Using PSNR and SSIM metrics, we demonstrate that QHNet significantly enhances the robustness of low-level computer vision models against adversarial attacks compared with state-of-the-art denoising and super-resolution techniques. The source code and dataset will be released alongside the final version of this paper.

None
More Experts Than Galaxies: Conditionally-overlapping Experts With Biologically-Inspired Fixed Routing 2025-02-11
Show

The evolution of biological neural systems has led to both modularity and sparse coding, which enables energy efficiency and robustness across the diversity of tasks in the lifespan. In contrast, standard neural networks rely on dense, non-specialized architectures, where all model parameters are simultaneously updated to learn multiple tasks, leading to interference. Current sparse neural network approaches aim to alleviate this issue but are hindered by limitations such as 1) trainable gating functions that cause representation collapse, 2) disjoint experts that result in redundant computation and slow learning, and 3) reliance on explicit input or task IDs that limit flexibility and scalability. In this paper we propose Conditionally Overlapping Mixture of ExperTs (COMET), a general deep learning method that addresses these challenges by inducing a modular, sparse architecture with an exponential number of overlapping experts. COMET replaces the trainable gating function used in Sparse Mixture of Experts with a fixed, biologically inspired random projection applied to individual input representations. This design causes the degree of expert overlap to depend on input similarity, so that similar inputs tend to share more parameters. This results in faster learning per update step and improved out-of-sample generalization. We demonstrate the effectiveness of COMET on a range of tasks, including image classification, language modeling, and regression, using several popular deep learning architectures.

Publi...

Published as a conference paper at ICLR 2025

None
ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans 2025-02-11
Show

While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly-stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces double stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments across multiple benchmark datasets, including image classification, point cloud classification, sentiment analysis, and neural machine translation, demonstrate that our enhanced attention regularization consistently improves performance across diverse applications.

None
From Pixels to Components: Eigenvector Masking for Visual Representation Learning 2025-02-11
Show

Predicting masked from visible parts of an image is a powerful self-supervised approach for visual representation learning. However, the common practice of masking random patches of pixels exhibits certain failure modes, which can prevent learning meaningful high-level features, as required for downstream tasks. We propose an alternative masking strategy that operates on a suitable transformation of the data rather than on the raw pixels. Specifically, we perform principal component analysis and then randomly mask a subset of components, which accounts for a fixed ratio of the data variance. The learning task then amounts to reconstructing the masked components from the visible ones. Compared to local patches of pixels, the principal components of images carry more global information. We thus posit that predicting masked from visible components involves more high-level features, allowing our masking strategy to extract more useful representations. This is corroborated by our empirical findings which demonstrate improved image classification performance for component over pixel masking. Our method thus constitutes a simple and robust data-driven alternative to traditional masked image modeling approaches.

Prepr...

Preprint. Under review

None
Efficient Image-to-Image Diffusion Classifier for Adversarial Robustness 2025-02-11
Show

Diffusion models (DMs) have demonstrated great potential in the field of adversarial robustness, where DM-based defense methods can achieve superior defense capability without adversarial training. However, they all require huge computational costs due to the usage of large-scale pre-trained DMs, making it difficult to conduct full evaluation under strong attacks and compare with traditional CNN-based methods. Simply reducing the network size and timesteps in DMs could significantly harm the image generation quality, which invalidates previous frameworks. To alleviate this issue, we redesign the diffusion framework from generating high-quality images to predicting distinguishable image labels. Specifically, we employ an image translation framework to learn many-to-one mapping from input samples to designed orthogonal image labels. Based on this framework, we introduce an efficient Image-to-Image diffusion classifier with a pruned U-Net structure and reduced diffusion timesteps. Besides the framework, we redesign the optimization objective of DMs to fit the target of image classification, where a new classification loss is incorporated in the DM-based image translation framework to distinguish the generated label from those of other classes. We conduct sufficient evaluations of the proposed classifier under various attacks on popular benchmarks. Extensive experiments show that our method achieves better adversarial robustness with fewer computational costs than DM-based and CNN-based methods. The code is available at https://github.com/hfmei/IDC

Code Link
Optimizing Knowledge Distillation in Transformers: Enabling Multi-Head Attention without Alignment Barriers 2025-02-11
Show

Knowledge distillation (KD) in transformers often faces challenges due to misalignment in the number of attention heads between teacher and student models. Existing methods either require identical head counts or introduce projectors to bridge dimensional gaps, limiting flexibility and efficiency. We propose Squeezing-Heads Distillation (SHD), a novel approach that enables seamless knowledge transfer between models with varying head counts by compressing multi-head attention maps via efficient linear approximation. Unlike prior work, SHD eliminates alignment barriers without additional parameters or architectural modifications. Our method dynamically approximates the combined effect of multiple teacher heads into fewer student heads, preserving fine-grained attention patterns while reducing redundancy. Experiments across language (LLaMA, GPT) and vision (DiT, MDT) generative and vision (DeiT) discriminative tasks demonstrate SHD's effectiveness: it outperforms logit-based and feature-alignment KD baselines, achieving state-of-the-art results in image classification, image generation language fine-tuning, and language pre-training. The key innovations of flexible head compression, projector-free design, and linear-time complexity make SHD a versatile and scalable solution for distilling modern transformers. This work bridges a critical gap in KD, enabling efficient deployment of compact models without compromising performance.

None
MoENAS: Mixture-of-Expert based Neural Architecture Search for jointly Accurate, Fair, and Robust Edge Deep Neural Networks 2025-02-11
Show

There has been a surge in optimizing edge Deep Neural Networks (DNNs) for accuracy and efficiency using traditional optimization techniques such as pruning, and more recently, employing automatic design methodologies. However, the focus of these design techniques has often overlooked critical metrics such as fairness, robustness, and generalization. As a result, when evaluating SOTA edge DNNs' performance in image classification using the FACET dataset, we found that they exhibit significant accuracy disparities (14.09%) across 10 different skin tones, alongside issues of non-robustness and poor generalizability. In response to these observations, we introduce Mixture-of-Experts-based Neural Architecture Search (MoENAS), an automatic design technique that navigates through a space of mixture of experts to discover accurate, fair, robust, and general edge DNNs. MoENAS improves the accuracy by 4.02% compared to SOTA edge DNNs and reduces the skin tone accuracy disparities from 14.09% to 5.60%, while enhancing robustness by 3.80% and minimizing overfitting to 0.21%, all while keeping model size close to state-of-the-art models average size (+0.4M). With these improvements, MoENAS establishes a new benchmark for edge DNN design, paving the way for the development of more inclusive and robust edge DNNs.

None
MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification 2025-02-11
Show

Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP. We release our implementations and pre-trained models at this MGPATH.

first version None
Err on the Side of Texture: Texture Bias on Real Data 2025-02-10
Show

Bias significantly undermines both the accuracy and trustworthiness of machine learning models. To date, one of the strongest biases observed in image classification models is texture bias-where models overly rely on texture information rather than shape information. Yet, existing approaches for measuring and mitigating texture bias have not been able to capture how textures impact model robustness in real-world settings. In this work, we introduce the Texture Association Value (TAV), a novel metric that quantifies how strongly models rely on the presence of specific textures when classifying objects. Leveraging TAV, we demonstrate that model accuracy and robustness are heavily influenced by texture. Our results show that texture bias explains the existence of natural adversarial examples, where over 90% of these samples contain textures that are misaligned with the learned texture of their true label, resulting in confident mispredictions.

Accep...

Accepted to IEEE Secure and Trustworthy Machine Learning (SaTML)

None
GIST: Greedy Independent Set Thresholding for Diverse Data Summarization 2025-02-10
Show

We introduce a novel subset selection problem called min-distance diversification with monotone submodular utility ($\textsf{MDMS}$), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal of $\textsf{MDMS}$ is to maximize an objective function combining a monotone submodular utility term and a min-distance diversity term between any pair of selected points, subject to a cardinality constraint. We propose the $\texttt{GIST}$ algorithm, which achieves a $\frac{1}{2}$-approximation guarantee for $\textsf{MDMS}$ by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove that it is NP-hard to approximate to within a factor of $0.5584$. Finally, we demonstrate that $\texttt{GIST}$ outperforms existing benchmarks for on a real-world image classification task that studies single-shot subset selection for ImageNet.

19 pages, 3 figures None
From Image to Video: An Empirical Study of Diffusion Representations 2025-02-10
Show

Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap, we systematically compare the same model architecture trained for video versus image generation, analyzing the performance of their latent representations on various downstream tasks including image classification, action recognition, depth estimation, and tracking. Results show that video diffusion models consistently outperform their image counterparts, though we find a striking range in the extent of this superiority. We further analyze features extracted from different layers and with varying noise levels, as well as the effect of model size and training budget on representation and generation quality. This work marks the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning.

None
Enhancing Performance of Explainable AI Models with Constrained Concept Refinement 2025-02-10
Show

The trade-off between accuracy and interpretability has long been a challenge in machine learning (ML). This tension is particularly significant for emerging interpretable-by-design methods, which aim to redesign ML algorithms for trustworthy interpretability but often sacrifice accuracy in the process. In this paper, we address this gap by investigating the impact of deviations in concept representations-an essential component of interpretable models-on prediction performance and propose a novel framework to mitigate these effects. The framework builds on the principle of optimizing concept embeddings under constraints that preserve interpretability. Using a generative model as a test-bed, we rigorously prove that our algorithm achieves zero loss while progressively enhancing the interpretability of the resulting model. Additionally, we evaluate the practical performance of our proposed framework in generating explainable predictions for image classification tasks across various benchmarks. Compared to existing explainable methods, our approach not only improves prediction accuracy while preserving model interpretability across various large-scale benchmarks but also achieves this with significantly lower computational cost.

None
Krum Federated Chain (KFC): Using blockchain to defend against adversarial attacks in Federated Learning 2025-02-10
Show

Federated Learning presents a nascent approach to machine learning, enabling collaborative model training across decentralized devices while safeguarding data privacy. However, its distributed nature renders it susceptible to adversarial attacks. Integrating blockchain technology with Federated Learning offers a promising avenue to enhance security and integrity. In this paper, we tackle the potential of blockchain in defending Federated Learning against adversarial attacks. First, we test Proof of Federated Learning, a well known consensus mechanism designed ad-hoc to federated contexts, as a defense mechanism demonstrating its efficacy against Byzantine and backdoor attacks when at least one miner remains uncompromised. Second, we propose Krum Federated Chain, a novel defense strategy combining Krum and Proof of Federated Learning, valid to defend against any configuration of Byzantine or backdoor attacks, even when all miners are compromised. Our experiments conducted on image classification datasets validate the effectiveness of our proposed approaches.

Submi...

Submitted to Neural Networks

None
Visual Prompt Engineering for Vision Language Models in Radiology 2025-02-10
Show

Medical image classification plays a crucial role in clinical decision-making, yet most models are constrained to a fixed set of predefined classes, limiting their adaptability to new conditions. Contrastive Language-Image Pretraining (CLIP) offers a promising solution by enabling zero-shot classification through multimodal large-scale pretraining. However, while CLIP effectively captures global image content, radiology requires a more localized focus on specific pathology regions to enhance both interpretability and diagnostic accuracy. To address this, we explore the potential of incorporating visual cues into zero-shot classification, embedding visual markers $\unicode{x2013}$ such as arrows, bounding boxes, and circles $\unicode{x2013}$ directly into radiological images to guide model attention. Evaluating across four public chest X-ray datasets, we demonstrate that visual markers improve AUROC by up to 0.185, highlighting their effectiveness in enhancing classification performance. Furthermore, attention map analysis confirms that visual cues help models focus on clinically relevant areas, leading to more interpretable predictions. To support further research, we use public datasets and will release our code and preprocessing pipeline, providing a reference point for future work on localized classification in medical imaging.

Accep...

Accepted at ECCV 2024 Workshop on Emergent Visual Abilities and Limits of Foundation Models

None
Hybrid State-Space and GRU-based Graph Tokenization Mamba for Hyperspectral Image Classification 2025-02-10
Show

Hyperspectral image (HSI) classification plays a pivotal role in domains such as environmental monitoring, agriculture, and urban planning. However, it faces significant challenges due to the high-dimensional nature of the data and the complex spectral-spatial relationships inherent in HSI. Traditional methods, including conventional machine learning and convolutional neural networks (CNNs), often struggle to effectively capture these intricate spectral-spatial features and global contextual information. Transformer-based models, while powerful in capturing long-range dependencies, often demand substantial computational resources, posing challenges in scenarios where labeled datasets are limited, as is commonly seen in HSI applications. To overcome these challenges, this work proposes GraphMamba, a hybrid model that combines spectral-spatial token generation, graph-based token prioritization, and cross-attention mechanisms. The model introduces a novel hybridization of state-space modeling and Gated Recurrent Units (GRU), capturing both linear and nonlinear spatial-spectral dynamics. GraphMamba enhances the ability to model complex spatial-spectral relationships while maintaining scalability and computational efficiency across diverse HSI datasets. Through comprehensive experiments, we demonstrate that GraphMamba outperforms existing state-of-the-art models, offering a scalable and robust solution for complex HSI classification tasks.

None
Amnesia as a Catalyst for Enhancing Black Box Pixel Attacks in Image Classification and Object Detection 2025-02-10
Show

It is well known that query-based attacks tend to have relatively higher success rates in adversarial black-box attacks. While research on black-box attacks is actively being conducted, relatively few studies have focused on pixel attacks that target only a limited number of pixels. In image classification, query-based pixel attacks often rely on patches, which heavily depend on randomness and neglect the fact that scattered pixels are more suitable for adversarial attacks. Moreover, to the best of our knowledge, query-based pixel attacks have not been explored in the field of object detection. To address these issues, we propose a novel pixel-based black-box attack called Remember and Forget Pixel Attack using Reinforcement Learning(RFPAR), consisting of two main components: the Remember and Forget processes. RFPAR mitigates randomness and avoids patch dependency by leveraging rewards generated through a one-step RL algorithm to perturb pixels. RFPAR effectively creates perturbed images that minimize the confidence scores while adhering to limited pixel constraints. Furthermore, we advance our proposed attack beyond image classification to object detection, where RFPAR reduces the confidence scores of detected objects to avoid detection. Experiments on the ImageNet-1K dataset for classification show that RFPAR outperformed state-of-the-art query-based pixel attacks. For object detection, using the MSCOCO dataset with YOLOv8 and DDQ, RFPAR demonstrates comparable mAP reduction to state-of-the-art query-based attack while requiring fewer query. Further experiments on the Argoverse dataset using YOLOv8 confirm that RFPAR effectively removed objects on a larger scale dataset. Our code is available at https://github.com/KAU-QuantumAILab/RFPAR.

Accep...

Accepted as a poster at NeurIPS 2024

Code Link
Provably Near-Optimal Federated Ensemble Distillation with Negligible Overhead 2025-02-10
Show

Federated ensemble distillation addresses client heterogeneity by generating pseudo-labels for an unlabeled server dataset based on client predictions and training the server model using the pseudo-labeled dataset. The unlabeled server dataset can either be pre-existing or generated through a data-free approach. The effectiveness of this approach critically depends on the method of assigning weights to client predictions when creating pseudo-labels, especially in highly heterogeneous settings. Inspired by theoretical results from GANs, we propose a provably near-optimal weighting method that leverages client discriminators trained with a server-distributed generator and local datasets. Our experiments on various image classification tasks demonstrate that the proposed method significantly outperforms baselines. Furthermore, we show that the additional communication cost, client-side privacy leakage, and client-side computational overhead introduced by our method are negligible, both in scenarios with and without a pre-existing server dataset.

None
Beyond Batch Learning: Global Awareness Enhanced Domain Adaptation 2025-02-10
Show

In domain adaptation (DA), the effectiveness of deep learning-based models is often constrained by batch learning strategies that fail to fully apprehend the global statistical and geometric characteristics of data distributions. Addressing this gap, we introduce 'Global Awareness Enhanced Domain Adaptation' (GAN-DA), a novel approach that transcends traditional batch-based limitations. GAN-DA integrates a unique predefined feature representation (PFR) to facilitate the alignment of cross-domain distributions, thereby achieving a comprehensive global statistical awareness. This representation is innovatively expanded to encompass orthogonal and common feature aspects, which enhances the unification of global manifold structures and refines decision boundaries for more effective DA. Our extensive experiments, encompassing 27 diverse cross-domain image classification tasks, demonstrate GAN-DA's remarkable superiority, outperforming 24 established DA methods by a significant margin. Furthermore, our in-depth analyses shed light on the decision-making processes, revealing insights into the adaptability and efficiency of GAN-DA. This approach not only addresses the limitations of existing DA methodologies but also sets a new benchmark in the realm of domain adaptation, offering broad implications for future research and applications in this field.

None
Multi-Scale Transformer Architecture for Accurate Medical Image Classification 2025-02-10
Show

This study introduces an AI-driven skin lesion classification algorithm built on an enhanced Transformer architecture, addressing the challenges of accuracy and robustness in medical image analysis. By integrating a multi-scale feature fusion mechanism and refining the self-attention process, the model effectively extracts both global and local features, enhancing its ability to detect lesions with ambiguous boundaries and intricate structures. Performance evaluation on the ISIC 2017 dataset demonstrates that the improved Transformer surpasses established AI models, including ResNet50, VGG19, ResNext, and Vision Transformer, across key metrics such as accuracy, AUC, F1-Score, and Precision. Grad-CAM visualizations further highlight the interpretability of the model, showcasing strong alignment between the algorithm's focus areas and actual lesion sites. This research underscores the transformative potential of advanced AI models in medical imaging, paving the way for more accurate and reliable diagnostic tools. Future work will explore the scalability of this approach to broader medical imaging tasks and investigate the integration of multimodal data to enhance AI-driven diagnostic frameworks for intelligent healthcare.

None
Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty 2025-02-10
Show

Recent advances in deep learning rely heavily on massive datasets, leading to substantial storage and training costs. Dataset pruning aims to alleviate this demand by discarding redundant examples. However, many existing methods require training a model with a full dataset over a large number of epochs before being able to prune the dataset, which ironically makes the pruning process more expensive than just training the model on the entire dataset. To overcome this limitation, we introduce a Difficulty and Uncertainty-Aware Lightweight (DUAL) score, which aims to identify important samples from the early training stage by considering both example difficulty and prediction uncertainty. To address a catastrophic accuracy drop at an extreme pruning, we further propose a ratio-adaptive sampling using Beta distribution. Experiments on various datasets and learning scenarios such as image classification with label noise and image corruption, and model architecture generalization demonstrate the superiority of our method over previous state-of-the-art (SOTA) approaches. Specifically, on ImageNet-1k, our method reduces the time cost for pruning to 66% compared to previous methods while achieving a SOTA, specifically 60% test accuracy at a 90% pruning ratio. On CIFAR datasets, the time cost is reduced to just 15% while maintaining SOTA performance.

None
Acquisition through My Eyes and Steps: A Joint Predictive Agent Model in Egocentric Worlds 2025-02-09
Show

This paper addresses the task of learning an agent model behaving like humans, which can jointly perceive, predict, and act in egocentric worlds. Previous methods usually train separate models for these three abilities, leading to information silos among them, which prevents these abilities from learning from each other and collaborating effectively. In this paper, we propose a joint predictive agent model, named EgoAgent, that simultaneously learns to represent the world, predict future states, and take reasonable actions with a single transformer. EgoAgent unifies the representational spaces of the three abilities by mapping them all into a sequence of continuous tokens. Learnable query tokens are appended to obtain current states, future states, and next actions. With joint supervision, our agent model establishes the internal relationship among these three abilities and effectively mimics the human inference and learning processes. Comprehensive evaluations of EgoAgent covering image classification, egocentric future state prediction, and 3D human motion prediction tasks demonstrate the superiority of our method. The code and trained model will be released for reproducibility.

None
Explaining the Unexplained: Revealing Hidden Correlations for Better Interpretability 2025-02-09
Show

Deep learning has achieved remarkable success in processing and managing unstructured data. However, its "black box" nature imposes significant limitations, particularly in sensitive application domains. While existing interpretable machine learning methods address some of these issues, they often fail to adequately consider feature correlations and provide insufficient evaluation of model decision paths. To overcome these challenges, this paper introduces Real Explainer (RealExp), an interpretability computation method that decouples the Shapley Value into individual feature importance and feature correlation importance. By incorporating feature similarity computations, RealExp enhances interpretability by precisely quantifying both individual feature contributions and their interactions, leading to more reliable and nuanced explanations. Additionally, this paper proposes a novel interpretability evaluation criterion focused on elucidating the decision paths of deep learning models, going beyond traditional accuracy-based metrics. Experimental validations on two unstructured data tasks -- image classification and text sentiment analysis -- demonstrate that RealExp significantly outperforms existing methods in interpretability. Case studies further illustrate its practical value: in image classification, RealExp aids in selecting suitable pre-trained models for specific tasks from an interpretability perspective; in text classification, it enables the optimization of models and approximates the performance of a fine-tuned GPT-Ada model using traditional bag-of-words approaches.

This ...

This work has been submitted to the Elsevier for possible publication

None
Disentangling CLIP Features for Enhanced Localized Understanding 2025-02-08
Show

Vision-language models (VLMs) demonstrate impressive capabilities in coarse-grained tasks like image classification and retrieval. However, they struggle with fine-grained tasks that require localized understanding. To investigate this weakness, we comprehensively analyze CLIP features and identify an important issue: semantic features are highly correlated. Specifically, the features of a class encode information about other classes, which we call mutual feature information (MFI). This mutual information becomes evident when we query a specific class and unrelated objects are activated along with the target class. To address this issue, we propose Unmix-CLIP, a novel framework designed to reduce MFI and improve feature disentanglement. We introduce MFI loss, which explicitly separates text features by projecting them into a space where inter-class similarity is minimized. To ensure a corresponding separation in image features, we use multi-label recognition (MLR) to align the image features with the separated text features. This ensures that both image and text features are disentangled and aligned across modalities, improving feature separation for downstream tasks. For the COCO- 14 dataset, Unmix-CLIP reduces feature similarity by 24.9%. We demonstrate its effectiveness through extensive evaluations of MLR and zeroshot semantic segmentation (ZS3). In MLR, our method performs competitively on the VOC2007 and surpasses SOTA approaches on the COCO-14 dataset, using fewer training parameters. Additionally, Unmix-CLIP consistently outperforms existing ZS3 methods on COCO and VOC

None
Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation 2025-02-08
Show

With the recent emergence of foundation models trained on internet-scale data and demonstrating remarkable generalization capabilities, such foundation models have become more widely adopted, leading to an expanding range of application domains. Despite this rapid proliferation, the trustworthiness of foundation models remains underexplored. Specifically, the out-of-distribution detection (OoDD) capabilities of large vision-language models (LVLMs), such as GPT-4o, which are trained on massive multi-modal data, have not been sufficiently addressed. The disparity between their demonstrated potential and practical reliability raises concerns regarding the safe and trustworthy deployment of foundation models. To address this gap, we evaluate and analyze the OoDD capabilities of various proprietary and open-source LVLMs. Our investigation contributes to a better understanding of how these foundation models represent confidence scores through their generated natural language responses. Furthermore, we propose a self-guided prompting approach, termed Reflexive Guidance (ReGuide), aimed at enhancing the OoDD capability of LVLMs by leveraging self-generated image-adaptive concept suggestions. Experimental results demonstrate that our ReGuide enhances the performance of current LVLMs in both image classification and OoDD tasks. The lists of sampled images, along with the prompts and responses for each sample are available at https://github.com/daintlab/ReGuide.

Accep...

Accepted at ICLR 2025. The first two authors contributed equally

Code Link
Data Augmentation Policy Search for Long-Term Forecasting 2025-02-08
Show

Data augmentation serves as a popular regularization technique to combat overfitting challenges in neural networks. While automatic augmentation has demonstrated success in image classification tasks, its application to time-series problems, particularly in long-term forecasting, has received comparatively less attention. To address this gap, we introduce a time-series automatic augmentation approach named TSAA, which is both efficient and easy to implement. The solution involves tackling the associated bilevel optimization problem through a two-step process: initially training a non-augmented model for a limited number of epochs, followed by an iterative split procedure. During this iterative process, we alternate between identifying a robust augmentation policy through Bayesian optimization and refining the model while discarding suboptimal runs. Extensive evaluations on challenging univariate and multivariate forecasting benchmark problems demonstrate that TSAA consistently outperforms several robust baselines, suggesting its potential integration into prediction pipelines. Code is available at this repository: https://github.com/azencot-group/TSAA.

TMLR 2025 Code Link
Semantic Data Augmentation Enhanced Invariant Risk Minimization for Medical Image Domain Generalization 2025-02-08
Show

Deep learning has achieved remarkable success in medical image classification. However, its clinical application is often hindered by data heterogeneity caused by variations in scanner vendors, imaging protocols, and operators. Approaches such as invariant risk minimization (IRM) aim to address this challenge of out-of-distribution generalization. For instance, VIRM improves upon IRM by tackling the issue of insufficient feature support overlap, demonstrating promising potential. Nonetheless, these methods face limitations in medical imaging due to the scarcity of annotated data and the inefficiency of augmentation strategies. To address these issues, we propose a novel domain-oriented direction selector to replace the random augmentation strategy used in VIRM. Our method leverages inter-domain covariance as a guider for augmentation direction, guiding data augmentation towards the target domain. This approach effectively reduces domain discrepancies and enhances generalization performance. Experiments on a multi-center diabetic retinopathy dataset demonstrate that our method outperforms state-of-the-art approaches, particularly under limited data conditions and significant domain heterogeneity.

None
Evaluation of Vision Transformers for Multimodal Image Classification: A Case Study on Brain, Lung, and Kidney Tumors 2025-02-08
Show

Neural networks have become the standard technique for medical diagnostics, especially in cancer detection and classification. This work evaluates the performance of Vision Transformers architectures, including Swin Transformer and MaxViT, in several datasets of magnetic resonance imaging (MRI) and computed tomography (CT) scans. We used three training sets of images with brain, lung, and kidney tumors. Each dataset includes different classification labels, from brain gliomas and meningiomas to benign and malignant lung conditions and kidney anomalies such as cysts and cancers. This work aims to analyze the behavior of the neural networks in each dataset and the benefits of combining different image modalities and tumor classes. We designed several experiments by fine-tuning the models on combined and individual image modalities. The results revealed that the Swin Transformer provided high accuracy, achieving up to 99.9% for kidney tumor classification and 99.3% accuracy in a combined dataset. MaxViT also provided excellent results in individual datasets but performed poorly when data is combined. This research highlights the adaptability of Transformer-based models to various image modalities and features. However, challenges persist, including limited annotated data and interpretability issues. Future works will expand this study by incorporating other image modalities and enhancing diagnostic capabilities. Integrating these models across diverse datasets could mark a pivotal advance in precision medicine, paving the way for more efficient and comprehensive healthcare solutions.

13 pa...

13 pages, 3 figures, 8 tables

None
Convolutional Neural Network Segmentation for Satellite Imagery Data to Identify Landforms Using U-Net Architecture 2025-02-08
Show

This study demonstrates a novel use of the U-Net architecture in the field of semantic segmentation to detect landforms using preprocessed satellite imagery. The study applies the U-Net model for effective feature extraction by using Convolutional Neural Network (CNN) segmentation techniques. Dropout is strategically used for regularization to improve the model's perseverance, and the Adam optimizer is used for effective training. The study thoroughly assesses the performance of the U-Net architecture utilizing a large sample of preprocessed satellite topographical images. The model excels in semantic segmentation tasks, displaying high-resolution outputs, quick feature extraction, and flexibility to a wide range of applications. The findings highlight the U-Net architecture's substantial contribution to the advancement of machine learning and image processing technologies. The U-Net approach, which emphasizes pixel-wise categorization and comprehensive segmentation map production, is helpful in practical applications such as autonomous driving, disaster management, and land use planning. This study not only investigates the complexities of U-Net architecture for semantic segmentation, but also highlights its real-world applications in image classification, analysis, and landform identification. The study demonstrates the U-Net model's key significance in influencing the environment of modern technology.

6th I...

6th International Conference on Computational Intelligence and Pattern Recognition

None
Scalable Framework for Classifying AI-Generated Content Across Modalities 2025-02-08
Show

The rapid growth of generative AI technologies has heightened the importance of effectively distinguishing between human and AI-generated content, as well as classifying outputs from diverse generative models. This paper presents a scalable framework that integrates perceptual hashing, similarity measurement, and pseudo-labeling to address these challenges. Our method enables the incorporation of new generative models without retraining, ensuring adaptability and robustness in dynamic scenarios. Comprehensive evaluations on the Defactify4 dataset demonstrate competitive performance in text and image classification tasks, achieving high accuracy across both distinguishing human and AI-generated content and classifying among generative methods. These results highlight the framework's potential for real-world applications as generative AI continues to evolve. Source codes are publicly available at https://github.com/ffyyytt/defactify4.

Defac...

Defactify4 @ AAAI 2025

Code Link
Interpretable Failure Detection with Human-Level Concepts 2025-02-07
Show

Reliable failure detection holds paramount importance in safety-critical applications. Yet, neural networks are known to produce overconfident predictions for misclassified samples. As a result, it remains a problematic matter as existing confidence score functions rely on category-level signals, the logits, to detect failures. This research introduces an innovative strategy, leveraging human-level concepts for a dual purpose: to reliably detect when a model fails and to transparently interpret why. By integrating a nuanced array of signals for each category, our method enables a finer-grained assessment of the model's confidence. We present a simple yet highly effective approach based on the ordinal ranking of concept activation to the input image. Without bells and whistles, our method significantly reduce the false positive rate across diverse real-world image classification benchmarks, specifically by 3.7% on ImageNet and 9% on EuroSAT.

None
Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration 2025-02-07
Show

Recently computer-aided diagnosis has demonstrated promising performance, effectively alleviating the workload of clinicians. However, the inherent sample imbalance among different diseases leads algorithms biased to the majority categories, leading to poor performance for rare categories. Existing works formulated this challenge as a long-tailed problem and attempted to tackle it by decoupling the feature representation and classification. Yet, due to the imbalanced distribution and limited samples from tail classes, these works are prone to biased representation learning and insufficient classifier calibration. To tackle these problems, we propose a new Long-tailed Medical Diagnosis (LMD) framework for balanced medical image classification on long-tailed datasets. In the initial stage, we develop a Relation-aware Representation Learning (RRL) scheme to boost the representation ability by encouraging the encoder to capture intrinsic semantic features through different data augmentations. In the subsequent stage, we propose an Iterative Classifier Calibration (ICC) scheme to calibrate the classifier iteratively. This is achieved by generating a large number of balanced virtual features and fine-tuning the encoder using an Expectation-Maximization manner. The proposed ICC compensates for minority categories to facilitate unbiased classifier optimization while maintaining the diagnostic knowledge in majority classes. Comprehensive experiments on three public long-tailed medical datasets demonstrate that our LMD framework significantly surpasses state-of-the-art approaches. The source code can be accessed at https://github.com/peterlipan/LMD.

This ...

This work has been accepted in Computers in Biology and Medicine

Code Link
TLXML: Task-Level Explanation of Meta-Learning via Influence Functions 2025-02-07
Show

The scheme of adaptation via meta-learning is seen as an ingredient for solving the problem of data shortage or distribution shift in real-world applications, but it also brings the new risk of inappropriate updates of the model in the user environment, which increases the demand for explainability. Among the various types of XAI methods, establishing a method of explanation based on past experience in meta-learning requires special consideration due to its bi-level structure of training, which has been left unexplored. In this work, we propose influence functions for explaining meta-learning that measure the sensitivities of training tasks to adaptation and inference. We also argue that the approximation of the Hessian using the Gauss-Newton matrix resolves computational barriers peculiar to meta-learning. We demonstrate the adequacy of the method through experiments on task distinction and task distribution distinction using image classification tasks with MAML and Prototypical Network.

22 pa...

22 pages; v2: modification in metadata

None
Training-free Neural Architecture Search through Variance of Knowledge of Deep Network Weights 2025-02-07
Show

Deep learning has revolutionized computer vision, but it achieved its tremendous success using deep network architectures which are mostly hand-crafted and therefore likely suboptimal. Neural Architecture Search (NAS) aims to bridge this gap by following a well-defined optimization paradigm which systematically looks for the best architecture, given objective criterion such as maximal classification accuracy. The main limitation of NAS is however its astronomical computational cost, as it typically requires training each candidate network architecture from scratch. In this paper, we aim to alleviate this limitation by proposing a novel training-free proxy for image classification accuracy based on Fisher Information. The proposed proxy has a strong theoretical background in statistics and it allows estimating expected image classification accuracy of a given deep network without training the network, thus significantly reducing computational cost of standard NAS algorithms. Our training-free proxy achieves state-of-the-art results on three public datasets and in two search spaces, both when evaluated using previously proposed metrics, as well as using a new metric that we propose which we demonstrate is more informative for practical NAS applications. The source code is publicly available at http://www.github.com/ondratybl/VKDNW

Code Link
A Label Propagation Strategy for CutMix in Multi-Label Remote Sensing Image Classification 2025-02-07
Show

The development of supervised deep learning-based methods for multi-label scene classification (MLC) is one of the prominent research directions in remote sensing (RS). Yet, collecting annotations for large RS image archives is time-consuming and costly. To address this issue, several data augmentation methods have been introduced in RS. Among others, the data augmentation technique CutMix, which combines parts of two existing training images to generate an augmented image, stands out as a particularly effective approach. However, the direct application of CutMix in RS MLC can lead to the erasure or addition of class labels (i.e., label noise) in the augmented (i.e., combined) training image. To address this problem, we introduce a label propagation (LP) strategy that allows the effective application of CutMix in the context of MLC problems in RS without being affected by label noise. To this end, our proposed LP strategy exploits pixel-level class positional information to update the multi-label of the augmented training image. We propose to access such class positional information from reference maps associated to each training image (e.g., thematic products) or from class explanation masks provided by an explanation method if no reference maps are available. Similarly to pairing two training images, our LP strategy carries out a pairing operation on the associated pixel-level class positional information to derive the updated multi-label for the augmented image. Experimental results show the effectiveness of our LP strategy in general and its robustness in the case of various simulated and real scenarios with noisy class positional information in particular.

None
A Multi-Scale Feature Fusion Framework Integrating Frequency Domain and Cross-View Attention for Dual-View X-ray Security Inspections 2025-02-07
Show

With the rapid development of modern transportation systems and the exponential growth of logistics volumes, intelligent X-ray-based security inspection systems play a crucial role in public safety. Although single-view X-ray equipment is widely deployed, it struggles to accurately identify contraband in complex stacking scenarios due to strong viewpoint dependency and inadequate feature representation. To address this, we propose an innovative multi-scale interactive feature fusion framework tailored for dual-view X-ray security inspection image classification. The framework comprises three core modules: the Frequency Domain Interaction Module (FDIM) enhances frequency-domain features through Fourier transform; the Multi-Scale Cross-View Feature Enhancement (MSCFE) leverages cross-view attention mechanisms to strengthen feature interactions; and the Convolutional Attention Fusion Module (CAFM) efficiently fuses features by integrating channel attention with depthwise-separable convolutions. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches across multiple backbone architectures, particularly excelling in complex scenarios with occlusions and object stacking.

None
A Bayesian Approach to OOD Robustness in Image Classification 2025-02-07
Show

An important and unsolved problem in computer vision is to ensure that the algorithms are robust to changes in image domains. We address this problem in the scenario where we have access to images from the target domains but no annotations. Motivated by the challenges of the OOD-CV benchmark where we encounter real world Out-of-Domain (OOD) nuisances and occlusion, we introduce a novel Bayesian approach to OOD robustness for object classification. Our work extends Compositional Neural Networks (CompNets), which have been shown to be robust to occlusion but degrade badly when tested on OOD data. We exploit the fact that CompNets contain a generative head defined over feature vectors represented by von Mises-Fisher (vMF) kernels, which correspond roughly to object parts, and can be learned without supervision. We obverse that some vMF kernels are similar between different domains, while others are not. This enables us to learn a transitional dictionary of vMF kernels that are intermediate between the source and target domains and train the generative model on this dictionary using the annotations on the source domain, followed by iterative refinement. This approach, termed Unsupervised Generative Transition (UGT), performs very well in OOD scenarios even when occlusion is present. UGT is evaluated on different OOD benchmarks including the OOD-CV dataset, several popular datasets (e.g., ImageNet-C [9]), artificial image corruptions (including adding occluders), and synthetic-to-real domain transfer, and does well in all scenarios outperforming SOTA alternatives (e.g. up to 10% top-1 accuracy on Occluded OOD-CV dataset).

None
Enhancing Multimodal Medical Image Classification using Cross-Graph Modal Contrastive Learning 2025-02-07
Show

The classification of medical images is a pivotal aspect of disease diagnosis, often enhanced by deep learning techniques. However, traditional approaches typically focus on unimodal medical image data, neglecting the integration of diverse non-image patient data. This paper proposes a novel Cross-Graph Modal Contrastive Learning (CGMCL) framework for multimodal structured data from different data domains to improve medical image classification. The model effectively integrates both image and non-image data by constructing cross-modality graphs and leveraging contrastive learning to align multimodal features in a shared latent space. An inter-modality feature scaling module further optimizes the representation learning process by reducing the gap between heterogeneous modalities. The proposed approach is evaluated on two datasets: a Parkinson's disease (PD) dataset and a public melanoma dataset. Results demonstrate that CGMCL outperforms conventional unimodal methods in accuracy, interpretability, and early disease prediction. Additionally, the method shows superior performance in multi-class melanoma classification. The CGMCL framework provides valuable insights into medical image classification while offering improved disease interpretability and predictive capabilities.

None
SeaFormer++: Squeeze-enhanced Axial Transformer for Mobile Visual Recognition 2025-02-07
Show

Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), which has been overwhelmingly dominated by CNNs, recently has significantly revolutionized. However, the computational cost and memory requirement renders these methods unsuitable on the mobile device. In this paper, we introduce a new method squeeze-enhanced Axial Transformer (SeaFormer) for mobile visual recognition. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial and detail enhancement. It can be further used to create a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we achieve the best trade-off between segmentation accuracy and latency on the ARM-based mobile devices on the ADE20K, Cityscapes, Pascal Context and COCO-Stuff datasets. Critically, we beat both the mobilefriendly rivals and Transformer-based counterparts with better performance and lower latency without bells and whistles. Furthermore, we incorporate a feature upsampling-based multi-resolution distillation technique, further reducing the inference latency of the proposed framework. Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to image classification and object detection problems, demonstrating the potential of serving as a versatile mobile-friendly backbone. Our code and models are made publicly available at https://github.com/fudan-zvg/SeaFormer.

V4 is...

V4 is the ICLR 2023 conference version, and V6 is the IJCV 2025 version

Code Link
AIQViT: Architecture-Informed Post-Training Quantization for Vision Transformers 2025-02-07
Show

Post-training quantization (PTQ) has emerged as a promising solution for reducing the storage and computational cost of vision transformers (ViTs). Recent advances primarily target at crafting quantizers to deal with peculiar activations characterized by ViTs. However, most existing methods underestimate the information loss incurred by weight quantization, resulting in significant performance deterioration, particularly in low-bit cases. Furthermore, a common practice in quantizing post-Softmax activations of ViTs is to employ logarithmic transformations, which unfortunately prioritize less informative values around zero. This approach introduces additional redundancies, ultimately leading to suboptimal quantization efficacy. To handle these, this paper proposes an innovative PTQ method tailored for ViTs, termed AIQViT (Architecture-Informed Post-training Quantization for ViTs). First, we design an architecture-informed low rank compensation mechanism, wherein learnable low-rank weights are introduced to compensate for the degradation caused by weight quantization. Second, we design a dynamic focusing quantizer to accommodate the unbalanced distribution of post-Softmax activations, which dynamically selects the most valuable interval for higher quantization resolution. Extensive experiments on five vision tasks, including image classification, object detection, instance segmentation, point cloud classification, and point cloud part segmentation, demonstrate the superiority of AIQViT over state-of-the-art PTQ methods.

None
Augmented Conditioning Is Enough For Effective Training Image Generation 2025-02-06
Show

Image generation abilities of text-to-image diffusion models have significantly advanced, yielding highly photo-realistic images from descriptive text and increasing the viability of leveraging synthetic images to train computer vision models. To serve as effective training data, generated images must be highly realistic while also sufficiently diverse within the support of the target data distribution. Yet, state-of-the-art conditional image generation models have been primarily optimized for creative applications, prioritizing image realism and prompt adherence over conditional diversity. In this paper, we investigate how to improve the diversity of generated images with the goal of increasing their effectiveness to train downstream image classification models, without fine-tuning the image generation model. We find that conditioning the generation process on an augmented real image and text prompt produces generations that serve as effective synthetic datasets for downstream training. Conditioning on real training images contextualizes the generation process to produce images that are in-domain with the real image distribution, while data augmentations introduce visual diversity that improves the performance of the downstream classifier. We validate augmentation-conditioning on a total of five established long-tail and few-shot image classification benchmarks and show that leveraging augmentations to condition the generation process results in consistent improvements over the state-of-the-art on the long-tailed benchmark and remarkable gains in extreme few-shot regimes of the remaining four benchmarks. These results constitute an important step towards effectively leveraging synthetic data for downstream training.

None
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion 2025-02-06
Show

Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks like image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss that does not enforce any intra-modal constraints, leading to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additional trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance with respect to intra-modal baselines on more than fifteen datasets. Additionally, we demonstrate that approaching a native inter-modal task (e.g. zero-shot image classification) intra-modally decreases performance, further validating our findings. Finally, we show that incorporating an intra-modal term in the pre-training objective or narrowing the modality gap between the text and image feature embedding spaces helps reduce the intra-modal misalignment. The code is publicly available at: https://github.com/miccunifi/Cross-the-Gap.

Accep...

Accepted for publication at ICLR 2025

Code Link
Expanding Training Data for Endoscopic Phenotyping of Eosinophilic Esophagitis 2025-02-06
Show

Eosinophilic esophagitis (EoE) is a chronic esophageal disorder marked by eosinophil-dominated inflammation. Diagnosing EoE usually involves endoscopic inspection of the esophageal mucosa and obtaining esophageal biopsies for histologic confirmation. Recent advances have seen AI-assisted endoscopic imaging, guided by the EREFS system, emerge as a potential alternative to reduce reliance on invasive histological assessments. Despite these advancements, significant challenges persist due to the limited availability of data for training AI models - a common issue even in the development of AI for more prevalent diseases. This study seeks to improve the performance of deep learning-based EoE phenotype classification by augmenting our training data with a diverse set of images from online platforms, public datasets, and electronic textbooks increasing our dataset from 435 to 7050 images. We utilized the Data-efficient Image Transformer for image classification and incorporated attention map visualizations to boost interpretability. The findings show that our expanded dataset and model enhancements improved diagnostic accuracy, robustness, and comprehensive analysis, enhancing patient outcomes.

None
Conditional Diffusion Models are Medical Image Classifiers that Provide Explainability and Uncertainty for Free 2025-02-06
Show

Discriminative classifiers have become a foundational tool in deep learning for medical imaging, excelling at learning separable features of complex data distributions. However, these models often need careful design, augmentation, and training techniques to ensure safe and reliable deployment. Recently, diffusion models have become synonymous with generative modeling in 2D. These models showcase robustness across a range of tasks including natural image classification, where classification is performed by comparing reconstruction errors across images generated for each possible conditioning input. This work presents the first exploration of the potential of class conditional diffusion models for 2D medical image classification. First, we develop a novel majority voting scheme shown to improve the performance of medical diffusion classifiers. Next, extensive experiments on the CheXpert and ISIC Melanoma skin cancer datasets demonstrate that foundation and trained-from-scratch diffusion models achieve competitive performance against SOTA discriminative classifiers without the need for explicit supervision. In addition, we show that diffusion classifiers are intrinsically explainable, and can be used to quantify the uncertainty of their predictions, increasing their trustworthiness and reliability in safety-critical, clinical contexts. Further information is available on our project page: https://faverogian.github.io/med-diffusion-classifier.github.io/

Code Link
A Study in Dataset Distillation for Image Super-Resolution 2025-02-05
Show

Dataset distillation is the concept of condensing large datasets into smaller but highly representative synthetic samples. While previous research has primarily focused on image classification, its application to image Super-Resolution (SR) remains underexplored. This exploratory work studies multiple dataset distillation techniques applied to SR, including pixel- and latent-space approaches under different aspects. Our experiments demonstrate that a 91.12% dataset size reduction can be achieved while maintaining comparable SR performance to the full dataset. We further analyze initialization strategies and distillation methods to optimize memory efficiency and computational costs. Our findings provide new insights into dataset distillation for SR and set the stage for future advancements.

None
Gompertz Linear Units: Leveraging Asymmetry for Enhanced Learning Dynamics 2025-02-05
Show

Activation functions are fundamental elements of deep learning architectures as they significantly influence training dynamics. ReLU, while widely used, is prone to the dying neuron problem, which has been mitigated by variants such as LeakyReLU, PReLU, and ELU that better handle negative neuron outputs. Recently, self-gated activations like GELU and Swish have emerged as state-of-the-art alternatives, leveraging their smoothness to ensure stable gradient flow and prevent neuron inactivity. In this work, we introduce the Gompertz Linear Unit (GoLU), a novel self-gated activation function defined as $\mathrm{GoLU}(x) = x , \mathrm{Gompertz}(x)$, where $\mathrm{Gompertz}(x) = e^{-e^{-x}}$. The GoLU activation leverages the asymmetry in the Gompertz function to reduce variance in the latent space more effectively compared to GELU and Swish, while preserving robust gradient flow. Extensive experiments across diverse tasks, including Image Classification, Language Modeling, Semantic Segmentation, Object Detection, Instance Segmentation, and Diffusion, highlight GoLU's superior performance relative to state-of-the-art activation functions, establishing GoLU as a robust alternative to existing activation functions.

8 pag...

8 pages, excluding references and appendix

None
Control-oriented Clustering of Visual Latent Representation 2025-02-05
Show

We initiate a study of the geometry of the visual representation space -- the information channel from the vision encoder to the action decoder -- in an image-based control pipeline learned from behavior cloning. Inspired by the phenomenon of neural collapse (NC) in image classification (arXiv:2008.08186), we empirically demonstrate the prevalent emergence of a similar law of clustering in the visual representation space. Specifically, in discrete image-based control (e.g., Lunar Lander), the visual representations cluster according to the natural discrete action labels; in continuous image-based control (e.g., Planar Pushing and Block Stacking), the clustering emerges according to "control-oriented" classes that are based on (a) the relative pose between the object and the target in the input or (b) the relative pose of the object induced by expert actions in the output. Each of the classes corresponds to one relative pose orthant (REPO). Beyond empirical observation, we show such a law of clustering can be leveraged as an algorithmic tool to improve test-time performance when training a policy with limited expert demonstrations. Particularly, we pretrain the vision encoder using NC as a regularization to encourage control-oriented clustering of the visual features. Surprisingly, such an NC-pretrained vision encoder, when finetuned end-to-end with the action decoder, boosts the test-time performance by 10% to 35%. Real-world vision-based planar pushing experiments confirmed the surprising advantage of control-oriented visual representation pretraining.

Websi...

Website: https://computationalrobotics.seas.harvard.edu/ControlOriented_NC

None
Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function 2025-02-05
Show

In this work, we present a novel approach to multi-label chest X-ray (CXR) image classification that enhances clinical interpretability while maintaining a streamlined, single-model, single-run training pipeline. Leveraging the CheXpert dataset and VisualCheXbert-derived labels, we incorporate hierarchical label groupings to capture clinically meaningful relationships between diagnoses. To achieve this, we designed a custom hierarchical binary cross-entropy (HBCE) loss function that enforces label dependencies using either fixed or data-driven penalty types. Our model achieved a mean area under the receiver operating characteristic curve (AUROC) of 0.903 on the test set. Additionally, we provide visual explanations and uncertainty estimations to further enhance model interpretability. All code, model configurations, and experiment details are made available.

9 pag...

9 pages with 3 figures, for associated implementation see https://github.com/the-mercury/CIHMLC

Code Link
An Optimized Toolbox for Advanced Image Processing with Tsetlin Machine Composites 2025-02-05
Show

The Tsetlin Machine (TM) has achieved competitive results on several image classification benchmarks, including MNIST, K-MNIST, F-MNIST, and CIFAR-2. However, color image classification is arguably still in its infancy for TMs, with CIFAR-10 being a focal point for tracking progress. Over the past few years, TM's CIFAR-10 accuracy has increased from around 61% in 2020 to 75.1% in 2023 with the introduction of Drop Clause. In this paper, we leverage the recently proposed TM Composites architecture and introduce a range of TM Specialists that use various image processing techniques. These include Canny edge detection, Histogram of Oriented Gradients, adaptive mean thresholding, adaptive Gaussian thresholding, Otsu's thresholding, color thermometers, and adaptive color thermometers. In addition, we conduct a rigorous hyperparameter search, where we uncover optimal hyperparameters for several of the TM Specialists. The result is a toolbox that provides new state-of-the-art results on CIFAR-10 for TMs with an accuracy of 82.8%. In conclusion, our toolbox of TM Specialists forms a foundation for new TM applications and a landmark for further research on TM Composites in image analysis.

8 pages, 3 figures None
Learning with SASQuaTCh: a Novel Variational Quantum Transformer Architecture with Kernel-Based Self-Attention 2025-02-05
Show

The recent exploding growth in size of state-of-the-art machine learning models highlights a well-known issue where exponential parameter growth, which has grown to trillions as in the case of the Generative Pre-trained Transformer (GPT), leads to training time and memory requirements which limit their advancement in the near term. The predominant models use the so-called transformer network and have a large field of applicability, including predicting text and images, classification, and even predicting solutions to the dynamics of physical systems. Here we present a variational quantum circuit architecture named Self-Attention Sequential Quantum Transformer Channel (SASQuaTCh), which builds networks of qubits that perform analogous operations of the transformer network, namely the keystone self-attention operation, and leads to an exponential improvement in parameter complexity and run-time complexity over its classical counterpart. Our approach leverages recent insights from kernel-based operator learning in the context of predicting spatiotemporal systems to represent deep layers of a vision transformer network using simple gate operations and a set of multi-dimensional quantum Fourier transforms. To validate our approach, we consider image classification tasks in simulation and with hardware, where with only 9 qubits and a handful of parameters we are able to simultaneously embed and classify a grayscale image of handwritten digits with high accuracy.

12 pages, 8 figures None
Optimal Task Order for Continual Learning of Multiple Tasks 2025-02-05
Show

Continual learning of multiple tasks remains a major challenge for neural networks. Here, we investigate how task order influences continual learning and propose a strategy for optimizing it. Leveraging a linear teacher-student model with latent factors, we derive an analytical expression relating task similarity and ordering to learning performance. Our analysis reveals two principles that hold under a wide parameter range: (1) tasks should be arranged from the least representative to the most typical, and (2) adjacent tasks should be dissimilar. We validate these rules on both synthetic data and real-world image classification datasets (Fashion-MNIST, CIFAR-10, CIFAR-100), demonstrating consistent performance improvements in both multilayer perceptrons and convolutional neural networks. Our work thus presents a generalizable framework for task-order optimization in task-incremental continual learning.

None
DILLEMA: Diffusion and Large Language Models for Multi-Modal Augmentation 2025-02-05
Show

Ensuring the robustness of deep learning models requires comprehensive and diverse testing. Existing approaches, often based on simple data augmentation techniques or generative adversarial networks, are limited in producing realistic and varied test cases. To address these limitations, we present a novel framework for testing vision neural networks that leverages Large Language Models and control-conditioned Diffusion Models to generate synthetic, high-fidelity test cases. Our approach begins by translating images into detailed textual descriptions using a captioning model, allowing the language model to identify modifiable aspects of the image and generate counterfactual descriptions. These descriptions are then used to produce new test images through a text-to-image diffusion process that preserves spatial consistency and maintains the critical elements of the scene. We demonstrate the effectiveness of our method using two datasets: ImageNet1K for image classification and SHIFT for semantic segmentation in autonomous driving. The results show that our approach can generate significant test cases that reveal weaknesses and improve the robustness of the model through targeted retraining. We conducted a human assessment using Mechanical Turk to validate the generated images. The responses from the participants confirmed, with high agreement among the voters, that our approach produces valid and realistic images.

None
Adversarial Dependence Minimization 2025-02-05
Show

Many machine learning techniques rely on minimizing the covariance between output feature dimensions to extract minimally redundant representations from data. However, these methods do not eliminate all dependencies/redundancies, as linearly uncorrelated variables can still exhibit nonlinear relationships. This work provides a differentiable and scalable algorithm for dependence minimization that goes beyond linear pairwise decorrelation. Our method employs an adversarial game where small networks identify dependencies among feature dimensions, while the encoder exploits this information to reduce dependencies. We provide empirical evidence of the algorithm's convergence and demonstrate its utility in three applications: extending PCA to nonlinear decorrelation, improving the generalization of image classification methods, and preventing dimensional collapse in self-supervised representation learning.

None
Hybrid Deep Learning Framework for Classification of Kidney CT Images: Diagnosis of Stones, Cysts, and Tumors 2025-02-05
Show

Medical image classification is a vital research area that utilizes advanced computational techniques to improve disease diagnosis and treatment planning. Deep learning models, especially Convolutional Neural Networks (CNNs), have transformed this field by providing automated and precise analysis of complex medical images. This study introduces a hybrid deep learning model that integrates a pre-trained ResNet101 with a custom CNN to classify kidney CT images into four categories: normal, stone, cyst, and tumor. The proposed model leverages feature fusion to enhance classification accuracy, achieving 99.73% training accuracy and 100% testing accuracy. Using a dataset of 12,446 CT images and advanced feature mapping techniques, the hybrid CNN model outperforms standalone ResNet101. This architecture delivers a robust and efficient solution for automated kidney disease diagnosis, providing improved precision, recall, and reduced testing time, making it highly suitable for clinical applications.

None
Slowing Learning by Erasing Simple Features 2025-02-05
Show

Prior work suggests that neural networks tend to learn low-order moments of the data distribution first, before moving on to higher-order correlations. In this work, we derive a novel closed-form concept erasure method, QLEACE, which surgically removes all quadratically available information about a concept from a representation. Through comparisons with linear erasure (LEACE) and two approximate forms of quadratic erasure, we explore whether networks can still learn when low-order statistics are removed from image classification datasets. We find that while LEACE consistently slows learning, quadratic erasure can exhibit both positive and negative effects on learning speed depending on the choice of dataset, model architecture, and erasure method. Use of QLEACE consistently slows learning in feedforward architectures, but more sophisticated architectures learn to use injected higher order Shannon information about class labels. Its approximate variants avoid injecting information, but surprisingly act as data augmentation techniques on some datasets, enhancing learning speed compared to LEACE.

None
Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training 2025-02-04
Show

In a neural network with ReLU activations, the number of piecewise linear regions in the output can grow exponentially with depth. However, this is highly unlikely to happen when the initial parameters are sampled randomly, which therefore often leads to the use of networks that are unnecessarily large. To address this problem, we introduce a novel parameterization of the network that restricts its weights so that a depth $d$ network produces exactly $2^d$ linear regions at initialization and maintains those regions throughout training under the parameterization. This approach allows us to learn approximations of convex, one dimensional functions that are several orders of magnitude more accurate than their randomly initialized counterparts. We further demonstrate how to extend our approach to multidimensional and non-convex functions, allowing it to replace the dense layers in other networks; preliminary improvements are shown for image classification on CIFAR-10 and ImageNet.

24 pages, 16 figures None
The Skin Game: Revolutionizing Standards for AI Dermatology Model Comparison 2025-02-04
Show

Deep Learning approaches in dermatological image classification have shown promising results, yet the field faces significant methodological challenges that impede proper evaluation. This paper presents a dual contribution: first, a systematic analysis of current methodological practices in skin disease classification research, revealing substantial inconsistencies in data preparation, augmentation strategies, and performance reporting; second, a comprehensive training and evaluation framework demonstrated through experiments with the DINOv2-Large vision transformer across three benchmark datasets (HAM10000, DermNet, ISIC Atlas). The analysis identifies concerning patterns, including pre-split data augmentation and validation-based reporting, potentially leading to overestimated metrics, while highlighting the lack of unified methodology standards. The experimental results demonstrate DINOv2's performance in skin disease classification, achieving macro-averaged F1-scores of 0.85 (HAM10000), 0.71 (DermNet), and 0.84 (ISIC Atlas). Attention map analysis reveals critical patterns in the model's decision-making, showing sophisticated feature recognition in typical presentations but significant vulnerabilities with atypical cases and composite images. Our findings highlight the need for standardized evaluation protocols and careful implementation strategies in clinical settings. We propose comprehensive methodological recommendations for model development, evaluation, and clinical deployment, emphasizing rigorous data preparation, systematic error analysis, and specialized protocols for different image types. To promote reproducibility, we provide our implementation code through GitHub. This work establishes a foundation for rigorous evaluation standards in dermatological image classification and provides insights for responsible AI implementation in clinical dermatology.

60 pages, 69 figures None
CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models 2025-02-04
Show

Deep learning models for medical image classification tasks are becoming widely implemented in AI-assisted diagnostic tools, aiming to enhance diagnostic accuracy, reduce clinician workloads, and improve patient outcomes. However, their vulnerability to adversarial attacks poses significant risks to patient safety. Current attack methodologies use general techniques such as model querying or pixel value perturbations to generate adversarial examples designed to fool a model. These approaches may not adequately address the unique characteristics of clinical errors stemming from missed or incorrectly identified clinical features. We propose the Concept-based Report Perturbation Attack (CoRPA), a clinically-focused black-box adversarial attack framework tailored to the medical imaging domain. CoRPA leverages clinical concepts to generate adversarial radiological reports and images that closely mirror realistic clinical misdiagnosis scenarios. We demonstrate the utility of CoRPA using the MIMIC-CXR-JPG dataset of chest X-rays and radiological reports. Our evaluation reveals that deep learning models exhibiting strong resilience to conventional adversarial attacks are significantly less robust when subjected to CoRPA's clinically-focused perturbations. This underscores the importance of addressing domain-specific vulnerabilities in medical AI systems. By introducing a specialized adversarial attack framework, this study provides a foundation for developing robust, real-world-ready AI models in healthcare, ensuring their safe and reliable deployment in high-stakes clinical environments.

12 pages, 5 figures None
Coherence Awareness in Diffractive Neural Networks 2025-02-04
Show

Diffractive neural networks hold great promise for applications requiring intensive computational processing. Considerable attention has focused on diffractive networks for either spatially coherent or spatially incoherent illumination. Here we illustrate that, as opposed to imaging systems, in diffractive networks the degree of spatial coherence has a dramatic effect. In particular, we show that when the spatial coherence length on the object is comparable to the minimal feature size preserved by the optical system, neither the incoherent nor the coherent extremes serve as acceptable approximations. Importantly, this situation is inherent to many settings involving active illumination, including reflected light microscopy, autonomous vehicles and smartphones. Following this observation, we propose a general framework for training diffractive networks for any specified degree of spatial and temporal coherence, supporting all types of linear and nonlinear layers. Using our method, we numerically optimize networks for image classification, and thoroughly investigate their performance dependence on the illumination coherence properties. We further introduce the concept of coherence-blind networks, which have enhanced resilience to changes in illumination conditions. Our findings serve as a steppingstone toward adopting all-optical neural networks in real-world applications, leveraging nothing but natural light.

Proje...

Project's code https://github.com/matankleiner/Coherence-Awareness-in-Diffractive-Neural-Networks

Code Link
BRIDLE: Generalized Self-supervised Learning with Quantization 2025-02-04
Show

Self-supervised learning has been a powerful approach for learning meaningful representations from unlabeled data across various domains, reducing the reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face challenges, notably their dependence on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, inefficiencies in codebook utilization lead to underutilized code vectors. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process, and is generalized for pretraining with audio, image, and video. Using multiple hierarchical codebooks, RQ enables fine-grained discretization in the latent space, enhancing representation quality. BRIDLE involves an interleaved training procedure between the encoder and tokenizer. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, showing consistent improvements over traditional VQ methods in downstream performance.

None
Model Input-Output Configuration Search with Embedded Feature Selection for Sensor Time-series and Image Classification 2025-02-04
Show

Machine learning is a powerful tool for extracting valuable information and making various predictions from diverse datasets. Traditional machine learning algorithms rely on well-defined input and output variables; however, there are scenarios where the separation between the input and output variables and the underlying, associated input and output layers of the model are unknown. Feature Selection (FS) and Neural Architecture Search (NAS) have emerged as promising solutions in such scenarios. This paper proposes MICS-EFS, a Model Input-Output Configuration Search with Embedded Feature Selection. The methodology explores internal dependencies in the complete input parameter space for classification tasks involving both 1D sensor time-series and 2D image data. MICS-EFS employs a modified encoder-decoder model and the Sequential Forward Search (SFS) algorithm, combining input-output configuration search with embedded feature selection. Experimental results demonstrate the superior performance of MICS-EFS compared to other FS algorithms. Across all tested datasets, MICS-EFS delivered an average accuracy improvement of 1.5% over baseline models, with the accuracy gains ranging from 0.5% to 5.9%. Moreover, the algorithm reduced feature dimensionality to just 2-5% of the original data, significantly enhancing computational efficiency. These results highlight the potential of MICS-EFS to improve model accuracy and efficiency in various machine learning tasks. Furthermore, the proposed method has been validated in a real-world industrial application focused on machining processes, underscoring its effectiveness and practicality in addressing complex input-output challenges.

19 pa...

19 pages, 19 figures + appendix, the related software code can be found under the link: https://github.com/viharoszsolt/IDENAS

Code Link
PaRCE: Probabilistic and Reconstruction-based Competency Estimation for CNN-based Image Classification 2025-02-04
Show

Convolutional neural networks (CNNs) are extremely popular and effective for image classification tasks but tend to be overly confident in their predictions. Various works have sought to quantify uncertainty associated with these models, detect out-of-distribution (OOD) inputs, or identify anomalous regions in an image, but limited work has sought to develop a holistic approach that can accurately estimate perception model confidence across various sources of uncertainty. We develop a probabilistic and reconstruction-based competency estimation (PaRCE) method and compare it to existing approaches for uncertainty quantification and OOD detection. We find that our method can best distinguish between correctly classified, misclassified, and OOD samples with anomalous regions, as well as between samples with visual image modifications resulting in high, medium, and low prediction accuracy. We describe how to extend our approach for anomaly localization tasks and demonstrate the ability of our approach to distinguish between regions in an image that are familiar to the perception model from those that are unfamiliar. We find that our method generates interpretable scores that most reliably capture a holistic notion of perception model confidence.

arXiv...

arXiv admin note: text overlap with arXiv:2409.06111

None
DCT-Mamba3D: Spectral Decorrelation and Spatial-Spectral Feature Extraction for Hyperspectral Image Classification 2025-02-04
Show

Hyperspectral image classification presents challenges due to spectral redundancy and complex spatial-spectral dependencies. This paper proposes a novel framework, DCT-Mamba3D, for hyperspectral image classification. DCT-Mamba3D incorporates: (1) a 3D spectral-spatial decorrelation module that applies 3D discrete cosine transform basis functions to reduce both spectral and spatial redundancy, enhancing feature clarity across dimensions; (2) a 3D-Mamba module that leverages a bidirectional state-space model to capture intricate spatial-spectral dependencies; and (3) a global residual enhancement module that stabilizes feature representation, improving robustness and convergence. Extensive experiments on benchmark datasets show that our DCT-Mamba3D outperforms the state-of-the-art methods in challenging scenarios such as the same object in different spectra and different objects in the same spectra.

None
Generative Data Mining with Longtail-Guided Diffusion 2025-02-04
Show

It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on image classification benchmarks, and can be analyzed to proactively discover, explain, and address conceptual gaps in a predictive model.

20 pages None
Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization 2025-02-03
Show

The graduated optimization approach is a method for finding global optimal solutions for nonconvex functions by using a function smoothing operation with stochastic noise. We show that stochastic noise in stochastic gradient descent (SGD) has the effect of smoothing the objective function, the degree of which is determined by the learning rate, batch size, and variance of the stochastic gradient. Using this finding, we propose and analyze a new graduated optimization algorithm that varies the degree of smoothing by varying the learning rate and batch size, and provide experimental results on image classification tasks with ResNets that support our theoretical findings. We further show that there is an interesting relationship between the degree of smoothing by SGD's stochastic noise, the well-studied ``sharpness'' indicator, and the generalization performance of the model.

The l...

The latest version was updated in February 2025. Under review

None
Activation by Interval-wise Dropout: A Simple Way to Prevent Neural Networks from Plasticity Loss 2025-02-03
Show

Plasticity loss, a critical challenge in neural network training, limits a model's ability to adapt to new tasks or shifts in data distribution. This paper introduces AID (Activation by Interval-wise Dropout), a novel method inspired by Dropout, designed to address plasticity loss. Unlike Dropout, AID generates subnetworks by applying Dropout with different probabilities on each preactivation interval. Theoretical analysis reveals that AID regularizes the network, promoting behavior analogous to that of deep linear networks, which do not suffer from plasticity loss. We validate the effectiveness of AID in maintaining plasticity across various benchmarks, including continual learning tasks on standard image classification datasets such as CIFAR10, CIFAR100, and TinyImageNet. Furthermore, we show that AID enhances reinforcement learning performance in the Arcade Learning Environment benchmark.

None
Jacobian Descent for Multi-Objective Optimization 2025-02-03
Show

Many optimization problems require balancing multiple conflicting objectives. As gradient descent is limited to single-objective optimization, we introduce its direct generalization: Jacobian descent (JD). This algorithm iteratively updates parameters using the Jacobian matrix of a vector-valued objective function, in which each row is the gradient of an individual objective. While several methods to combine gradients already exist in the literature, they are generally hindered when the objectives conflict. In contrast, we propose projecting gradients to fully resolve conflict while ensuring that they preserve an influence proportional to their norm. We prove significantly stronger convergence guarantees with this approach, supported by our empirical results. Our method also enables instance-wise risk minimization (IWRM), a novel learning paradigm in which the loss of each training example is considered a separate objective. Applied to simple image classification tasks, IWRM exhibits promising results compared to the direct minimization of the average loss. Additionally, we outline an efficient implementation of JD using the Gramian of the Jacobian matrix to reduce time and memory requirements.

39 pa...

39 pages, 10 figures, conference

None
A Framework for Double-Blind Federated Adaptation of Foundation Models 2025-02-03
Show

The availability of foundational models (FMs) pre-trained on large-scale data has advanced the state-of-the-art in many computer vision tasks. While FMs have demonstrated good zero-shot performance on many image classification tasks, there is often scope for performance improvement by adapting the FM to the downstream task. However, the data that is required for this adaptation typically exists in silos across multiple entities (data owners) and cannot be collated at a central location due to regulations and privacy concerns. At the same time, a learning service provider (LSP) who owns the FM cannot share the model with the data owners due to proprietary reasons. In some cases, the data owners may not even have the resources to store such large FMs. Hence, there is a need for algorithms to adapt the FM in a double-blind federated manner, i.e., the data owners do not know the FM or each other's data, and the LSP does not see the data for the downstream tasks. In this work, we propose a framework for double-blind federated adaptation of FMs using fully homomorphic encryption (FHE). The proposed framework first decomposes the FM into a sequence of FHE-friendly blocks through knowledge distillation. The resulting FHE-friendly model is adapted for the downstream task via low-rank parallel adapters that can be learned without backpropagation through the FM. Since the proposed framework requires the LSP to share intermediate representations with the data owners, we design a privacy-preserving permutation scheme to prevent the data owners from learning the FM through model extraction attacks. Finally, a secure aggregation protocol is employed for federated learning of the low-rank parallel adapters. Empirical results on four datasets demonstrate the practical feasibility of the proposed framework.

None
Enabling Small Models for Zero-Shot Selection and Reuse through Model Label Learning 2025-02-02
Show

Vision-language models (VLMs) like CLIP have demonstrated impressive zero-shot ability in image classification tasks by aligning text and images but suffer inferior performance compared with task-specific expert models. On the contrary, expert models excel in their specialized domains but lack zero-shot ability for new tasks. How to obtain both the high performance of expert models and zero-shot ability is an important research direction. In this paper, we attempt to demonstrate that by constructing a model hub and aligning models with their functionalities using model labels, new tasks can be solved in a zero-shot manner by effectively selecting and reusing models in the hub. We introduce a novel paradigm, Model Label Learning (MLL), which bridges the gap between models and their functionalities through a Semantic Directed Acyclic Graph (SDAG) and leverages an algorithm, Classification Head Combination Optimization (CHCO), to select capable models for new tasks. Compared with the foundation model paradigm, it is less costly and more scalable, i.e., the zero-shot ability grows with the sizes of the model hub. Experiments on seven real-world datasets validate the effectiveness and efficiency of MLL, demonstrating that expert models can be effectively reused for zero-shot tasks. Our code will be released publicly.

None
Information Bottleneck Approach to Spatial Attention Learning 2025-02-02
Show

The selective visual attention mechanism in the human visual system (HVS) restricts the amount of information to reach visual awareness for perceiving natural scenes, allowing near real-time information processing with limited computational capacity [Koch and Ullman, 1987]. This kind of selectivity acts as an 'Information Bottleneck (IB)', which seeks a trade-off between information compression and predictive accuracy. However, such information constraints are rarely explored in the attention mechanism for deep neural networks (DNNs). In this paper, we propose an IB-inspired spatial attention module for DNN structures built for visual recognition. The module takes as input an intermediate representation of the input image, and outputs a variational 2D attention map that minimizes the mutual information (MI) between the attention-modulated representation and the input, while maximizing the MI between the attention-modulated representation and the task label. To further restrict the information bypassed by the attention map, we quantize the continuous attention scores to a set of learnable anchor values during training. Extensive experiments show that the proposed IB-inspired spatial attention mechanism can yield attention maps that neatly highlight the regions of interest while suppressing backgrounds, and bootstrap standard DNN structures for visual recognition tasks (e.g., image classification, fine-grained recognition, cross-domain classification). The attention maps are interpretable for the decision making of the DNNs as verified in the experiments. Our code is available at https://github.com/ashleylqx/AIB.git.

Accep...

Accepted to IJCAI 2021; Update supplementary

Code Link
Enhanced Convolutional Neural Networks for Improved Image Classification 2025-02-02
Show

Image classification is a fundamental task in computer vision with diverse applications, ranging from autonomous systems to medical imaging. The CIFAR-10 dataset is a widely used benchmark to evaluate the performance of classification models on small-scale, multi-class datasets. Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art results; however, they often suffer from overfitting and suboptimal feature representation when applied to challenging datasets like CIFAR-10. In this paper, we propose an enhanced CNN architecture that integrates deeper convolutional blocks, batch normalization, and dropout regularization to achieve superior performance. The proposed model achieves a test accuracy of 84.95%, outperforming baseline CNN architectures. Through detailed ablation studies, we demonstrate the effectiveness of the enhancements and analyze the hierarchical feature representations. This work highlights the potential of refined CNN architectures for tackling small-scale image classification problems effectively.

None