From abaa6ccfaa598a9caf5f3cd02425d02aa7564e9b Mon Sep 17 00:00:00 2001
From: Paul Vicol
Date: Thu, 28 Nov 2024 00:06:26 -0500
Subject: [PATCH] Added new papers

---
 index.html                  |  30 +++++
 paper_pages/IJlbuSrXmk.html | 217 ++++++++++++++++++++++++++++++++++++
 2 files changed, 247 insertions(+)
 create mode 100644 paper_pages/IJlbuSrXmk.html

diff --git a/index.html b/index.html
index 26b8043..4c5f6a5 100644
--- a/index.html
+++ b/index.html
@@ -493,6 +493,36 @@
+
+
+ Audio-Visual Dataset Distillation +
+
+
Saksham Singh Kushwaha · Siva Sai Nagender Vasireddy · Kai Wang · Yapeng Tian
+
+
+ + + + + + + +
+
+

In this article, we introduce audio-visual dataset distillation, a task to construct a smaller yet representative synthetic audio-visual dataset that maintains the cross-modal semantic association between audio and visual modalities. Dataset distillation techniques have primarily focused on image classification. However, with the growing capabilities of audio-visual models and the vast datasets required for their training, it is necessary to explore distillation methods beyond the visual modality. Our approach builds upon the foundation of Distribution Matching (DM), extending it to handle the unique challenges of audio-visual data. A key challenge is to jointly learn synthetic data that distills both the modality-wise information and natural alignment from real audio-visual data. We introduce a vanilla audio-visual distribution matching framework that separately trains visual-only and audio-only DM components, enabling us to investigate the effectiveness of audio-visual integration and various multimodal fusion methods. To address the limitations of unimodal distillation, we propose two novel matching losses: implicit cross-matching and cross-modal gap matching. These losses work in conjunction with the vanilla unimodal distribution matching loss to enforce cross-modal alignment and enhance the audio-visual dataset distillation process. Extensive audio-visual classification and retrieval experiments on four audio-visual datasets, AVE, MUSIC-21, VGGSound, and VGGSound-10K, demonstrate the effectiveness of our proposed matching approaches and validate the benefits of audio-visual integration with condensed data. This work establishes a new frontier in audio-visual dataset distillation, paving the way for further advancements in this exciting field. Our source code and pre-trained models will be released.

+
+
+
+
Growing Tiny Networks: Spotting Expressivity Bottlenecks and Fixing Them Optimally
diff --git a/paper_pages/IJlbuSrXmk.html b/paper_pages/IJlbuSrXmk.html
new file mode 100644
index 0000000..80379e2
--- /dev/null
+++ b/paper_pages/IJlbuSrXmk.html
@@ -0,0 +1,217 @@
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+
+ + +

+ Audio-Visual Dataset Distillation +

+ +

+ Saksham Singh Kushwaha · Siva Sai Nagender Vasireddy · Kai Wang · Yapeng Tian +

+ + +
+
+
+
+
+
+ + + +
+

Video

+ +
+ +
+

Paper PDF

+ Thumbnail of paper pages +
+ +
+
+

+

+

Abstract

+

+ In this article, we introduce audio-visual dataset distillation, a task to construct a smaller yet representative synthetic audio-visual dataset that maintains the cross-modal semantic association between audio and visual modalities. Dataset distillation techniques have primarily focused on image classification. However, with the growing capabilities of audio-visual models and the vast datasets required for their training, it is necessary to explore distillation methods beyond the visual modality. Our approach builds upon the foundation of Distribution Matching (DM), extending it to handle the unique challenges of audio-visual data. A key challenge is to jointly learn synthetic data that distills both the modality-wise information and natural alignment from real audio-visual data. We introduce a vanilla audio-visual distribution matching framework that separately trains visual-only and audio-only DM components, enabling us to investigate the effectiveness of audio-visual integration and various multimodal fusion methods. To address the limitations of unimodal distillation, we propose two novel matching losses: implicit cross-matching and cross-modal gap matching. These losses work in conjunction with the vanilla unimodal distribution matching loss to enforce cross-modal alignment and enhance the audio-visual dataset distillation process. Extensive audio-visual classification and retrieval experiments on four audio-visual datasets, AVE, MUSIC-21, VGGSound, and VGGSound-10K, demonstrate the effectiveness of our proposed matching approaches and validate the benefits of audio-visual integration with condensed data. This work establishes a new frontier in audio-visual dataset distillation, paving the way for further advancements in this exciting field. Our source code and pre-trained models will be released. +

+
+

+ +
+
+
+ + + + + + + + + + +