diff --git a/demo-anony.html b/demo-anony.html
deleted file mode 100644
index 50bd11b..0000000
--- a/demo-anony.html
+++ /dev/null
@@ -1,1740 +0,0 @@
- - - -This demonstration page presents the generations from 50 randomly selected prompts from the AudioCaps test set.
-We present four audio sources: the consistency model fine-tuned with CLAP, the consistency model without CLAP fine-tuning, the diffusion baseline model, and the ground truth.
-The diffusion baseline queries the neural network 400 times per audio clip, while the consistency models query a same-sized network only once.
-Since the models are not trained on speech data, we do not expect them to produce meaningful speech.
-[50 audio comparison tables, one per prompt; audio players omitted. Each table has four rows: ConsistencyTTA (ours) | ConsistencyTTA + CLAP-FT (ours) | Diffusion baseline (TANGO) | Ground truth]
This demonstration page presents the generation diversity of the proposed ConsistencyTTA model. The generations correspond to the first 50 AudioCaps test prompts and come from our consistency model with four different random seeds.
-For quantitative evidence, we standardize each generated Mel spectrogram, calculate the standard deviation across the seeds, and average this standard deviation over all Mel spectrogram points of the 50 examples. The resulting average is 0.871, demonstrating non-trivial generation diversity.
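The diversity statistic described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' evaluation script: the function name `seed_diversity`, the array layout `(num_prompts, num_seeds, n_mels, n_frames)`, per-spectrogram standardization to zero mean and unit variance, and the use of the population standard deviation (`ddof=0`) are all assumptions.

```python
import numpy as np


def seed_diversity(mels: np.ndarray, eps: float = 1e-8) -> float:
    """Average across-seed standard deviation of standardized Mel spectrograms.

    mels is assumed to have shape (num_prompts, num_seeds, n_mels, n_frames),
    i.e. every spectrogram has the same size (an assumption of this sketch).
    """
    # Standardize each spectrogram over its own time-frequency points.
    mean = mels.mean(axis=(-2, -1), keepdims=True)
    std = mels.std(axis=(-2, -1), keepdims=True)
    standardized = (mels - mean) / (std + eps)

    # Standard deviation across seeds at every Mel spectrogram point.
    per_point_std = standardized.std(axis=1)

    # Average over all points of all prompts.
    return float(per_point_std.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-in data: 50 prompts, 4 seeds, 64 Mel bins, 100 frames.
    fake_mels = rng.normal(size=(50, 4, 64, 100))
    print(f"diversity score: {seed_diversity(fake_mels):.3f}")
```

Under these assumptions, a score of zero means all seeds produce identical standardized spectrograms, and larger values mean the seeds differ more point by point.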
-Please listen to the following audio clips to confirm the generation quality of these seeds. Since the model is not trained on speech data, we do not expect it to produce meaningful speech.
-[50 audio tables, one per prompt; audio players omitted. Each table has four rows: Seed I | Seed II | Seed III | Seed IV]
Rain and thunder
- - -A loud bang followed by an engine idling loudly
- - -A man speaking while water runs in the background
- - -An electric motor runs then a person speaks
- - -A helicopter engine operating while wind blows heavily into a microphone
- - -A sewing machine sews followed by a man talking
- - -A woman talks briefly as several goats bleat including one that has high pitched bleats. A crunch is followed by a man speaking
- - -High pressure liquid spraying as a radio plays in the background
- - -Male speech and then scraping
- - -Mechanical rotation and then a loud click occurs
- - -A loud bang followed by an engine idling loudly
- - -Humming from a large engine
- - -A motor vehicle engine is revving
- - -A bus engine driving in the distance then nearby followed by compressed air releasing while a woman and a child talk in the distance
- - -A woman speaks, and a motor vehicle revs its engine
- - -A vehicle accelerating then driving by as gusts of wind blow and leaves rustle in the distance
- - -A car engine idling then starts to rev shortly after
- - -Rain and thunder
- - -A man talking followed by a camera muffling and footsteps shuffling then wood lightly clanking
- - -An electric motor runs then a person speaks
- - -A helicopter engine operating while wind blows heavily into a microphone
- - -Mechanical rotation and then a loud click occurs
- - -A machine motor running as a man is speaking followed by rapid buzzing
- - -A vehicle accelerating then driving by as gusts of wind blow and leaves rustle in the distance
- - -Train passing followed by short honk
- - -A woman speaks, and a motor vehicle revs its engine
- - -Several puppies yapping
- - -A person gulping followed by glass tapping then liquid shaking in a container proceeded by liquid pouring before plastic thumps on paper
- - -A nearby insect buzzes with nearby vibrations
- - -A bus engine driving in the distance then nearby followed by compressed air releasing while a woman and a child talk in the distance
- - -A bus engine driving in the distance then nearby followed by compressed air releasing while a woman and a child talk in the distance
- - -High pressure liquid spraying as a radio plays in the background
- - -A loud bang followed by an engine idling loudly
- - -Mechanical rotation and then a loud click occurs
- - -A motor vehicle engine is revving
- - -A woman speaks, and a motor vehicle revs its engine
- - -An electric motor runs then a person speaks
- - -A man speaking while water runs in the background
- - -Man talking in the wind and someone yells in the background while an engine makes squealing and air puffing sounds
- - -A person gulping followed by glass tapping then liquid shaking in a container proceeded by liquid pouring before plastic thumps on paper
- - -Male speech and then scraping
- - -Mechanical rotation and then a loud click occurs
- - -Several puppies yapping
- - -Train passing followed by short honk
- - -An baby laughing
- - -Humming from a large engine
- - -An baby laughing
- - -A man speaking while water runs in the background
- - -A man talking followed by a camera muffling and footsteps shuffling then wood lightly clanking
- - -A horse gallops then trot on grass as gusts of wind blow and thunderclaps in the distance
- - -A sewing machine sews followed by a man talking
- - -An baby laughing
- - -A horse gallops then trot on grass as gusts of wind blow and thunderclaps in the distance
- - -Train passing followed by short honk
- - -A man speaking while water runs in the background
- - -Several puppies yapping
- - -Several puppies yapping
- - -A person gulping followed by glass tapping then liquid shaking in a container proceeded by liquid pouring before plastic thumps on paper
- - -A woman talks briefly as several goats bleat including one that has high pitched bleats. A crunch is followed by a man speaking
- - -Rain and thunder
- - -Humming from a large engine
- - -A car engine idling then starts to rev shortly after
- - -High pressure liquid spraying as a radio plays in the background
- - -A woman speaks, and a motor vehicle revs its engine
- - -A nearby insect buzzes with nearby vibrations
- - -Train passing followed by short honk
- - -Rain and thunder
- - -A bus engine driving in the distance then nearby followed by compressed air releasing while a woman and a child talk in the distance
- - -Male speech and then scraping
- - -An electric motor runs then a person speaks
- - -A machine motor running as a man is speaking followed by rapid buzzing
- - -A vehicle accelerating then driving by as gusts of wind blow and leaves rustle in the distance
- - -A machine motor running as a man is speaking followed by rapid buzzing
- - -A car engine idling then starts to rev shortly after
- - -A helicopter engine operating while wind blows heavily into a microphone
- - -A man talking followed by a camera muffling and footsteps shuffling then wood lightly clanking
- - -A vehicle accelerating then driving by as gusts of wind blow and leaves rustle in the distance
- - -A motor vehicle engine is revving
- - -High pressure liquid spraying as a radio plays in the background
- - -Man talking in the wind and someone yells in the background while an engine makes squealing and air puffing sounds
- - -A woman talks briefly as several goats bleat including one that has high pitched bleats. A crunch is followed by a man speaking
- - -A sewing machine sews followed by a man talking
- - -A machine motor running as a man is speaking followed by rapid buzzing
- - -A loud bang followed by an engine idling loudly
- - -Man talking in the wind and someone yells in the background while an engine makes squealing and air puffing sounds
- - -Male speech and then scraping
- - -An baby laughing
- - -A nearby insect buzzes with nearby vibrations
- - -A horse gallops then trot on grass as gusts of wind blow and thunderclaps in the distance
- - -Humming from a large engine
- - -A nearby insect buzzes with nearby vibrations
- - -A motor vehicle engine is revving
- - -A car engine idling then starts to rev shortly after
- - -A helicopter engine operating while wind blows heavily into a microphone
- - -A horse gallops then trot on grass as gusts of wind blow and thunderclaps in the distance
- - -A man talking followed by a camera muffling and footsteps shuffling then wood lightly clanking
- - -A sewing machine sews followed by a man talking
- - -A person gulping followed by glass tapping then liquid shaking in a container proceeded by liquid pouring before plastic thumps on paper
- - -Man talking in the wind and someone yells in the background while an engine makes squealing and air puffing sounds
- - -A woman talks briefly as several goats bleat including one that has high pitched bleats. A crunch is followed by a man speaking
-Diffusion models power the vast majority of text-to-audio generation methods. Unfortunately, they suffer from slow inference because they query the underlying denoising network iteratively, which makes them unsuitable for applications with time or computational constraints. This work proposes text-to-audio models that require only a single non-autoregressive neural network query, accelerating generation by hundreds of times and enabling on-device audio generation.
-By incorporating classifier-free guidance into the distillation framework, our models retain diffusion models' impressive generation quality and diversity. Furthermore, the non-recurrent differentiable structure resulting from the distillation allows fine-tuning with novel loss functions. We use the CLAP loss as an example, confirming that end-to-end fine-tuning further boosts the generation quality.
-Our method reduces the computation of the core step of diffusion-based text-to-audio generation by a factor of 400, while observing minimal performance degradation in terms of Fréchet Audio Distance (FAD), Fréchet Distance (FD), KL divergence (KLD), and CLAP scores.
| | # queries (↓) | CLAP_T (↑) | CLAP_A (↑) | FAD (↓) | FD (↓) | KLD (↓) |
|---|---|---|---|---|---|---|
| Diffusion (Baseline) | 400 | 24.57 | 72.79 | 1.908 | 19.57 | 1.350 |
| Consistency + CLAP FT (Ours) | 1 | 24.69 | 72.54 | 2.406 | 20.97 | 1.358 |
| Consistency (Ours) | 1 | 22.50 | 72.30 | 2.575 | 22.08 | 1.354 |
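To make the comparison in the table concrete, the sketch below computes the query reduction factor and the percentage change of each metric relative to the diffusion baseline. The numbers are copied from the table and read as positive values; the dictionary layout and the helper names `relative_change` and `compare` are illustrative assumptions, not part of the project code.

```python
# Metric values copied from the results table (read as positive numbers).
BASELINE = {"queries": 400, "CLAP_T": 24.57, "CLAP_A": 72.79,
            "FAD": 1.908, "FD": 19.57, "KLD": 1.350}
MODELS = {
    "Consistency + CLAP FT (Ours)": {"queries": 1, "CLAP_T": 24.69, "CLAP_A": 72.54,
                                     "FAD": 2.406, "FD": 20.97, "KLD": 1.358},
    "Consistency (Ours)": {"queries": 1, "CLAP_T": 22.50, "CLAP_A": 72.30,
                           "FAD": 2.575, "FD": 22.08, "KLD": 1.354},
}


def relative_change(value: float, reference: float) -> float:
    """Signed percentage change of value with respect to reference."""
    return 100.0 * (value - reference) / reference


def compare(model: dict, baseline: dict) -> dict:
    """Query reduction factor plus per-metric percentage change vs. the baseline."""
    result = {"query_reduction": baseline["queries"] / model["queries"]}
    for metric in ("CLAP_T", "CLAP_A", "FAD", "FD", "KLD"):
        result[metric] = relative_change(model[metric], baseline[metric])
    return result


if __name__ == "__main__":
    for name, row in MODELS.items():
        summary = compare(row, BASELINE)
        changes = ", ".join(f"{k}: {v:+.2f}%" for k, v in summary.items()
                            if k != "query_reduction")
        print(f"{name}: {summary['query_reduction']:.0f}x fewer queries; {changes}")
```

For the CLAP scores a positive change is an improvement, while for FAD, FD, and KLD a positive change is a degradation, matching the arrows in the table header.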
-Consistency models demonstrate non-trivial generation diversity, as do diffusion models. On this page, we present 50 groups of generations from four different random seeds to demonstrate this diversity, showing that our method combines the diversity of diffusion models with the efficiency of single-step models.
-ConsistencyTTA's performance is verified via extensive human evaluation. Audio clips generated from ConsistencyTTA and baseline methods are mixed and shown to the evaluators, who are then asked to rate the audio clips based on their quality and correspondence with the textual prompt. A sample of the evaluation form is shown on this page.
-@inproceedings{bai2024accelerating,
+@inproceedings{bai2024consistencytta,
   author = {Bai, Yatong and Dang, Trung and Tran, Dung and Koishida, Kazuhito and Sojoudi, Somayeh},
   title = {ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
   booktitle = {INTERSPEECH},
diff --git a/styles.css b/styles.css
index ca1d616..7f667eb 100644
--- a/styles.css
+++ b/styles.css
@@ -7,6 +7,9 @@
 @import url(
   'https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.1.0/css/all.min.css'
 );
+@import url(
+  "https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css"
+);
 
 body {
@@ -197,6 +200,9 @@ a .fab.fa-github {
   font-size: 24px; /* adjust size as needed */
   margin: 0px 7px;
 }
+a .ai-arxiv {
+  color: #E7352B; /* Set your desired color (arXiv red) */
+}
 
 .eval-button-small {
   background-color: #5d5d5d;