diff --git a/README.md b/README.md
index 2553c5b..9b0e2ff 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 # ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

 This is the [**official website**](https://consistency-tta.github.io) for the paper \
-"Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation" \
+*ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation* \
 from Microsoft Applied Science Group and UC Berkeley \
 by [Yatong Bai](https://bai-yt.github.io),
 [Trung Dang](https://www.microsoft.com/applied-sciences/people/trung-dang),
@@ -9,10 +9,11 @@ by [Yatong Bai](https://bai-yt.github.io),
 [Kazuhito Koishida](https://www.microsoft.com/applied-sciences/people/kazuhito-koishida),
 and [Somayeh Sojoudi](https://people.eecs.berkeley.edu/~sojoudi/).

-**[[Preprint Paper](https://arxiv.org/abs/2309.10740)]**
-**[[Project Homepage](https://consistency-tta.github.io)]**
-**[[Code](https://github.com/Bai-YT/ConsistencyTTA)]**
-**[[Model Checkpoints](https://huggingface.co/Bai-YT/ConsistencyTTA)]**
+**[[Live Demo](https://huggingface.co/spaces/Bai-YT/ConsistencyTTA)]**
+**[[Preprint Paper](https://arxiv.org/abs/2309.10740)]**
+**[[Project Homepage](https://consistency-tta.github.io)]**
+**[[Code](https://github.com/Bai-YT/ConsistencyTTA)]**
+**[[Model Checkpoints](https://huggingface.co/Bai-YT/ConsistencyTTA)]**
 **[[Generation Examples](https://consistency-tta.github.io/demo.html)]**


@@ -35,8 +36,8 @@ single-step models stack up with previous methods, most of which mostly require
 ### Cite Our Work (BibTeX)

 ```bibtex
-@article{bai2023accelerating,
-  title={Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
+@article{bai2023consistencytta,
+  title={ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
   author={Bai, Yatong and Dang, Trung and Tran, Dung and Koishida, Kazuhito and Sojoudi, Somayeh},
   journal={arXiv preprint arXiv:2309.10740},
   year={2023}
diff --git a/demo-anony.html b/demo-anony.html
deleted file mode 100644
index e54a3c6..0000000
--- a/demo-anony.html
+++ /dev/null
@@ -1,1741 +0,0 @@
-
-
-
-
- - - -This demonstration page presents the generations from 50 randomly selected prompts from the AudioCaps test set.
-We present four audio sources: the consistency model fine-tuned with CLAP, - the consistency model without CLAP-fine-tuning, the diffusion baseline model, and the ground truth.
-The diffusion baseline queries the neural network 400 times per audio clip, - while the consistency models query a same-sized network only one time.
-Since the models are not trained on speech data, we do not expect them to produce meaningful speech.
-
-[Audio table: 50 groups, one per prompt, each with four clips labeled ConsistencyTTA (ours), ConsistencyTTA + CLAP-FT (ours), Diffusion baseline (TANGO), and Ground truth; audio players not recoverable]
This demonstration page presents the generation diversity of the proposed consistency TTA model. - The generations correspond to the first 50 AudioCaps test prompts, - and are from our consistency model with four different random seeds.
-For quantitative evidence, we standardize each generated Mel spectrogram, - calculate the standard deviation across different seeds, - and average the standard deviation across all Mel spectrogram points of the 50 examples. - The averaged number is 0.871, demonstrating non-trivial generation diversity.
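The diversity measurement described above can be sketched as follows. This is a minimal illustration, assuming NumPy and a hypothetical array layout of (prompts, seeds, Mel bins, frames). The per-spectrogram zero-mean, unit-variance standardization and the population standard deviation (`ddof=0`) are assumptions, since the page specifies neither, so the sketch will not necessarily reproduce the reported 0.871. The input data here is synthetic noise, not model output.

```python
import numpy as np


def seed_diversity(mels: np.ndarray, eps: float = 1e-8) -> float:
    """Average across-seed standard deviation of standardized Mel spectrograms.

    mels has the assumed shape (n_prompts, n_seeds, n_mels, n_frames).
    """
    if mels.ndim != 4:
        raise ValueError("expected shape (n_prompts, n_seeds, n_mels, n_frames)")

    # Step 1: standardize each spectrogram using its own mean and std
    # (assumption: zero mean, unit variance per spectrogram).
    mean = mels.mean(axis=(2, 3), keepdims=True)
    std = mels.std(axis=(2, 3), keepdims=True)
    standardized = (mels - mean) / (std + eps)

    # Step 2: standard deviation across seeds at every spectrogram point.
    per_point_std = standardized.std(axis=1)

    # Step 3: average over all points of all prompts.
    return float(per_point_std.mean())


# Synthetic stand-ins for real generations (hypothetical sizes).
rng = np.random.default_rng(0)
random_mels = rng.normal(size=(50, 4, 64, 100))
identical_mels = np.repeat(rng.normal(size=(50, 1, 64, 100)), 4, axis=1)

diverse = seed_diversity(random_mels)       # independent seeds: clearly above zero
collapsed = seed_diversity(identical_mels)  # identical seeds: essentially zero
```

Because each spectrogram is standardized first, a model that ignores the seed scores close to zero regardless of loudness, while fully independent generations score close to one.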
-Please listen to the following audio clips to confirm the generation quality of these seeds. - Since the model is not trained on speech data, we do not expect it to produce meaningful speech.
-
-[Audio table: 50 groups, one per prompt, each with four clips labeled Seed I, Seed II, Seed III, and Seed IV; audio players not recoverable]
Rain and thunder
- - -A loud bang followed by an engine idling loudly
- - -A man speaking while water runs in the background
- - -An electric motor runs then a person speaks
- - -A helicopter engine operating while wind blows heavily into a microphone
- - -A sewing machine sews followed by a man talking
- - -A woman talks briefly as several goats bleat including one that has high pitched bleats. A crunch is followed by a man speaking
- - -High pressure liquid spraying as a radio plays in the background
- - -Male speech and then scraping
- - -Mechanical rotation and then a loud click occurs
- - -A loud bang followed by an engine idling loudly
- - -Humming from a large engine
- - -A motor vehicle engine is revving
- - -A bus engine driving in the distance then nearby followed by compressed air releasing while a woman and a child talk in the distance
- - -A woman speaks, and a motor vehicle revs its engine
- - -A vehicle accelerating then driving by as gusts of wind blow and leaves rustle in the distance
- - -A car engine idling then starts to rev shortly after
- - -Rain and thunder
- - -A man talking followed by a camera muffling and footsteps shuffling then wood lightly clanking
- - -An electric motor runs then a person speaks
- - -A helicopter engine operating while wind blows heavily into a microphone
- - -Mechanical rotation and then a loud click occurs
- - -A machine motor running as a man is speaking followed by rapid buzzing
- - -A vehicle accelerating then driving by as gusts of wind blow and leaves rustle in the distance
- - -Train passing followed by short honk
- - -A woman speaks, and a motor vehicle revs its engine
- - -Several puppies yapping
- - -A person gulping followed by glass tapping then liquid shaking in a container proceeded by liquid pouring before plastic thumps on paper
- - -A nearby insect buzzes with nearby vibrations
- - -A bus engine driving in the distance then nearby followed by compressed air releasing while a woman and a child talk in the distance
- - -A bus engine driving in the distance then nearby followed by compressed air releasing while a woman and a child talk in the distance
- - -High pressure liquid spraying as a radio plays in the background
- - -A loud bang followed by an engine idling loudly
- - -Mechanical rotation and then a loud click occurs
- - -A motor vehicle engine is revving
- - -A woman speaks, and a motor vehicle revs its engine
- - -An electric motor runs then a person speaks
- - -A man speaking while water runs in the background
- - -Man talking in the wind and someone yells in the background while an engine makes squealing and air puffing sounds
- - -A person gulping followed by glass tapping then liquid shaking in a container proceeded by liquid pouring before plastic thumps on paper
- - -Male speech and then scraping
- - -Mechanical rotation and then a loud click occurs
- - -Several puppies yapping
- - -Train passing followed by short honk
- - -An baby laughing
- - -Humming from a large engine
- - -An baby laughing
- - -A man speaking while water runs in the background
- - -A man talking followed by a camera muffling and footsteps shuffling then wood lightly clanking
- - -A horse gallops then trot on grass as gusts of wind blow and thunderclaps in the distance
- - -A sewing machine sews followed by a man talking
- - -An baby laughing
- - -A horse gallops then trot on grass as gusts of wind blow and thunderclaps in the distance
- - -Train passing followed by short honk
- - -A man speaking while water runs in the background
- - -Several puppies yapping
- - -Several puppies yapping
- - -A person gulping followed by glass tapping then liquid shaking in a container proceeded by liquid pouring before plastic thumps on paper
- - -A woman talks briefly as several goats bleat including one that has high pitched bleats. A crunch is followed by a man speaking
- - -Rain and thunder
- - -Humming from a large engine
- - -A car engine idling then starts to rev shortly after
- - -High pressure liquid spraying as a radio plays in the background
- - -A woman speaks, and a motor vehicle revs its engine
- - -A nearby insect buzzes with nearby vibrations
- - -Train passing followed by short honk
- - -Rain and thunder
- - -A bus engine driving in the distance then nearby followed by compressed air releasing while a woman and a child talk in the distance
- - -Male speech and then scraping
- - -An electric motor runs then a person speaks
- - -A machine motor running as a man is speaking followed by rapid buzzing
- - -A vehicle accelerating then driving by as gusts of wind blow and leaves rustle in the distance
- - -A machine motor running as a man is speaking followed by rapid buzzing
- - -A car engine idling then starts to rev shortly after
- - -A helicopter engine operating while wind blows heavily into a microphone
- - -A man talking followed by a camera muffling and footsteps shuffling then wood lightly clanking
- - -A vehicle accelerating then driving by as gusts of wind blow and leaves rustle in the distance
- - -A motor vehicle engine is revving
- - -High pressure liquid spraying as a radio plays in the background
- - -Man talking in the wind and someone yells in the background while an engine makes squealing and air puffing sounds
- - -A woman talks briefly as several goats bleat including one that has high pitched bleats. A crunch is followed by a man speaking
- - -A sewing machine sews followed by a man talking
- - -A machine motor running as a man is speaking followed by rapid buzzing
- - -A loud bang followed by an engine idling loudly
- - -Man talking in the wind and someone yells in the background while an engine makes squealing and air puffing sounds
- - -Male speech and then scraping
- - -An baby laughing
- - -A nearby insect buzzes with nearby vibrations
- - -A horse gallops then trot on grass as gusts of wind blow and thunderclaps in the distance
- - -Humming from a large engine
- - -A nearby insect buzzes with nearby vibrations
- - -A motor vehicle engine is revving
- - -A car engine idling then starts to rev shortly after
- - -A helicopter engine operating while wind blows heavily into a microphone
- - -A horse gallops then trot on grass as gusts of wind blow and thunderclaps in the distance
- - -A man talking followed by a camera muffling and footsteps shuffling then wood lightly clanking
- - -A sewing machine sews followed by a man talking
- - -A person gulping followed by glass tapping then liquid shaking in a container proceeded by liquid pouring before plastic thumps on paper
- - -Man talking in the wind and someone yells in the background while an engine makes squealing and air puffing sounds
- - -A woman talks briefly as several goats bleat including one that has high pitched bleats. A crunch is followed by a man speaking
- - -- Diffusion models power the vast majority of text-to-audio generation methods. - Unfortunately, diffusion models suffer from slow inference because they query the - underlying denoising network iteratively, making them unsuitable for applications with time or computational constraints. - This work modifies the recently proposed "consistency distillation" framework to train text-to-audio - models that require only a single neural network query, accelerating generation by a factor of hundreds. -
-- By incorporating classifier-free guidance into the distillation framework, our models retain - diffusion models' impressive generation quality and diversity. Furthermore, the non-recurrent - differentiable structure resulting from the distillation allows fine-tuning with novel loss functions. - We use the CLAP loss as an example, confirming that end-to-end fine-tuning further boosts the generation quality. -
-- Our method reduces the computation of the core step of diffusion-based text-to-audio generation by - a factor of 400, with minimal performance degradation in terms of - Fréchet Audio Distance (FAD), Fréchet Distance (FD), KL Divergence, and CLAP Scores. -
-| | # queries (↓) | CLAP<sub>T</sub> (↑) | CLAP<sub>A</sub> (↑) | FAD (↓) | FD (↓) | KLD (↓) |
-|---|---|---|---|---|---|---|
-| Diffusion (Baseline) | 400 | 24.57 | 72.79 | 1.908 | 19.57 | 1.350 |
-| Consistency + CLAP FT (Ours) | 1 | 24.69 | 72.54 | 2.406 | 20.97 | 1.358 |
-| Consistency (Ours) | 1 | 22.50 | 72.30 | 2.575 | 22.08 | 1.354 |
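To put the size of the quality gap in the table above into numbers, the short sketch below computes each consistency model's relative change against the diffusion baseline. The metric values are copied from the table, reading the leading dashes as diff markers rather than minus signs. The dictionary keys and the helper name `relative_change_percent` are illustrative choices, not part of the project's code.

```python
# Metric values copied from the table above.
# CLAP_T and CLAP_A: higher is better. FAD, FD, KLD: lower is better.
BASELINE = {"CLAP_T": 24.57, "CLAP_A": 72.79, "FAD": 1.908, "FD": 19.57, "KLD": 1.350}
MODELS = {
    "Consistency + CLAP FT (Ours)": {
        "CLAP_T": 24.69, "CLAP_A": 72.54, "FAD": 2.406, "FD": 20.97, "KLD": 1.358,
    },
    "Consistency (Ours)": {
        "CLAP_T": 22.50, "CLAP_A": 72.30, "FAD": 2.575, "FD": 22.08, "KLD": 1.354,
    },
}
HIGHER_IS_BETTER = {"CLAP_T", "CLAP_A"}


def relative_change_percent(model: dict, reference: dict) -> dict:
    """Signed change of each metric relative to the reference, in percent."""
    return {
        name: 100.0 * (model[name] - reference[name]) / reference[name]
        for name in reference
    }


changes = {label: relative_change_percent(m, BASELINE) for label, m in MODELS.items()}

# Network queries per clip: 400 for the baseline, 1 for the consistency models.
query_reduction = 400 / 1

for label, per_metric in changes.items():
    for name, pct in per_metric.items():
        improved = (pct > 0) == (name in HIGHER_IS_BETTER)
        print(f"{label:30s} {name:7s} {pct:+6.2f}% ({'better' if improved else 'worse'})")
```

Reading the signs together with the arrows in the header shows which metrics moved in the favorable direction for each model.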
- Consistency models demonstrate non-trivial generation diversity, as do diffusion models. - On this page, we present 50 groups of generations from - four different random seeds to demonstrate this diversity, showing that our method - combines the diversity of diffusion models and the efficiency of single-step models. -
-- ConsistencyTTA's performance is verified via extensive human evaluation. - Audio clips generated from ConsistencyTTA and baseline methods are mixed and shown to the evaluators, - who are then asked to rate the audio clips based on their quality and correspondence with the textual prompt. - A sample of the evaluation form is shown on this page. -
-