Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training
summary: Style voice conversion aims to transform the speaking style of source speech
into a desired style while keeping the original speaker's identity. However,
previous style voice conversion approaches primarily focus on well-defined
domains such as emotion, which limits their practical applications. In
this study, we present ZSVC, a novel Zero-shot Style Voice Conversion approach
that utilizes a speech codec and a latent diffusion model with a speech
prompting mechanism to facilitate in-context learning for speaking style conversion. To
disentangle speaking style and speaker timbre, we introduce an information
bottleneck to filter the speaking style in the source speech and employ Uncertainty
Modeling Adaptive Instance Normalization (UMAdaIN) to perturb the speaker
timbre in the style prompt. Moreover, we propose a novel adversarial training
strategy to enhance in-context learning and improve style similarity.
Experiments conducted on 44,000 hours of speech data demonstrate the superior
performance of ZSVC in generating speech with diverse speaking styles in
zero-shot scenarios.
id: http://arxiv.org/abs/2501.04416v1
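For context on the UMAdaIN component mentioned in the summary, below is a minimal sketch of what an uncertainty-modeled AdaIN perturbation of channel statistics might look like. This is not the paper's implementation: the class name, the Gaussian resampling of the channel statistics, and the (batch, channels, time) feature layout are all assumptions made purely for illustration of the general idea of perturbing timbre-like statistics in a style prompt.

```python
# Hypothetical sketch only; not the ZSVC authors' UMAdaIN code.
import torch
import torch.nn as nn


class UncertainAdaIN(nn.Module):
    """Instance-normalizes features, then re-styles them with channel
    statistics resampled around the originals, so speaker timbre carried
    by those statistics is randomly perturbed during training."""

    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) features from the style prompt.
        mu = x.mean(dim=2, keepdim=True)               # per-utterance channel mean
        sigma = x.std(dim=2, keepdim=True) + self.eps  # per-utterance channel std
        x_norm = (x - mu) / sigma                      # strip timbre-like statistics

        if self.training:
            # Estimate how much the statistics vary across the batch
            # ("uncertainty"), then sample new statistics from a Gaussian
            # centered on the originals with that spread.
            # Assumes batch size > 1 during training.
            mu_uncert = mu.std(dim=0, keepdim=True) + self.eps
            sigma_uncert = sigma.std(dim=0, keepdim=True) + self.eps
            mu_hat = mu + torch.randn_like(mu) * mu_uncert
            sigma_hat = sigma + torch.randn_like(sigma) * sigma_uncert
        else:
            mu_hat, sigma_hat = mu, sigma              # no perturbation at inference

        return x_norm * sigma_hat + mu_hat


if __name__ == "__main__":
    layer = UncertainAdaIN().train()
    prompt_features = torch.randn(4, 256, 120)  # toy style-prompt features
    perturbed = layer(prompt_features)
    print(perturbed.shape)                       # torch.Size([4, 256, 120])
```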
judge
Write [vclab::confirmed] or [vclab::excluded] in a comment.