
Regarding the "some Natural Speech Features Of Microsoft" code #92

Open
chusevip8 opened this issue Jun 17, 2023 · 2 comments
Comments

@chusevip8

Hi author, regarding "some Natural Speech Features Of Microsoft" — which part of the code implements this optimization? I couldn't find it; please point me to it.

@MaxMax2016
Copy link
Collaborator


# forward: map the posterior sample z into the prior space
z_p = self.flow(z, y_mask, g=g)
# backward: sample from the prior and run it through the inverse flow
z_r = m_p + torch.randn_like(m_p) * torch.exp(logs_p)
z_r = self.flow(z_r, y_mask, g=g, reverse=True)
return o, l_length, attn, ids_slice, x_mask, y_mask, (z, z_p, z_r, m_p, logs_p, m_q, logs_q)

            # forward KL: posterior sample in prior space vs. the prior
            loss_kl = kl_loss(z_p, logs_q, m_p, logs_p, z_mask) * hps.train.c_kl
            # backward KL: prior sample mapped back through the inverse flow vs. the posterior
            if z_r is None:
                loss_kl_r = 0
            else:
                loss_kl_r = kl_loss(z_r, logs_p, m_q, logs_q, z_mask) * hps.train.c_kl
            loss_fm = feature_loss(fmap_r, fmap_g)
            loss_gen, losses_gen = generator_loss(y_d_hat_g)
            loss_gen_all = loss_gen + loss_fm + loss_mel + loss_dur + loss_kl + loss_kl_r
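For reference, each call to `kl_loss` above measures the KL divergence between two diagonal Gaussians, summed over masked frames. The repo's version works on sampled tensors; per element it corresponds to the closed-form KL between two scalar Gaussians parameterized by mean and log standard deviation. A minimal scalar sketch of that closed form (the `gaussian_kl` name is mine, not the repo's function):

```python
import math

def gaussian_kl(m_q, logs_q, m_p, logs_p):
    """Closed-form KL( N(m_q, e^{logs_q}) || N(m_p, e^{logs_p}) ) for scalars.

    Parameters are means and log standard deviations, matching the
    (m, logs) convention used in the snippets above.
    """
    var_q = math.exp(2.0 * logs_q)
    return (logs_p - logs_q - 0.5
            + 0.5 * (var_q + (m_q - m_p) ** 2) * math.exp(-2.0 * logs_p))

# identical distributions have zero divergence
print(gaussian_kl(0.0, 0.0, 0.0, 0.0))  # 0.0
# shifting the mean by 1 with unit variance gives KL = 0.5
print(gaussian_kl(1.0, 0.0, 0.0, 0.0))  # 0.5
```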

@nshmyrev
Copy link

nshmyrev commented Jan 2, 2024

Hey @MaxMax2016, thanks for the code. I've tried playing with the current implementation a bit, and honestly it doesn't really work as intended. Here are the reasons:

  1. It needs a weight on the loss (usually much smaller than 1.0) and also a Gaussian noise weight, similar to the one used at inference (noise_scale):
z_r = m_p + torch.randn_like(m_p) * torch.exp(logs_p) * noise_scale
  2. Because speech is time-variable, the KL loss needs Soft-DTW; otherwise it pushes the model to make speech overly uniform. The paper mentions that.

Without Soft-DTW, once this loss is applied, automated-evaluation CER goes down, Mel loss goes up significantly, and the Fréchet score also goes up significantly. This is because the speech no longer follows the target audio.

A more advanced implementation of the backward loss is here: heatz123/naturalspeech#12, but it is also not straightforward to make work.
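To illustrate the alignment-tolerant distance being discussed: Soft-DTW replaces the hard minimum in dynamic time warping with a smoothed (differentiable) minimum, so a frame-wise divergence can be matched across slightly shifted timings instead of forcing a rigid frame-to-frame alignment. A minimal pure-Python sketch over a precomputed cost matrix (the `soft_dtw`/`softmin` names and the list-of-lists interface are my own, not from the paper's or the linked PR's code):

```python
import math

def softmin(a, b, c, gamma):
    # smoothed minimum: -gamma * log( sum_i exp(-x_i / gamma) )
    # computed with a max-shift for numerical stability
    vals = [-a / gamma, -b / gamma, -c / gamma]
    m = max(vals)
    return -gamma * (m + math.log(sum(math.exp(v - m) for v in vals)))

def soft_dtw(cost, gamma=0.1):
    """Soft-DTW score for a cost matrix.

    cost[i][j] is the pairwise distance between frame i of one sequence
    and frame j of the other (e.g. a per-frame KL term). Smaller gamma
    approaches hard DTW; larger gamma smooths the alignment more.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    # R[i][j] accumulates the soft-minimal alignment cost up to (i, j)
    R = [[INF] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i][j] = cost[i - 1][j - 1] + softmin(
                R[i][j - 1], R[i - 1][j], R[i - 1][j - 1], gamma
            )
    return R[n][m]

# a single perfectly matching frame costs nothing
print(soft_dtw([[0.0]]))  # 0.0
```

The key point for the discussion above is that this objective tolerates local timing differences between `z_r` and the target-side statistics, so minimizing it does not push the model toward unnaturally uniform durations the way a strict frame-aligned KL does.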
