Some doubts about any-to-any voice conversion #6
In addition, the mapping network training is even worse.
Should I remove the mapping network if I only use the model with reference audio?
What you are asking is an open research question that nobody has an answer to at this point, but I will give you my two cents on this issue. It is only a discussion, not meant to provide any viable solutions. In general, there are mainly two ways to do voice conversion: (1) disentangle the speaker identity from the speech and re-synthesize it with the target speaker's characteristics, or (2) train an encoder-decoder model that reconstructs the speech conditioned on a target speaker representation (the approach taken here).
The first method usually suffers from poor sound quality, because it is difficult to completely disentangle speakers from speech while keeping enough information to reconstruct the speech with high quality (unless you use text labels, which makes it a TTS system and thus impossible to run in real time), while the latter suffers from dissimilarity, as the input speaker information is often leaked into the decoder. This paper introduces an adversarial classifier loss to mitigate the second problem, so we can guarantee the converted results sound similar to the target speaker for seen input speakers, and sometimes for unseen input speakers, while maintaining a reasonable degree of naturalness in the synthesized speech.

However, when it comes to zero-shot conversion, the trick of the adversarial classifier loss is no longer applicable, because such a classifier is not even able to find patterns for fewer than a hundred speakers, let alone the thousands of speakers that are usually required to train zero-shot conversion models. In addition, if you read the original StarGAN v2 paper, you will see that the style encoder is trained only to reconstruct the image; hence it works well for reconstruction but poorly for conversion when the disentanglement in the encoder is insufficient and when there are so many speakers that the style space becomes extremely complicated and the discriminator loses track of bad samples from the generator.

That is to say, if you want to do zero-shot conversion, you will need to work heavily on improving the current discriminator settings. For example, build a set of discriminators, each of which only works on a subset of speakers, or use speaker embeddings to help the model set the right goals for discrimination. You can also instead disentangle the input speakers as much as possible and try to reconstruct the speech with the given style. There are several ways of disentangling the input speaker information, for example, Huang et al. 2020. Another way is to use speaker-agnostic features such as PPG and F0 to reconstruct the speech, but spoiler alert: these features are usually not good enough to synthesize natural-sounding speech. Of course, if you can find a way to make the adversarial classifier work in the zero-shot setting while keeping the same sound quality, I believe it will deserve a publication at a top machine learning conference such as NeurIPS or ICML.
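For readers unfamiliar with the adversarial classifier loss mentioned above, here is a minimal sketch of the idea (a classifier trained to recognize the source speaker of converted speech, with the generator trained to fool it). `classifier`, `feat_dim`, and the feature shapes are assumptions for illustration, not the repository's actual modules.

```python
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical setup: `feats` are features of the converted speech,
# e.g. an intermediate representation or the converted mel itself.
feat_dim, num_speakers = 256, 20
classifier = nn.Sequential(
    nn.Linear(feat_dim, 256), nn.ReLU(),
    nn.Linear(256, num_speakers),
)

def classifier_loss(feats, src_speaker_ids):
    # The classifier learns to recognize the *source* speaker of a converted
    # sample; feats are detached so only the classifier is updated here.
    return F.cross_entropy(classifier(feats.detach()), src_speaker_ids)

def generator_adv_cls_loss(feats, trg_speaker_ids):
    # The generator is updated so the classifier mistakes the converted sample
    # for the *target* speaker, pushing source-speaker cues out of the output.
    return F.cross_entropy(classifier(feats), trg_speaker_ids)
```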
Thanks! I will try to add a multi-band loss like HiFi-GAN, and ideas from "SEQUENCE-TO-SEQUENCE SINGING VOICE SYNTHESIS ..."
Hello, I recently tried some solutions to achieve any-to-any voice conversion. Simply increasing the number of speakers has given the best result so far. I am now trying to use x-vector as the style encoder. Is there anything I need to pay attention to?
In addition, you mentioned that when there are too many speakers, the speaker discriminator will have difficulty converging. Can its loss be changed to some other loss?
Sorry for the late reply. I hope you've got some good results using x-vector, though I believe it would not work better than the style encoder alone, because x-vector carries much less information about the target speaker than the trained style encoder does.

The jittering F0 is probably caused by how the F0 features are processed by the encoder. They are only processed by a single ResBlock, which is unlikely to remove all the input F0 information. The subsequent AdaIN blocks then have to transform these low-pitch features into high-pitch features, which is difficult and inevitably loses detailed information, hence the jitter. My suggestion is to add a few more instance normalization layers to process the F0 feature, so that hopefully the features fed into the decoder only contain the pitch curves instead of the exact F0 values in Hz, which is what the model was trained for.

The problem of low similarity with a large number of speakers is probably caused by the limited capacity of the discriminators. I do not have any good suggestions for you, but you may try something like large hypernetworks that generate the weights of the discriminators for each individual speaker after some shared layers, to further process speaker-specific characteristics. This can also be applied to the mapping network. The basic idea is to make the discriminators powerful enough to memorize the characteristics of each speaker. Another very simple way is to have multiple discriminators, each of which only acts on a specific set of speakers. For example, discriminator 1 is trained on speakers 1 to 10, discriminator 2 on speakers 11 to 20, and so on.
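As a rough illustration of the "one discriminator per block of speakers" idea above, here is a hedged sketch. `make_disc` (a factory for the project's discriminator) and the assumption that each sub-discriminator returns one logit per sample are hypothetical; only the routing logic is the point.

```python
import torch
import torch.nn as nn

class GroupedDiscriminator(nn.Module):
    """Routes each sample to the discriminator owning its speaker group."""
    def __init__(self, make_disc, num_speakers, group_size=10):
        super().__init__()
        self.group_size = group_size
        num_groups = (num_speakers + group_size - 1) // group_size
        # e.g. 117 speakers with group_size=10 -> 12 sub-discriminators
        self.discs = nn.ModuleList([make_disc(group_size) for _ in range(num_groups)])

    def forward(self, mel, speaker_id):
        out = torch.empty(mel.size(0), device=mel.device)
        group = speaker_id // self.group_size        # which discriminator handles the sample
        local_id = speaker_id % self.group_size      # speaker index inside that group
        for g in group.unique():
            idx = (group == g).nonzero(as_tuple=True)[0]
            out[idx] = self.discs[int(g)](mel[idx], local_id[idx])
        return out  # one real/fake logit per sample
```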
Thank you! I will try it. If there is any progress, I will share it with you as soon as possible.
Using multiple discriminators is effective: when the model converges, the sound quality on unseen speakers is better, and the similarity to the target speaker is better than with the original setup.
I think it depends on the number of speakers you have in the training set and what your latent space of the speaker embedding looks like. Usually, a multivariate Gaussian assumption is what people would use, so you may want to add an additional loss term on the latent variables from the style encoder or x-vector to enforce the underlying Gaussian distribution (an L2 norm would do the job). When you say many unseen sound characteristics are lost, what do you mean exactly by "sound characteristics"? Can you give some examples of the "lost characteristics" versus what the "characteristics" should actually be like?

Another way to test whether the latent space actually encodes unseen speakers in a form the generator can readily use is to apply gradient descent to find the style that reconstructs the unseen speaker's speech. That is, after training your model, you simply fix everything, make the style vector a trainable parameter, and use gradient descent to minimize the reconstruction loss between the input mel and the output mel of unseen speakers. If the loss does not converge to a reasonable value, it means there is no style in the space the generator has learned that can faithfully reconstruct unseen speakers' speech. One easy way to finetune for unseen speakers is to simply remove the last projection layer that converts the 512 channels to the number of speakers. Another more complicated way is to use a hypernetwork or weight AdaIN (see Chen et al.).
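A minimal sketch of the gradient-descent probe described above, assuming a trained `generator` taking (mel, style, F0) and one unseen speaker's utterance `mel`; the names, the signature, and `style_dim` are assumptions rather than the repository's exact API.

```python
import torch
import torch.nn.functional as F

# `generator`, `mel`, `f0`: the trained (frozen) model and one unseen
# speaker's utterance, assumed to exist in scope.
style_dim = 64
style = torch.zeros(1, style_dim, requires_grad=True)   # the only trainable parameter
opt = torch.optim.Adam([style], lr=1e-2)

for p in generator.parameters():                         # freeze everything else
    p.requires_grad_(False)

for step in range(500):
    opt.zero_grad()
    recon = generator(mel, style, f0)                    # try to reconstruct the unseen speaker
    loss = F.l1_loss(recon, mel)
    loss.backward()
    opt.step()

# If the loss plateaus at a large value, no point in the learned style space can
# reconstruct this speaker; if it converges, try using `style` to convert other
# inputs (effectively one-shot conversion via a few steps of gradient descent).
```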
https://drive.google.com/drive/folders/1lQO7ZtWN6MvyZeMFwoB2L0AjDPL_9V1p?usp=sharing
I think 1300y_out is very similar to Ref_wav, so the good news is that the generator is capable of reconstructing unseen speakers without any further training. Have you tried using the style obtained with gradient descent to convert other input audio? Does it work? If so, at least the model can do one-shot learning with a few iterations of gradient descent. You're right that Y_out does not sound very similar to Ref_wav, though: is this the result from x-vectors or from the style encoder without specific speakers? If the style obtained from gradient descent works with other input, it means the problem is not in the generator or the discriminator, but in the style encoder, which is unable to find a style embedding space covering unseen speakers. If the style does not work with other input, it means the encoder of the generator may have been overfitted to reconstruct the input, so disentangling the input speaker information may be necessary.
This looks promising, so the problem probably is in the style encoder then. Can I ask how many speakers you used to train the style encoder, how many discriminators there were, and how you assigned these discriminators to those speakers? By the way, I didn't see "0y_out_huangmeixi_error_f0"; maybe you didn't upload it there, so I'm not sure what you meant by "In other words, the encoder will also encode f0." It is expected that the style encoder encodes the background noise, and it is actually the most obvious thing it will encode given how the loss is set up. However, if you don't want it to encode the recording environment, you can use a contrastive loss to make it noise-robust. That is, generate a noise-degraded copy of your audio and make the style encoder encode both of them into the same style vector. This is also usually how speaker embeddings like x-vector are trained.
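Here is a minimal sketch of that noise-robustness idea, assuming a `style_encoder(mel, speaker_id)` module exists; adding Gaussian noise directly to the mel is a crude stand-in for corrupting the waveform (noise, reverb, clicks) before computing the mel.

```python
import torch
import torch.nn.functional as F

def add_noise(mel, snr_db=20.0):
    # Crude mel-domain corruption; in practice corrupt the waveform instead.
    noise = torch.randn_like(mel)
    scale = mel.std() / (noise.std() * 10 ** (snr_db / 20))
    return mel + scale * noise

def style_consistency_loss(mel, speaker_id):
    # `style_encoder` is the project's style encoder (assumed to exist).
    s_clean = style_encoder(mel, speaker_id)
    s_noisy = style_encoder(add_noise(mel), speaker_id)
    # Pull the noisy style toward the clean one so the encoder ignores the
    # recording environment and keeps only speaker characteristics.
    return F.l1_loss(s_noisy, s_clean.detach())
```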
Sorry for the late reply. A total of 117 speakers are used as the dataset. There may be some noise in the data, including mouse clicks, pink noise, etc., but it is not loud. Twenty are English speech, and the rest are singing.
One discriminator for every 10 speakers, so there are 12 discriminators. I haven't had time to try other speaker-to-discriminator assignments. I also did not try to share parameters between the discriminators.
I have listened to the "0y_out_huangmeixi_error_f0" you uploaded, and if I understand correctly, you probably think the style is somehow "overfitted" in the sense that it also encodes the F0 of the reconstruction target? I think this is not true, because a vector of size 64 can't encode a whole F0 curve, but one training objective is that the average pitch of the reference is the same as the average pitch of the converted output, so it definitely learns the average F0. It also encodes how the pitch deviates from the input F0, because the style diversification loss also tries to maximize the F0 difference between two different styles. Hence, the style also encodes some information about the speaking/singing style of the target, which is desirable in our case.

The discriminator settings seem fair, but how did you train the style encoder? Are you still using the unshared linear projection, or is the style encoder now independent of the input speakers? What about the mapping network? Did you remove the mapping network in its entirety?
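As a small illustration of the average-pitch objective mentioned above, here is a hedged sketch; `f0_model` (a pretrained pitch extractor returning one F0 curve per utterance) and the exact normalization are assumptions, not the repository's actual loss.

```python
import torch

def mean_f0_loss(converted_mel, reference_mel):
    # `f0_model` is a pretrained pitch extractor (assumed to exist).
    f0_conv = f0_model(converted_mel)    # (B, T) pitch curves
    f0_ref = f0_model(reference_mel)
    # Match only the *average* pitch of the conversion to that of the reference,
    # leaving the frame-level pitch contour free to follow the source.
    return (f0_conv.mean(dim=-1) - f0_ref.mean(dim=-1)).abs().mean()
```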
Sorry for the late reply. I removed the mapping network. I use the original network with the unshared linear projection. Have you tried the improvements from StyleGAN2? According to my observation, if the sample input is fixed and optimized continuously with SGD, the gradient is mainly concentrated in the instance normalization. In addition, can the bCR-GAN loss be replaced by StyleGAN2-style adaptive discriminator augmentation (ADA)?
There is a problem with modeling breathing sounds; is there a way to deal with it?
I don't think StyleGAN2 is relevant to StarGANv2, because the main difference in StyleGAN2 is that they changed the instance normalization to drop the affine component (i.e., only normalizing and learning the standard deviation, not the mean). The same setting hurts performance in StarGANv2, as our model decodes from a latent space encoded by the encoder instead of from noise, so it's not really that relevant. I believe StyleGAN3 is more relevant if you are willing to try implementing an aliasing-free generator instead.

As for ADA, I was not able to find a set of augmentations and probabilities such that no leaks occur, which is the main reason I was using bCR-GAN. The augmentation didn't matter that much if you have enough data, so it doesn't really help for the VCTK-20 dataset. I put it there only for cases where some speakers have much less data than others (like only 5 minutes instead of 30 minutes as in VCTK). It does help with emotional conversion and noisy datasets, though.

I didn't encounter any problems with the breath sound. You can listen to the demo here, and the breath can be heard clearly. My guess is that your dataset is noisy, so the breath sound was filtered out as noise by the encoder. In that case, you may want to intentionally corrupt your input with audio augmentation.

Back to the style encoder problem: how do you encode unseen speakers if you have unshared components?
Sorry, there is a misunderstanding in the description here:

```python
class StyleEncoder(nn.Module):
    ...
```
Is it possible to add a wavelet transform to the model, for example following the design of SWAGAN's generator?
@980202006 It's definitely possible to add a wavelet transform to the model, and it could theoretically make a big difference, because the high-frequency content is what makes speech clear even when the mel-spectrograms look visually the same. However, I can't say exactly how much high-frequency content there is in a mel-spectrogram, because the resolution of mel specs is usually very low, and what vocoders do is exactly to recover the lost high-frequency information. I think fine-tuning with HiFi-GAN would probably do the same thing, but you can definitely try and see if it helps.
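For concreteness, here is a rough sketch of a one-level 2-D Haar decomposition of a mel-spectrogram batch, in the spirit of SWAGAN's wavelet-domain generator; how (or whether) to feed the sub-bands to the generator and discriminator is left open, and the even-sized `(B, 1, F, T)` input shape is an assumption.

```python
import torch

def haar_dwt2(x):
    # x: (B, 1, F, T) with even F and T. Returns the four orthonormal Haar
    # sub-bands; LL is the coarse content, LH/HL/HH carry high-frequency detail.
    a = x[..., 0::2, :] + x[..., 1::2, :]   # sums along the frequency axis
    d = x[..., 0::2, :] - x[..., 1::2, :]   # differences along the frequency axis
    ll = (a[..., 0::2] + a[..., 1::2]) * 0.5
    lh = (a[..., 0::2] - a[..., 1::2]) * 0.5
    hl = (d[..., 0::2] + d[..., 1::2]) * 0.5
    hh = (d[..., 0::2] - d[..., 1::2]) * 0.5
    return ll, lh, hl, hh
```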
Back to the style encoder problem: so I think you removed the unshared linear layers (N of them, where N is the number of speakers) and replaced them with a single linear projection shared by all speakers. I have tried this approach too, but it seems the style encoder has a hard time encoding the speaker characteristics and usually returns a style vector that sounds like a combination of speakers seen during training instead. However, if you use simple gradient descent to find the style that can reconstruct unseen speakers, it is usually possible to find such a style, and it preserves most of the characteristics during reconstruction, exactly like what you have presented here. In fact, in my case the style encoder sometimes even fails to find a style that reconstructs the seen speakers. My hypothesis is that the shared projection lacks the power to separate different speakers, while unshared projections force the model to learn more about the speaker characteristics. One way to verify this is to train a linear projection for each speaker that reconstructs the given input by fixing both the …
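To make the two settings being compared concrete, here is a hedged sketch of the unshared (per-speaker) head versus a single shared head on top of the style encoder's pooled features; `shared_backbone`, the dimensions, and the speaker count are assumptions, not the repository's exact code.

```python
import torch
import torch.nn as nn

class UnsharedHead(nn.Module):
    # One linear projection per training speaker (the original setting):
    # stronger at separating seen speakers, but undefined for unseen ones.
    def __init__(self, shared_backbone, hidden_dim=512, style_dim=64, num_speakers=117):
        super().__init__()
        self.shared = shared_backbone
        self.unshared = nn.ModuleList(
            [nn.Linear(hidden_dim, style_dim) for _ in range(num_speakers)]
        )

    def forward(self, mel, speaker_id):
        h = self.shared(mel)  # (B, hidden_dim) pooled features
        return torch.stack([self.unshared[int(i)](h[b]) for b, i in enumerate(speaker_id)])

class SharedHead(nn.Module):
    # A single projection for all speakers: applicable to unseen speakers, but
    # in practice it tends to output a blend of seen speakers' styles.
    def __init__(self, shared_backbone, hidden_dim=512, style_dim=64):
        super().__init__()
        self.shared = shared_backbone
        self.proj = nn.Linear(hidden_dim, style_dim)

    def forward(self, mel, speaker_id=None):
        return self.proj(self.shared(mel))
```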
In my model, I regard the style encoder as a speaker-information extraction model; that is, it extracts a high-dimensional representation of the speaker from the mel instead of fitting a specific speaker vector space. I prefer to represent a speaker as a point rather than a region of the space, which may result in the loss of some information. This is because I found that the original style encoder has an average pooling operation, which is very similar to x-vector or d-vector.
Thank you!
@yl4579 My own F0 model seems OK (see the attached plot), but I didn't add noise for augmentation when training the ASR and F0 models. Is data augmentation necessary? One more question: I want to train an any-to-one VC model; do I need to use an autoencoder instead of StarGAN?
@yl4579 Thanks. I also found that audio recorded from a mobile phone H5 page gives poor voice conversion results, similar to this example; by contrast, conversion of clean, dry recordings is OK. Is there any solution for mobile-phone channel compensation or data augmentation?
This is more likely to be a problem with your data or model, or a back-propagation problem caused by a particular torch operation. Since the model cannot fit the data well, it keeps trying to increase or decrease the scale of the data.
I am still trying to sort out the ideas here. The basic idea is to use multiple discriminators, each of which only discriminates a subset of the speakers (randomly selected).
@980202006 The clean voice sounds very good, though fine-tuning the vocoder would improve the sound quality. You may want to use vocoders specifically designed for singing synthesis. However, I cannot listen to the mobile-phone recorded results; I don't have permission for that, can you share the folder please? Although I can't listen to the samples, my guess is that voices recorded with mobile phones have worse sound quality, so the speakers' characteristics cannot be captured well by the model. You can either use data augmentation to corrupt the input to the style encoder for a more robust style representation, or you can do speech enhancement to improve the sound quality. This, for example, sounds exceptionally good: https://daps.cs.princeton.edu/projects/Su2021HiFi2/index.php
@yl4579
@yl4579 I feel I'm missing some background on the problems deep learning models can have. Are there reviews that cover the various issues, such as covariate shift?
@980202006 Did you add reverb to the input for the style encoder? How did you do the data augmentation?
@yl4579 Yes, I added reverb to the input data of the style encoder. Thank you.
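For reference, a minimal sketch of reverb augmentation applied to the waveform before extracting the mel fed to the style encoder; `rir` (a recorded or synthetic room impulse response as a 1-D tensor) is an assumption, and any RIR set would do.

```python
import torch
import torch.nn.functional as F

def reverberate(wav, rir):
    # wav: (T,) mono waveform, rir: (K,) room impulse response.
    rir = rir / rir.abs().max()
    kernel = rir.flip(0).view(1, 1, -1)           # conv1d is correlation; flip for true convolution
    out = F.conv1d(wav.view(1, 1, -1), kernel, padding=rir.numel() - 1)
    out = out[..., : wav.numel()].view(-1)        # keep the original length
    return out / out.abs().max().clamp(min=1e-8)  # rescale to avoid clipping
```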
@yl4579 @980202006 I found the speech intelligibility gets worse compared to the source, especially when I test Chinese with a model trained on English. How can this be mitigated? And @980202006, are you using a multi-language ASR for the multi-speaker training, since your datasets include Chinese, English, and singing?
@980202006 I listened to your demo, and I think the results are pretty good; the speech intelligibility in particular is very good. How do you manage that? Could you leave an e-mail for further discussion of the details of the Chinese demos?
@980202006 How do the results differ when the input to the style encoder is reverberated versus not reverberated? Do they sound similar or quite different?
@Kristopher-Chen The original model was not proposed to tackle cross-lingual voice conversion, so you may need to train an ASR model that works for both English and Chinese (e.g., using IPAs) and train a model with both English and Chinese data. The ASR training code will be made available soon, at the latest in late May.
@yl4579 Recently, I trained a model with 100 speakers from VCTK. When evaluating, I ran into some problems. https://drive.google.com/drive/folders/1lraGNF3tGzExGnmhvXo3QDrc72uE23zg?usp=sharing
If reverberated data is added to the style encoder input during training, it alleviates the problem for unseen inference, but it does not completely solve it.
@Kristopher-Chen Maybe you want to check whether your speaker classification discriminator has collapsed. Training this discriminator requires care. I guess it is better not to let its loss drop toward 0, but to keep it at a balanced value.
How is the ASR training code progressing now? I am really looking forward to it.
@MMMMichaelzhang It is available here: https://github.com/yl4579/AuxiliaryASR
Hi. How is the JDC code progressing? Thank you very much~
@Charlottecuc I'm still working on it; I'll probably create another repo by this week.
@Charlottecuc The training code for the F0 model is available now: https://github.com/yl4579/PitchExtractor
@CrackerHax Your loss becomes NaN, so the model is broken. This is likely caused by bad normalization, because some value exceeds 65504 (the float16 maximum). See #6 (comment)
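Two quick checks for this kind of failure, as a hedged sketch: enable autograd anomaly detection to locate the op that first produces NaN, and assert that the normalized input stays inside the float16 range before training. The `mel` tensor name is an assumption.

```python
import torch

torch.autograd.set_detect_anomaly(True)   # reports the op that first produced NaN/Inf

def check_batch(mel):
    assert torch.isfinite(mel).all(), "non-finite values in the input batch"
    assert mel.abs().max() < 65504, "value exceeds the float16 range; fix the normalization"
```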
I trained again with fp16=false and still got NaN (at the same epoch as with fp16=true). The only change I made in the config file was making it a single voice (num_domains: 1).
I did some transfer learning with 20 voices on the default model and it worked fine.
@980202006 @yl4579
I set num_domains=1 and met the same problem. Have you solved it? @CrackerHax
Hi, thanks for this project. I have tried to remove the domain information from the style encoder, which does have a certain effect and can generate natural sound, but there are the following problems:
Reconstruction works better when the original audio is fed to the style encoder.
Data used:
Batch size: 32 (8 per GPU)
Can you provide some suggestions, whether about the data or the model?