How to improve the reconstruction of high-frequency details in the VQVAE training? #25
Comments
After 60 epochs, the results have significantly improved (see below). However, I still notice that high-frequency details are missing from the reconstructions, and almost every image contains green spots. Is there a way to enhance the reconstruction of high-frequency components and eliminate the green spots? Thank you!
Hello @vadori, can you share your config file for the VQVAE training?
Thanks for replying!! Let me share the config file.
Regarding the inference, the snapshots I posted are from the visualizations during training, where values are actually clipped to 255, so the line of code in the inference might solve the problem. Thank you! Do you also have any suggestions about which parameters to tweak to ensure high-frequency components are preserved?
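For reference, a minimal sketch of the clamping step mentioned above, assuming the decoder output is expected in [-1, 1] before being mapped to 8-bit pixel values (the tensor here is just a stand-in, not the repo's actual decoder output):

```python
import torch

# Stand-in for a decoder output; assumed to live in roughly [-1, 1].
recon = torch.randn(1, 3, 256, 256)

# Clamp before converting to uint8 so out-of-range values do not wrap
# around and show up as coloured speckles in the saved reconstruction.
recon = torch.clamp(recon, -1.0, 1.0)
image = ((recon + 1.0) * 127.5).to(torch.uint8)   # [-1, 1] -> [0, 255]
print(image.min().item(), image.max().item())
```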
I see that the disc_start parameter has been set to 10 iterations. Have you tried training the VQVAE with a higher value of disc_start, maybe 5000? If not, could you try that once? (Assuming you have at least 1K images, with a batch size of 4 that is about 20 epochs.)
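For context, here is a minimal sketch of how a disc_start setting typically gates the adversarial term during training; the variable names and the loss composition in the comment are assumptions, not the repo's exact code:

```python
# Delay the discriminator so the autoencoder first learns a decent
# reconstruction before the adversarial loss kicks in.
disc_start = 5000  # with ~1K images and batch size 4, roughly 20 epochs

def adversarial_weight(step: int, start: int) -> float:
    """Return 0.0 until `start` iterations have passed, 1.0 afterwards."""
    return 0.0 if step < start else 1.0

# Inside the training loop (hypothetical loss names):
# loss = recon_loss + codebook_loss + adversarial_weight(step, disc_start) * gen_loss
```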
Many thanks for your kind reply @explainingai-code! I am testing with more vectors in the codebook - do you think that can help, based on your experience? Additionally, what parameter controls the size of the autoencoder bottleneck? z_channels? I want to test with a different configuration and see if it helps. I am also experimenting with additional losses to see if I can enforce the reconstruction of high frequencies. Again, thank you.
@vadori
This autoencoder config will have a downscale factor of 8 (256x256 reduced to 32x32), as down_sample is done three times and there are three down blocks (from len(down_channels) - 1). Let me know if this works for you or if you run into any issues with these changes.
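A quick sanity check of that downscale factor, using assumed config values (the actual channel counts may differ):

```python
# Three down blocks (len(down_channels) - 1 == 3), each with down_sample=True.
down_channels = [64, 128, 256, 384]
down_sample = [True, True, True]

image_size = 256
factor = 2 ** sum(down_sample)       # each True halves the spatial size -> 2^3 = 8
latent_size = image_size // factor   # 256 // 8 = 32
print(factor, latent_size)           # 8 32
```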
Hi again @explainingai-code, thanks for your reply!!
I also added two additional loss functions, and it looks like the performance has improved. However, in order to understand what could be a better solution (such as the one you suggested), I would need to be sure I understand the parameters. Could you help me here? Or is there any documentation I could look at to get this info? codebook_size is quite self-explanatory, but what about the following variables? Could you please confirm/correct? It would be immensely helpful!
Also, for example, with the following parameters, are the sizes of the outputs below correct?
Output of the first down layer: 128x128x384. Then, after the middle layers, a Conv2d module is applied to get a 32x32x4 output, which is the size of the bottleneck, and each vector of dimension 4 is mapped onto a codeword (quantized). If I set z_channels: 32, I would instead get a 32x32x32 bottleneck, and each vector of dimension 32 would be mapped onto a codeword (quantized). Thank you!
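A small shape walk-through of the understanding above, using stand-in modules (z_channels = 4 and the 384-channel feature map follow the shapes described; codebook_size = 8192 is an arbitrary example, and the layers are not the repo's actual encoder):

```python
import torch
import torch.nn as nn

z_channels, codebook_size = 4, 8192
features = torch.randn(1, 384, 32, 32)                 # output of the mid blocks
to_latent = nn.Conv2d(384, z_channels, kernel_size=1)  # projection to the bottleneck
z = to_latent(features)                                # (1, 4, 32, 32)

# Quantization: each of the 32*32 spatial vectors (dim 4) is replaced by
# its nearest codeword from the codebook.
codebook = nn.Embedding(codebook_size, z_channels)
flat = z.permute(0, 2, 3, 1).reshape(-1, z_channels)   # 1024 vectors of dim 4
indices = torch.cdist(flat, codebook.weight).argmin(dim=1)
z_q = codebook(indices).reshape(1, 32, 32, z_channels).permute(0, 3, 1, 2)
print(z.shape, z_q.shape)                              # both (1, 4, 32, 32)
```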
@vadori Went through everything and your understanding of all of them is correct. Just one thing, I don't have any
Hi again @explainingai-code, Edit: In the LDM paper (https://arxiv.org/pdf/2112.10752), it looks like for image-to-image translation tasks, and specifically for semantic image synthesis, they actually concatenated downsampled versions of the semantic maps to the latent image representation - which is somewhat surprising to me, as I was expecting the conditioning to be needed at more points in the network. It would be great if you could let me know whether you have any experience with the two distinct settings (with and without cross-attention at intermediate layers for semantic image synthesis).
Hello @vadori, however, for use cases like, say, generating variations of an image, where the goal is to retain the semantic aspects of the image but not necessarily the spatial layout, cross-attention would be the appropriate choice.
Thank you! Why would you say that avoiding cross-attention would work better? I am interested in your intuition, even though you did not experiment with it. I am modifying the code accordingly, using a custom encoder to generate mask encodings. I am planning to try both solutions (with and without cross-attention).
By better, I am mainly referring to how easy it is for the model to learn 'how to use spatial conditioning'. With concatenation, because of the convolution layer, each noisy pixel x_ij is only impacted by the corresponding spatial pixel cond_ij, which is exactly what we need, since we want the denoising process to generate an image that has the exact same layout as the conditioning image.
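A minimal sketch of that concatenation-style conditioning, assuming a 4-channel noisy latent and a mask encoding already at latent resolution (shapes and channel counts are illustrative only):

```python
import torch
import torch.nn as nn

noisy_latent = torch.randn(1, 4, 32, 32)   # noisy latent at timestep t
mask_latent = torch.randn(1, 4, 32, 32)    # spatial conditioning at latent resolution

# The first UNet convolution is simply widened to accept the extra channels,
# so every output location only sees its local spatial neighbourhood.
unet_in = nn.Conv2d(4 + 4, 64, kernel_size=3, padding=1)
x = unet_in(torch.cat([noisy_latent, mask_latent], dim=1))
print(x.shape)  # (1, 64, 32, 32)
```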
Thank you very much for your response. You may be right :) I am curious to see what works. What I like about cross-attention is that it is repeated throughout the network multiple times, offering repeated guidance, while conditioning via concatenation is performed once and the information is potentially diluted (i.e., lost) on the way from input to output - but this is just how I imagine it. Maybe concatenation could be performed at multiple layers. Experiments should answer. Maybe, in my case, neither works, because I am working in the latent space with semantic mask encodings, not downsampled versions of the semantic masks. This is because, with simple downsampling, I lose too many details (the components in the mask are tiny, some disappear, and I must ensure that all of them persist). The conditioning input (encoded mask) encodes a spatial layout, but the model must learn this in order to actually apply spatial conditioning when generating the images. When downsampling the conditioning semantic mask rather than encoding it, the fact that the mask carries a spatial layout is explicit, and the model does not need to learn this, only to constrain the generation accordingly.
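For comparison with the concatenation sketch above, here is a minimal sketch of cross-attention conditioning at an intermediate layer; the dimensions, number of context tokens, and residual wiring are assumptions, not the repo's actual attention block:

```python
import torch
import torch.nn as nn

d_model = 64
feat = torch.randn(1, d_model, 16, 16)   # intermediate UNet feature map
context = torch.randn(1, 64, d_model)    # conditioning tokens, e.g. a flattened mask encoding

# Queries come from the image features, keys/values from the conditioning,
# so every spatial location can attend to the whole mask encoding.
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
q = feat.flatten(2).transpose(1, 2)      # (1, 256, 64)
out, _ = attn(q, context, context)       # (1, 256, 64)
feat = feat + out.transpose(1, 2).reshape_as(feat)  # residual add
print(feat.shape)                        # (1, 64, 16, 16)
```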
Got it. Regarding simple downscaling leading to loss of details, another thing you could try: instead of passing a downsampled version, pass the normal mask (same size as the original image) and add additional conv2d layers that take the mask from, say, 256x256 (original image size) to 32x32 (latent size). Then concat this to the noisy input, allowing the model to learn features that preserve the low-level details a simply downscaled mask loses.
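A minimal sketch of that suggestion (a small learned downsampler in place of naive resizing); the channel counts, activation, and layer count are assumptions:

```python
import torch
import torch.nn as nn

# Learned downsampling of the full-resolution mask: 256 -> 128 -> 64 -> 32.
mask_encoder = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
    nn.SiLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
    nn.SiLU(),
    nn.Conv2d(64, 4, kernel_size=3, stride=2, padding=1),
)

mask = torch.randn(1, 1, 256, 256)           # full-resolution semantic mask
mask_latent = mask_encoder(mask)             # (1, 4, 32, 32)
noisy_latent = torch.randn(1, 4, 32, 32)
unet_input = torch.cat([noisy_latent, mask_latent], dim=1)  # concat, as suggested
print(unet_input.shape)                      # (1, 8, 32, 32)
```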
Great suggestion, thank you! I'll update you as soon as I have some results. Hopefully, it won’t be too long! 😄
Thank you so much for sharing!
Could you provide insights into the number of epochs required to achieve high-resolution, fine details during VQVAE training for 256x256 RGB images?
Additionally, has anyone compared VQVAE to VAE? Is VQVAE performing better? What key parameters can be adjusted in VQVAE to improve performance if the results aren't satisfactory?
I am trying to use the VQVAE to encode histological images. After 6 epochs with a batch size of 4, I get this result on a sample image:
I'm wondering if the uniform color in the foreground, the absence of high-resolution details, and the presence of a seemingly repeated textural pattern across the image are simply the result of training for too few epochs and the model needing more time. Or could this indicate a more fundamental issue with my experimental setup? What are your thoughts?
Of course, it would be great if anyone could share their opinion on this!
Thanks again!