Is there a way to generate a single image using multiple GPUs? #11483
Replies: 11 comments
-
Hi, there are multiple ways to interpret "generate a single image using multiple GPUs", so maybe you can be more specific about this. The most basic way of doing it is splitting the different steps and models across separate GPUs, for example the text encoders and VAE on GPU 0 and the unet/transformer model on GPU 1. You can do this manually without much trouble, or you can do it with accelerate, which is also covered in the docs under device placement (a minimal sketch of both options is below). But I'm guessing you're referring to model sharding, which you can read about in the docs. In the same section you can also read about accelerate and parallelism.
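To make those two options concrete, here is a minimal sketch assuming Stable Diffusion 1.5 and two visible GPUs; the model id, prompt, and the "balanced" placement strategy are illustrative and should be adapted to your setup.

```python
import torch
from diffusers import StableDiffusionPipeline

# Option 1: let accelerate spread the pipeline components (text encoder,
# unet, vae) across the available GPUs via a device map.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    device_map="balanced",
)
print(pipe.hf_device_map)  # shows which component ended up on which GPU
image = pipe("a photo of an astronaut riding a horse").images[0]

# Option 2 (manual): load the pipeline normally, move pipe.text_encoder and
# pipe.vae to cuda:0 and pipe.unet to cuda:1, then call the components
# yourself and .to() the intermediates (prompt embeddings, latents) between
# devices instead of relying on pipe.__call__.
```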
-
Hello,
-
AFAIK you won't get faster inference times from multiple GPUs than from a single one; the only reason multiple GPUs would give you faster inference is if the model doesn't fit on a single one. Do you have a reference where this is true?
-
@suzukimain Pipeline parallelism will not be ideal for your use case. It is typically better for really large models when you want to generate >1 image faster than generating them individually on multiple GPUs, as done in data parallelism/sharding, and it's also better suited for training than for inference. What you're looking for, with single-image multi-GPU, is tensor and context parallelism. These methods allow you to significantly speed up generation. Two good starting points are xDiT and ParaAttention.

We have plans for natively supporting tensor parallelism soon. It's not very hard to implement yourself, and PyTorch's DTensor API has a small learning curve -- you can give this a look: https://pytorch.org/tutorials/intermediate/TP_tutorial.html (a minimal sketch is below).

Context parallelism can give you the fastest way to do single-image multi-GPU, but it is conceptually harder to understand. The simplest variant that you could try looking at is ring attention: it cleverly splits the attention query/key/value tensors across the sequence dimension, performs partial computations on each GPU, and combines the partials to get the attention output. Here's a good explainer if you're interested: https://coconut-mode.com/posts/ring-attention/. If you'd like to just use something that works out of the box without diving into the theory too much, PyTorch has experimental support that you can look into.
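For the tensor-parallel route, here is a minimal, self-contained sketch of the Megatron-style column/row split on a toy feed-forward block using PyTorch's DTensor API. It is not diffusers-specific; the block, dimensions, file name, and two-GPU mesh are illustrative assumptions.

```python
# Launch with: torchrun --nproc_per_node=2 tp_sketch.py  (hypothetical file name)
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class FeedForward(nn.Module):
    """Toy stand-in for the feed-forward block of a transformer/UNet layer."""

    def __init__(self, dim=1024, hidden=4096):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(self.act(self.up(x)))


def main():
    mesh = init_device_mesh("cuda", (2,))  # one 2-GPU tensor-parallel group
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    torch.manual_seed(0)  # same weights and input on every rank

    model = FeedForward().cuda()
    # Column-split the up projection and row-split the down projection, so each
    # GPU holds half of the weights and the only communication per forward pass
    # is one all-reduce on the output.
    parallelize_module(model, mesh, {"up": ColwiseParallel(), "down": RowwiseParallel()})

    x = torch.randn(1, 77, 1024, device="cuda")
    with torch.no_grad():
        y = model(x)
    if rank == 0:
        print(y.shape)  # torch.Size([1, 77, 1024]), replicated on both ranks


if __name__ == "__main__":
    main()
```

The same plan extends to the attention and feed-forward projections inside a UNet or DiT block, which is essentially what the tensor-parallel modes of xDiT/ParaAttention set up for you.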
-
Oh, I must add: xDiT and ParaAttention are for transformer models. I was fixated on Stable Diffusion 1.5 and UNets in my response because of the issues OP posted for context.
-
Thanks for clarifying! xDiT and ParaAttention support tensor/context parallel, which can be used with any model that contains feed-forward and attention layers. So SD1.5 can work with it too (might need some modifications), even if it's a UNet arch. Although, it's a very small model and the GPU communication overhead may outweigh the benefits of applying these techniques.
-
Mostly I was thinking that, hence my answer. For a small model I'm almost sure the GPU communication will be a lot slower than just using a single GPU, or in the best-case scenario the performance gain won't be enough to justify multiple GPUs for single-image inference. Also, for ParaAttention I was mostly going by what I've read in a response from the author, and for xDiT by the name and the models it supports. But as always, this would need testing. @suzukimain, if you decide to test this, please let us know if you succeed, or share your experience with them.
-
I just remembered another thing. Since most models use CFG to generate high-quality images, you can parallelize across the batch dimension (essentially data parallelism, but for a single image): the unconditional and conditional branches run on different GPUs. This can roughly speed up generation by 30-50% (if the two GPUs you're using are the same). The downside is that it requires each GPU to be able to fit the model entirely. Of course, you can apply clever offloading, but for SD1.5 this should work great even on low VRAM! This can serve as an easy starting point: #10879 (though it's not going to be merged, as it's an example for a blog post about custom hooks). A rough sketch of the idea is below.
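To illustrate, here is a minimal hand-rolled denoising loop that runs the conditional pass on a second copy of the UNet on GPU 1 while GPU 0 runs the unconditional pass. It assumes Stable Diffusion 1.5, two GPUs that each fit the UNet, and illustrative values for the prompt, step count, and guidance scale; the hook-based approach in the linked PR is the more polished route, this is just the idea in plain code.

```python
import copy

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda:0")
# Second replica of the UNet for the conditional branch of CFG.
unet_cond = copy.deepcopy(pipe.unet).to("cuda:1")

prompt_embeds, negative_embeds = pipe.encode_prompt(
    "a photo of an astronaut riding a horse",
    device="cuda:0",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)
cond_embeds = prompt_embeds.to("cuda:1")

pipe.scheduler.set_timesteps(30, device="cuda:0")
latents = torch.randn(1, 4, 64, 64, device="cuda:0", dtype=torch.float16)
latents = latents * pipe.scheduler.init_noise_sigma
guidance_scale = 7.5

with torch.no_grad():
    for t in pipe.scheduler.timesteps:
        latent_in = pipe.scheduler.scale_model_input(latents, t)
        latent_in_1 = latent_in.to("cuda:1")
        # Both UNet calls are queued before either result is consumed, so the
        # two passes can run concurrently on the two GPUs.
        noise_cond = unet_cond(latent_in_1, t, encoder_hidden_states=cond_embeds).sample
        noise_uncond = pipe.unet(latent_in, t, encoder_hidden_states=negative_embeds).sample
        noise_pred = noise_uncond + guidance_scale * (noise_cond.to("cuda:0") - noise_uncond)
        latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
    image = pipe.image_processor.postprocess(image)[0]
```

The only per-step overhead is copying the latents and the conditional noise prediction between the two GPUs, which is small compared to the UNet forward pass.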
-
Also, I was thinking of consumer-grade infrastructure. If you are planning a commercial solution, the new NVLink switches that NVIDIA presented yesterday have more bandwidth than a 5090 (as an example), so communication lag between GPUs is not an issue if you have the money for it.
-
If you just want to accelerate the image generation, I would like to recommend lyraDiff, a speedup tool for diffusers. As described, it takes only half the time to generate a 1024x1024 image compared to the original diffusers, and the image quality is not lost due to acceleration. Moreover, the code is very similar to diffusers and is easy to use. Here is the GitHub link: https://github.com/TMElyralab/lyraDiff
-
I'm converting this to a discussion since it's not really an issue, and it's an interesting topic that can be discussed further.
-
This is related to #2977 and #3392, but I would like to know how to generate a single image using multiple GPUs. If such a method does not exist, I would also like to know if Accelerate's Memory-efficient pipeline parallelism can be applied to this.
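For reference, this is roughly what I mean by Accelerate's memory-efficient pipeline parallelism, shown on a toy model rather than a diffusers pipeline. I'm writing prepare_pippy and its arguments from memory of the Accelerate docs, so please treat them as an assumption and double-check the current API.

```python
# Launch with: accelerate launch --num_processes 2 pippy_sketch.py  (hypothetical file name)
# NOTE: prepare_pippy and its argument names are written from memory of the
# Accelerate docs and may differ across versions -- please double-check.
import torch
import torch.nn as nn
from accelerate import PartialState
from accelerate.inference import prepare_pippy

# Toy stand-in for a large model that doesn't fit on one GPU.
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(),
    nn.Linear(4096, 4096), nn.GELU(),
    nn.Linear(4096, 1024),
)
example_input = torch.randn(1, 1024)

# Split the model into one stage per GPU; each process only keeps the weights
# of its own stage, and micro-batches are scheduled through the stages.
model = prepare_pippy(model, split_points="auto", example_args=(example_input,))

with torch.no_grad():
    output = model(example_input.to("cuda:0"))

# Only the last stage ends up with the final output.
if PartialState().is_last_process:
    print(output.shape)
```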