Is there a way to generate a single image using multiple GPUs? #11483
Replies: 11 comments
-
Hi, there are multiple ways to interpret "generate a single image using multiple GPUs", so maybe you can be more specific about this. The most basic way of doing it is splitting the different steps and models across separate GPUs, for example the text encoders and VAE on GPU 0 and the unet/transformer model on GPU 1. You can do this manually without much trouble, or you can do it with accelerate, which is also covered in the docs under device placement (a minimal sketch of both options is below). But I'm guessing you're referring to model sharding, which you can read about in the docs. In the same section you can also read about accelerate and parallelism.
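To make those two options concrete, here is a minimal sketch assuming Stable Diffusion 1.5 and two visible GPUs; the model id, prompt, and the "balanced" placement strategy are illustrative and should be adapted to your setup.

```python
import torch
from diffusers import StableDiffusionPipeline

# Option 1: let accelerate spread the pipeline components (text encoder,
# unet, vae) across the available GPUs via a device map.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    device_map="balanced",
)
print(pipe.hf_device_map)  # shows which component ended up on which GPU
image = pipe("a photo of an astronaut riding a horse").images[0]

# Option 2 (manual): load the pipeline normally, move pipe.text_encoder and
# pipe.vae to cuda:0 and pipe.unet to cuda:1, then call the components
# yourself and .to() the intermediates (prompt embeddings, latents) between
# devices instead of relying on pipe.__call__.
```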
-
Hello,
-
AFAIK you won't get faster inference times from multiple GPUs than from a single one; the only reason multiple GPUs would give you faster inference is if the model doesn't fit on a single one. Do you have a reference where this is true?
-
@suzukimain Pipeline parallelism will not be ideal for your use case. It is typically better for really large models when you want to generate >1 image faster than generating them individually on multiple GPUs, as done in data parallelism/sharding, and it's also better suited for training than for inference. What you're looking for, with single-image multi-GPU, is tensor and context parallelism. These methods allow you to significantly speed up generation. Two good starting points are xDiT and ParaAttention.

We have plans for natively supporting tensor parallelism soon. It's not very hard to implement yourself, and PyTorch's DTensor API has a small learning curve -- you can give this a look: https://pytorch.org/tutorials/intermediate/TP_tutorial.html (a minimal sketch is below).

Context parallelism can give you the fastest way to do single-image multi-GPU, but it is conceptually harder to understand. The simplest variant that you could try looking at is ring attention: it cleverly splits the attention query/key/value tensors across the sequence dimension, performs partial computations on each GPU, and combines the partials to get the attention output. Here's a good explainer if you're interested: https://coconut-mode.com/posts/ring-attention/. If you'd like to just use something that works out of the box without diving into the theory too much, PyTorch has experimental support that you can look into.
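For the tensor-parallel route, here is a minimal, self-contained sketch of the Megatron-style column/row split on a toy feed-forward block using PyTorch's DTensor API. It is not diffusers-specific; the block, dimensions, file name, and two-GPU mesh are illustrative assumptions.

```python
# Launch with: torchrun --nproc_per_node=2 tp_sketch.py  (hypothetical file name)
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class FeedForward(nn.Module):
    """Toy stand-in for the feed-forward block of a transformer/UNet layer."""

    def __init__(self, dim=1024, hidden=4096):
        super().__init__()
        self.up = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(self.act(self.up(x)))


def main():
    mesh = init_device_mesh("cuda", (2,))  # one 2-GPU tensor-parallel group
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    torch.manual_seed(0)  # same weights and input on every rank

    model = FeedForward().cuda()
    # Column-split the up projection and row-split the down projection, so each
    # GPU holds half of the weights and the only communication per forward pass
    # is one all-reduce on the output.
    parallelize_module(model, mesh, {"up": ColwiseParallel(), "down": RowwiseParallel()})

    x = torch.randn(1, 77, 1024, device="cuda")
    with torch.no_grad():
        y = model(x)
    if rank == 0:
        print(y.shape)  # torch.Size([1, 77, 1024]), replicated on both ranks


if __name__ == "__main__":
    main()
```

The same plan extends to the attention and feed-forward projections inside a UNet or DiT block, which is essentially what the tensor-parallel modes of xDiT/ParaAttention set up for you.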
-
Oh, I must add: xDiT and ParaAttention are for transformer models. I was fixated on Stable Diffusion 1.5 and UNets in my response because of the issues OP posted for context.
-
Thanks for clarifying! xDiT and ParaAttention support tensor/context parallel, which can be used with any model that contains feed-forward and attention layers. So SD1.5 can work with it too (might need some modifications), even if it's a UNet arch. Although, it's a very small model and the GPU communication overhead may outweigh the benefits of applying these techniques.
-
Mostly I was thinking that, hence my answer. For a small model I'm almost sure the GPU communication will be a lot slower than just using a single GPU, or in the best-case scenario the performance gain won't be enough to justify multiple GPUs for single-image inference. Also, for ParaAttention I was mostly going by what I've read in a response from the author, and for xDiT by the name and the models it supports. But as always, this would need testing. @suzukimain, if you decide to test this, please let us know if you succeed, or share your experience with them.
-
I just remembered another thing. Since most models use CFG to generate high-quality images, you can parallelize across the batch dimension (essentially data parallelism, but for a single image): the unconditional and conditional branches run on different GPUs. This can roughly speed up generation by 30-50% (if the two GPUs you're using are the same). The downside is that it requires each GPU to be able to fit the model entirely. Of course, you can apply clever offloading, but for SD1.5 this should work great even on low VRAM! This can serve as an easy starting point: #10879 (though it's not going to be merged, as it's an example for a blog post about custom hooks). A rough sketch of the idea is below.
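To illustrate, here is a minimal hand-rolled denoising loop that runs the conditional pass on a second copy of the UNet on GPU 1 while GPU 0 runs the unconditional pass. It assumes Stable Diffusion 1.5, two GPUs that each fit the UNet, and illustrative values for the prompt, step count, and guidance scale; the hook-based approach in the linked PR is the more polished route, this is just the idea in plain code.

```python
import copy

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda:0")
# Second replica of the UNet for the conditional branch of CFG.
unet_cond = copy.deepcopy(pipe.unet).to("cuda:1")

prompt_embeds, negative_embeds = pipe.encode_prompt(
    "a photo of an astronaut riding a horse",
    device="cuda:0",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)
cond_embeds = prompt_embeds.to("cuda:1")

pipe.scheduler.set_timesteps(30, device="cuda:0")
latents = torch.randn(1, 4, 64, 64, device="cuda:0", dtype=torch.float16)
latents = latents * pipe.scheduler.init_noise_sigma
guidance_scale = 7.5

with torch.no_grad():
    for t in pipe.scheduler.timesteps:
        latent_in = pipe.scheduler.scale_model_input(latents, t)
        latent_in_1 = latent_in.to("cuda:1")
        # Both UNet calls are queued before either result is consumed, so the
        # two passes can run concurrently on the two GPUs.
        noise_cond = unet_cond(latent_in_1, t, encoder_hidden_states=cond_embeds).sample
        noise_uncond = pipe.unet(latent_in, t, encoder_hidden_states=negative_embeds).sample
        noise_pred = noise_uncond + guidance_scale * (noise_cond.to("cuda:0") - noise_uncond)
        latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
    image = pipe.image_processor.postprocess(image)[0]
```

The only per-step overhead is copying the latents and the conditional noise prediction between the two GPUs, which is small compared to the UNet forward pass.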
-
Also, I was thinking of consumer-grade infrastructure. If you are planning a commercial solution, the new NVLink switches that NVIDIA presented yesterday have more bandwidth than a 5090 (as an example), so communication lag between GPUs is not an issue if you have the money for it.
-
If you just want to accelerate the image generation, I would like to recommend lyraDiff, a speedup tool for diffusers. As described, it takes only half the time to generate a 1024x1024 image compared to the original diffusers, and the image quality is not lost due to acceleration. Moreover, the code is very similar to diffusers and is easy to use. Here is the GitHub link: https://github.com/TMElyralab/lyraDiff
-
I'm converting this to a discussion since it's not really an issue, and it's an interesting topic that can be discussed further.
-
This is related to #2977 and #3392, but I would like to know how to generate a single image using multiple GPUs. If such a method does not exist, I would also like to know if Accelerate's Memory-efficient pipeline parallelism can be applied to this.
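For reference, this is roughly what I mean by Accelerate's memory-efficient pipeline parallelism, shown on a toy model rather than a diffusers pipeline. I'm writing prepare_pippy and its arguments from memory of the Accelerate docs, so please treat them as an assumption and double-check the current API.

```python
# Launch with: accelerate launch --num_processes 2 pippy_sketch.py  (hypothetical file name)
# NOTE: prepare_pippy and its argument names are written from memory of the
# Accelerate docs and may differ across versions -- please double-check.
import torch
import torch.nn as nn
from accelerate import PartialState
from accelerate.inference import prepare_pippy

# Toy stand-in for a large model that doesn't fit on one GPU.
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(),
    nn.Linear(4096, 4096), nn.GELU(),
    nn.Linear(4096, 1024),
)
example_input = torch.randn(1, 1024)

# Split the model into one stage per GPU; each process only keeps the weights
# of its own stage, and micro-batches are scheduled through the stages.
model = prepare_pippy(model, split_points="auto", example_args=(example_input,))

with torch.no_grad():
    output = model(example_input.to("cuda:0"))

# Only the last stage ends up with the final output.
if PartialState().is_last_process:
    print(output.shape)
```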