Is deepspeed zero stage 3 MP or PP. Same as FSDP? #3476

andrasiani · 2023-05-06T23:04:15Z

Hi, I've been studying FSDP and DeepSpeed.
From the PyTorch FSDP docs I understand that the model's layers are split vertically, which is MP: if layer L1 has weights a0 and a1, a0 goes to gpu0 and a1 goes to gpu1.

Some sources claim DeepSpeed ZeRO does the same, but the visualization video in your documentation looks like it splits the network horizontally: the first N layers on gpu0, the next N layers on gpu1, etc. Is this pipeline parallelism (PP)?

I am confused.
When I use ZeRO stage 3, what am I actually doing: MP or PP? Is it different from FSDP?
Thanks!

renke999 · 2023-05-08T09:40:27Z

I have this question too, and I found this page that might help

1 reply

andrasiani May 9, 2023
Author

Thanks!

andrasiani · 2023-05-09T15:19:08Z

This DeepSpeed visualization splits the model horizontally, with the first n layers on gpu0, the next n on gpu1, etc.: https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/

Other sources, and FSDP, indicate that the model is split vertically: for a layer L with weights w0, w1, w2, ..., w0 goes on gpu0, w1 on gpu1, etc. This is FSDP.

Is the DeepSpeed visualization accurate?
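To make the two pictures in this question concrete, here is a toy sketch in plain Python (no GPUs involved). The helper names `split_by_layer` and `split_within_layer` are made up for illustration and are not DeepSpeed or FSDP APIs. The "horizontal" layout places whole layers on a device, which is the pipeline-style picture; the "vertical" layout gives every device a slice of every layer, which is the sharded picture.

```python
def split_by_layer(layers, num_gpus):
    """'Horizontal' split: each GPU gets a contiguous block of whole layers."""
    items = list(layers.items())
    per_gpu = -(-len(items) // num_gpus)  # ceiling division
    return [dict(items[i * per_gpu:(i + 1) * per_gpu]) for i in range(num_gpus)]


def split_within_layer(layers, num_gpus):
    """'Vertical' split: each GPU gets a slice of every layer's weights."""
    placement = [dict() for _ in range(num_gpus)]
    for name, weights in layers.items():
        shard = -(-len(weights) // num_gpus)  # ceiling division
        for rank in range(num_gpus):
            placement[rank][name] = weights[rank * shard:(rank + 1) * shard]
    return placement


# Hypothetical 4-layer model with two weights per layer.
layers = {
    "L1": ["a0", "a1"],
    "L2": ["b0", "b1"],
    "L3": ["c0", "c1"],
    "L4": ["d0", "d1"],
}

horizontal = split_by_layer(layers, num_gpus=2)
vertical = split_within_layer(layers, num_gpus=2)

print("horizontal gpu0:", horizontal[0])  # whole layers L1 and L2
print("vertical   gpu0:", vertical[0])    # first half of every layer
```

In the horizontal layout gpu0 can only run L1 and L2, so activations have to flow from gpu0 to gpu1. In the vertical layout every GPU touches every layer, so each one can process its own batch as long as it fetches the missing slices first.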

3 replies

wenchenvincent Jun 7, 2023

I haven't dived into the source code of either DeepSpeed or PyTorch FSDP, but from various sources they seem very similar. This HuggingFace blog post (https://huggingface.co/blog/accelerate-deepspeed) says that "DeepSpeed, FairScale and PyTorch FullyShardedDataParallel (FSDP) have implemented the core ideas of the ZERO paper."

There does seem to be a difference in how they partition the parameters. As you said, the DeepSpeed visualization splits the model horizontally. FSDP seems to split the model vertically (its visualization shows that, for each layer, the weights are retrieved from the other GPUs using all-gather instead of the broadcast used in ZeRO). The FSDP announcement also says: "Compared with optimizer state+gradient sharding data parallel methods, FSDP shards parameters more uniformly and is capable of better performance via communication and computation overlapping during training." (https://engineering.fb.com/2021/07/15/open-source/fsdp/). A vertical partition would be more uniform than a horizontal one.
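The per-layer all-gather mentioned above can be simulated in a few lines of plain Python. This is only a sketch under simplifying assumptions: `shard`, `all_gather`, and `forward` are hypothetical stand-ins, not `torch.distributed` or FSDP calls, and the "layer" is just a dot product.

```python
def shard(flat_weights, world_size):
    """Split a flat weight list into equal contiguous shards, zero-padding the tail."""
    size = -(-len(flat_weights) // world_size)  # ceiling division
    padded = flat_weights + [0.0] * (size * world_size - len(flat_weights))
    return [padded[r * size:(r + 1) * size] for r in range(world_size)]


def all_gather(shards, numel):
    """Stand-in for the collective: every rank receives the full, unpadded weights."""
    full = [w for s in shards for w in s][:numel]
    return [list(full) for _ in shards]


def forward(x, full_weights):
    """Toy layer: dot product of the input with the full weight vector."""
    return sum(xi * wi for xi, wi in zip(x, full_weights))


weights = [0.5, -1.0, 2.0, 1.5, 3.0]
world_size = 2

# Between layers, each rank stores only its own shard.
shards = shard(weights, world_size)

# Just before the layer runs, every rank gathers the full weights.
gathered = all_gather(shards, len(weights))

# Data parallelism: each rank runs the same layer on its own input.
inputs = [[1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0, 1.0]]
outputs = [forward(inputs[r], gathered[r]) for r in range(world_size)]

# After the layer, a real implementation would free the gathered copy
# and keep only the local shard again.
print(shards)
print(outputs)
```

The point of the sketch is that every rank computes the whole layer on its own data; only the storage of the weights is split, which is why this looks more like data parallelism with sharded state than like MP or PP.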

b02202050 Jul 17, 2024

I noticed this page: https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed#on-differences-in-data-precision-handling. It says that both FSDP and DeepSpeed split a tensor across multiple GPUs, so the way they split the parameters may not be the main difference between them.

SeanSong-amd Aug 6, 2024

This is somewhat misleading. The different colors represent optimizer states, gradients, and parameters. In DeepSpeed, these components are also split vertically, similar to FSDP. https://huggingface.co/docs/accelerate/v0.11.0/en/deepspeed
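As a rough illustration of what partitioning those three components buys, here is a small sketch of the per-GPU memory accounting as I read it from the ZeRO paper. It assumes mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for optimizer state) and ignores activations and buffers. The function name `zero_bytes_per_gpu` is made up for this example.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state bytes per GPU for ZeRO stages 0 to 3.

    Assumes mixed-precision Adam: 2 B/param weights, 2 B/param gradients,
    12 B/param optimizer state. Activations are not counted.
    """
    params = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params
    if stage >= 1:  # stage 1 partitions optimizer states
        optim /= num_gpus
    if stage >= 2:  # stage 2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:  # stage 3 also partitions parameters
        params /= num_gpus
    return params + grads + optim


# Example sizes: 7.5B parameters on 64 GPUs.
for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions only stage 3 makes the parameter term shrink with the number of GPUs, which is the part that matches FSDP's full sharding.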
