Is DeepSpeed ZeRO stage 3 MP or PP? Same as FSDP? #3476
andrasiani
started this conversation in
General
I have this question too, and I found this page, which might help.
This DeepSpeed visualization splits the model horizontally, with the first n layers on gpu0, the next n on gpu1, and so on: https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/ Other sources, and FSDP, indicate that the model is split vertically: for a layer L with weights w0, w1, w2, ..., w0 goes on gpu0, w1 on gpu1, etc. This is FSDP. Is the DeepSpeed visualization accurate?
Hi, I've been studying FSDP and DeepSpeed.
From the PyTorch FSDP docs I understand that the model layers are split vertically, which is MP: if layer L1 has weights a0 and a1, a0 goes to gpu0 and a1 to gpu1.
Some sources claim DeepSpeed ZeRO is the same, but the visualization video in your documentation looks like it splits the network horizontally: the first N layers on gpu0, the second N layers on gpu1, etc. Is this pipeline parallelism (PP)?
I am confused.
When using ZeRO-3, what am I actually doing: MP or PP? Is it different from FSDP?
Thanks!
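To make the two layouts in the question concrete, here is a minimal sketch in plain Python. It is only an illustration: the toy model, `split_by_layer`, and `shard_within_layer` are hypothetical names invented here, strings stand in for real weight tensors, and nothing in it calls DeepSpeed or PyTorch.

```python
# Toy model: 4 layers with 4 weights each; strings stand in for real tensors.
NUM_GPUS = 2
layers = {f"L{i}": [f"L{i}.w{j}" for j in range(4)] for i in range(4)}


def split_by_layer(layers, num_gpus):
    """'Horizontal' layout: each GPU owns whole consecutive layers."""
    names = list(layers)
    per_gpu = len(names) // num_gpus  # assumes the layer count divides evenly
    return {
        gpu: {name: layers[name] for name in names[gpu * per_gpu:(gpu + 1) * per_gpu]}
        for gpu in range(num_gpus)
    }


def shard_within_layer(layers, num_gpus):
    """'Vertical' layout: each GPU owns a slice of every layer's weights."""
    placement = {}
    for gpu in range(num_gpus):
        placement[gpu] = {}
        for name, weights in layers.items():
            chunk = len(weights) // num_gpus  # assumes an even split
            placement[gpu][name] = weights[gpu * chunk:(gpu + 1) * chunk]
    return placement


by_layer = split_by_layer(layers, NUM_GPUS)
sharded = shard_within_layer(layers, NUM_GPUS)

for gpu in range(NUM_GPUS):
    print(f"gpu{gpu} layer-wise:", by_layer[gpu])
    print(f"gpu{gpu} sharded:   ", sharded[gpu])
```

In the first layout each GPU holds complete layers, so activations have to be passed from one GPU to the next, which is the pipeline-parallel picture. In the second, every GPU holds a piece of every layer. As far as the ZeRO and FSDP documentation describe it, stage 3 and FSDP both store parameters in this sharded way, but each rank gathers a layer's full weights just before using them and runs the whole model on its own batch, so the scheme is usually described as sharded data parallelism rather than MP or PP.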