Tensor / Model parallelism with deepspeed #6591
Replies: 1 comment
-
Thanks for raising this question. All ZeRO stages are forms of data parallelism. At the beginning, the model is sharded almost evenly across all GPUs in use (say, with two GPUs, GPU0 holds the first half of every layer's weights and GPU1 holds the second half). If ZeRO is used correctly, duplication only happens while a layer (or a group of layers) is being computed on the GPU, in either the forward or backward pass. Once that computation is done, the weights that do not belong to a particular GPU are removed again (e.g. GPU0 discards the second half of that layer's/layer group's weights). An illustration can be found in figure 2 of this blog. Also, out of curiosity, how did you determine that the model is duplicated rather than partitioned across GPUs? Which metric are you monitoring?
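A quick way to check whether ZeRO-3 is actually partitioning the parameters is to compare the locally materialized parameter elements on each rank with the full model size. The sketch below is a minimal, hedged example: it assumes `model_engine` is the engine returned by deepspeed.initialize with stage 3 enabled, and it relies on the DeepSpeed-internal attribute `ds_numel` on partitioned parameters, which may differ between versions.

# Minimal sketch: how much of the model is actually held on this rank?
# Assumes `model_engine` was returned by deepspeed.initialize(...) with ZeRO stage 3.
# `ds_numel` is a DeepSpeed-internal attribute and may change between releases.
import torch
import torch.distributed as dist

def report_partitioning(model_engine):
    # Under ZeRO-3, parameters outside the current compute window are replaced by
    # empty tensors, so numel() is (near) zero locally while ds_numel keeps the full size.
    local_elems = sum(p.numel() for p in model_engine.module.parameters())
    full_elems = sum(getattr(p, "ds_numel", p.numel())
                     for p in model_engine.module.parameters())
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"rank {rank}: holds {local_elems:,} of {full_elems:,} parameter elements")
    print(f"rank {rank}: CUDA allocated {torch.cuda.memory_allocated() / 2**30:.2f} GiB")

If every rank reports the full element count (and roughly the same allocated memory as a single-GPU run), the parameters are not being partitioned, which usually points at the launch or initialization rather than at the ZeRO settings themselves.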
-
Hi, I am trying to run experiments on Phi-3 with DeepSpeed.
I have tested all three ZeRO stages. One thing I have noticed is that in none of the stages does the model get divided across multiple GPUs; instead, it seems like the model gets duplicated on every GPU. Can anybody tell me what the issue is?
DeepSpeed config:
{
    "train_micro_batch_size_per_gpu": batch_size,
    "bf16": {
        "enabled": True,
    },
    "optimizer": {
        "type": "Adam",
        "params": {"lr": learning_rate},
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "allgather_partitions": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 5e8,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "round_robin_gradients": True,
    },
    "comms_logger": {
        "enabled": False,
    },
    "steps_per_print": 1e10,
    "logging_level": "WARNING",
}
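For reference, here is a minimal sketch of how such a config is typically handed to deepspeed.initialize; the checkpoint name and launch command are placeholders and not taken from my actual setup:

# Minimal sketch: pass the config above to deepspeed.initialize and launch with the
# DeepSpeed launcher so that one process is started per GPU.
# "microsoft/Phi-3-mini-4k-instruct" and train.py are placeholder names.
import deepspeed
import torch
from transformers import AutoModelForCausalLM

ds_config = {
    # ... the dict shown above, with batch_size / learning_rate filled in ...
}

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,   # matches the bf16 section of the config
)

# With ZeRO stage 3, deepspeed.initialize partitions the parameters across the
# ranks started by the launcher, e.g.:  deepspeed --num_gpus 2 train.py
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

(If the script is instead launched once per GPU with plain python rather than the deepspeed/torchrun launcher, each process sees a world size of 1, so there is nothing to partition against and every GPU ends up with a full copy of the model, which would look like the duplication I am seeing.)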
Thanks for the help in advance. 😄