Tensor / Model parallelism with deepspeed #6591
Replies: 1 comment
-
Thanks for raising this question. All ZeRO stages are forms of data parallelism. At the beginning, the model is sharded almost evenly across all GPUs in use (say, with two GPUs, GPU0 holds the first half of every layer's weights and GPU1 holds the second half). If ZeRO is used correctly, duplication only happens while a layer (or a group of layers) is being computed on the GPU, in either the forward or backward pass. Once that computation is done, the weights that do not belong to a particular GPU are removed again (e.g. GPU0 discards the second half of that layer's/layer group's weights). An illustration can be found in figure 2 of this blog. Also, out of curiosity, how did you determine that the model is duplicated rather than partitioned across GPUs? Which metric are you monitoring?
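A quick way to check whether ZeRO-3 is actually partitioning the parameters is to compare the locally materialized parameter elements on each rank with the full model size. The sketch below is a minimal, hedged example: it assumes `model_engine` is the engine returned by deepspeed.initialize with stage 3 enabled, and it relies on the DeepSpeed-internal attribute `ds_numel` on partitioned parameters, which may differ between versions.

# Minimal sketch: how much of the model is actually held on this rank?
# Assumes `model_engine` was returned by deepspeed.initialize(...) with ZeRO stage 3.
# `ds_numel` is a DeepSpeed-internal attribute and may change between releases.
import torch
import torch.distributed as dist

def report_partitioning(model_engine):
    # Under ZeRO-3, parameters outside the current compute window are replaced by
    # empty tensors, so numel() is (near) zero locally while ds_numel keeps the full size.
    local_elems = sum(p.numel() for p in model_engine.module.parameters())
    full_elems = sum(getattr(p, "ds_numel", p.numel())
                     for p in model_engine.module.parameters())
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"rank {rank}: holds {local_elems:,} of {full_elems:,} parameter elements")
    print(f"rank {rank}: CUDA allocated {torch.cuda.memory_allocated() / 2**30:.2f} GiB")

If every rank reports the full element count (and roughly the same allocated memory as a single-GPU run), the parameters are not being partitioned, which usually points at the launch or initialization rather than at the ZeRO settings themselves.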
-
Hi, I am trying to run experiments on Phi-3 with DeepSpeed.
I have tested all three ZeRO stages. One thing I have noticed is that in none of the stages does the model get divided across multiple GPUs; instead, it seems like the model gets duplicated on every GPU. Can anybody tell me what the issue is?
DeepSpeed config:
{
    "train_micro_batch_size_per_gpu": batch_size,
    "bf16": {
        "enabled": True,
    },
    "optimizer": {
        "type": "Adam",
        "params": {"lr": learning_rate},
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "allgather_partitions": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 5e8,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "round_robin_gradients": True,
    },
    "comms_logger": {
        "enabled": False,
    },
    "steps_per_print": 1e10,
    "logging_level": "WARNING",
}
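For reference, here is a minimal sketch of how such a config is typically handed to deepspeed.initialize; the checkpoint name and launch command are placeholders and not taken from my actual setup:

# Minimal sketch: pass the config above to deepspeed.initialize and launch with the
# DeepSpeed launcher so that one process is started per GPU.
# "microsoft/Phi-3-mini-4k-instruct" and train.py are placeholder names.
import deepspeed
import torch
from transformers import AutoModelForCausalLM

ds_config = {
    # ... the dict shown above, with batch_size / learning_rate filled in ...
}

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,   # matches the bf16 section of the config
)

# With ZeRO stage 3, deepspeed.initialize partitions the parameters across the
# ranks started by the launcher, e.g.:  deepspeed --num_gpus 2 train.py
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

(If the script is instead launched once per GPU with plain python rather than the deepspeed/torchrun launcher, each process sees a world size of 1, so there is nothing to partition against and every GPU ends up with a full copy of the model, which would look like the duplication I am seeing.)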
Thanks for the help in advance. 😄