Describe the bug
When running a simple model that includes torch.nn.LayerNorm using DeepSpeed ZeRO-3 with torch.compile and compiled_autograd, an error occurs:
site-packages/torch/_subclasses/fake_tensor.py:2017] RuntimeError: Attempting to broadcast a dimension of length 0 at -1! Mismatching argument at index 1 had torch.Size([0]); but expected shape should be broadcastable to [100, 120]
We first found this error in a BERT model with DeepSpeed ZeRO-3 with torch.compile and compiled_autograd.
It is OK for DeepSpeed ZeRO-1/2 with torch.compile and compiled_autograd.
It is OK for DeepSpeed ZeRO-3 with torch.compile and without compiled_autograd.
There are a lot of graph breaks and recompiles in DeepSpeed ZeRO-3 with torch.compile.
To simplify the issue, I made a small reproducer that isolates the failing op (torch.nn.LayerNorm); a sketch is shown below.
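For context, here is a rough sketch of what such a reproducer might look like. This is reconstructed from the description in this issue and is not the original deepspeed_reproducer_cpu.py; the model shapes, the DeepSpeed config values, and the way compiled autograd is enabled are assumptions.

```python
# Hypothetical reconstruction of deepspeed_reproducer_cpu.py (not the original file).
import torch
import torch.nn as nn
import deepspeed


class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(120, 120)
        # The op that triggers the error; elementwise_affine=False avoids it.
        self.ln = nn.LayerNorm(120, eps=1e-12, elementwise_affine=True)

    def forward(self, x):
        return self.ln(self.linear(x)).sum()


ds_config = {
    "train_micro_batch_size_per_gpu": 100,
    "zero_optimization": {"stage": 3},  # ZeRO-3: parameters are partitioned
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model = TinyModel()
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# Enable compiled autograd so the backward graph is traced/compiled as well.
torch._dynamo.config.compiled_autograd = True
compiled_engine = torch.compile(engine)

x = torch.randn(100, 120)
loss = compiled_engine(x)
engine.backward(loss)  # RuntimeError: Attempting to broadcast a dimension of length 0 ...
engine.step()
```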
Expected behavior
The model runs with DeepSpeed ZeRO-3 without error.
Investigation
The error: "RuntimeError: Attempting to broadcast a dimension of length 0 at -1! Mismatching argument at index 1 had torch.Size([0]); but expected shape should be broadcastable to [128, 128, 1600]"
It occurs when compiled autograd tries to trace the backward graph.
It appears in the LayerNorm backward decomposition, which tries to broadcast weight_cast (torch.Size([0])) to grad_out_cast's shape ([128, 128, 1600]) and fails; see the illustration below.
If the LayerNorm weight is bypassed by setting nn.LayerNorm(120, eps=1e-12, elementwise_affine=False) instead of elementwise_affine=True in deepspeed_reproducer_cpu.py, the run completes without error.
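To make the mismatch concrete, here is a minimal illustration (my own sketch, not the actual decomposition code) of why broadcasting a 0-element weight against grad_output fails; the shapes are taken from the error message above.

```python
import torch

grad_out = torch.randn(128, 128, 1600)  # grad_output shape from the error above
weight = torch.empty(0)                 # ZeRO-3 leaves a 0-element placeholder for partitioned params

# The LayerNorm backward decomposition effectively broadcasts the (cast) weight
# against grad_output; with a 0-element weight the shapes are not broadcastable:
torch.broadcast_shapes(weight.shape, grad_out.shape)  # raises RuntimeError
```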
System info:
OS: Ubuntu 22.04
No GPU (it's device-independent, so we use CPU to reproduce)
Python version 3.10.12
PyTorch version 2.5.1
DeepSpeed version 0.15.3
To Reproduce
Steps to reproduce the behavior:
1. Set the environment variable for more verbose logs: TORCH_LOGS="+dynamo,graph,graph_code,graph_breaks,recompiles,aot_graphs,aot_joint_graph,compiled_autograd_verbose"
2. Run: deepspeed --num_nodes 1 --num_gpus 1 deepspeed_reproducer_cpu.py
Hi @tohtana, I have tried setting stage3_param_persistence_threshold to zero, but it seems it doesn't help. The error still occurs.
I also opened an issue in pytorch.
@tohtana I wonder why I should try setting stage3_param_persistence_threshold to zero. As I understand it, setting stage3_param_persistence_threshold > param.size lets the param persist, so shouldn't stage3_param_persistence_threshold be large enough instead?
As I understand it, self.persistent_parameters in DeepSpeedZeroOptimizer_Stage3 are all-gathered when the step() function executes. But self.persistent_parameters are still partitioned at init and still go through the pre-fwd/bwd all-gather and post-fwd/bwd partition, right? If I am describing this incorrectly, please point it out.
If my understanding above is correct, then stage3_param_persistence_threshold won't help, because the error occurs when compiled autograd tries to trace the backward graph at the first iteration. At that point the params are still partitioned, since compiled autograd does not execute _pre_backward_module_hook to all-gather them.
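For reference, this is where the knob under discussion sits in the ZeRO-3 config (the threshold value below is only illustrative, not from the reproducer):

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Per the DeepSpeed docs, parameters with fewer elements than this
        # threshold are not partitioned (kept "persistent").
        "stage3_param_persistence_threshold": 1_000_000,
    },
}
```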