Describe the bug
When running a simple model that includes torch.nn.LayerNorm using DeepSpeed ZeRO-3 with torch.compile and compiled_autograd, an error occurs:
site-packages/torch/_subclasses/fake_tensor.py:2017] RuntimeError: Attempting to broadcast a dimension of length 0 at -1! Mismatching argument at index 1 had torch.Size([0]); but expected shape should be broadcastable to [100, 120]
We first found this error in a BERT model with DeepSpeed ZeRO-3 with torch.compile and compiled_autograd.
It is OK for DeepSpeed ZeRO-1/2 with torch.compile and compiled_autograd.
It is OK for DeepSpeed ZeRO-3 with torch.compile and without compiled_autograd.
There are a lot of graph breaks and recompiles in DeepSpeed ZeRO-3 with torch.compile.
To simplify the issue, I made a small reproducer that isolates the failing op (torch.nn.LayerNorm); a sketch is shown below.
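For context, here is a rough sketch of what such a reproducer might look like. This is reconstructed from the description in this issue and is not the original deepspeed_reproducer_cpu.py; the model shapes, the DeepSpeed config values, and the way compiled autograd is enabled are assumptions.

```python
# Hypothetical reconstruction of deepspeed_reproducer_cpu.py (not the original file).
import torch
import torch.nn as nn
import deepspeed


class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(120, 120)
        # The op that triggers the error; elementwise_affine=False avoids it.
        self.ln = nn.LayerNorm(120, eps=1e-12, elementwise_affine=True)

    def forward(self, x):
        return self.ln(self.linear(x)).sum()


ds_config = {
    "train_micro_batch_size_per_gpu": 100,
    "zero_optimization": {"stage": 3},  # ZeRO-3: parameters are partitioned
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model = TinyModel()
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# Enable compiled autograd so the backward graph is traced/compiled as well.
torch._dynamo.config.compiled_autograd = True
compiled_engine = torch.compile(engine)

x = torch.randn(100, 120)
loss = compiled_engine(x)
engine.backward(loss)  # RuntimeError: Attempting to broadcast a dimension of length 0 ...
engine.step()
```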
Expected behavior
The model runs with DeepSpeed ZeRO-3 without error.
Investigation
The error: "RuntimeError: Attempting to broadcast a dimension of length 0 at -1! Mismatching argument at index 1 had torch.Size([0]); but expected shape should be broadcastable to [128, 128, 1600]"
It occurs when compiled autograd tries to trace the backward graph.
It appears in the LayerNorm backward decomposition, which tries to broadcast weight_cast (torch.Size([0])) to grad_out_cast's shape ([128, 128, 1600]) and fails; see the illustration below.
If the LayerNorm weight is bypassed by setting nn.LayerNorm(120, eps=1e-12, elementwise_affine=False) instead of elementwise_affine=True in deepspeed_reproducer_cpu.py, the run completes without error.
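To make the mismatch concrete, here is a minimal illustration (my own sketch, not the actual decomposition code) of why broadcasting a 0-element weight against grad_output fails; the shapes are taken from the error message above.

```python
import torch

grad_out = torch.randn(128, 128, 1600)  # grad_output shape from the error above
weight = torch.empty(0)                 # ZeRO-3 leaves a 0-element placeholder for partitioned params

# The LayerNorm backward decomposition effectively broadcasts the (cast) weight
# against grad_output; with a 0-element weight the shapes are not broadcastable:
torch.broadcast_shapes(weight.shape, grad_out.shape)  # raises RuntimeError
```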
System info:
OS: Ubuntu 22.04
No GPU (it's device-independent, so we use CPU to reproduce)
Python version 3.10.12
PyTorch version 2.5.1
DeepSpeed version 0.15.3
To Reproduce
Steps to reproduce the behavior:
1. Set the environment variable for more verbose logs: TORCH_LOGS="+dynamo,graph,graph_code,graph_breaks,recompiles,aot_graphs,aot_joint_graph,compiled_autograd_verbose"
2. Run: deepspeed --num_nodes 1 --num_gpus 1 deepspeed_reproducer_cpu.py
Hi @tohtana, I have tried setting stage3_param_persistence_threshold to zero, but it seems it doesn't help. The error still occurs.
I also opened an issue in pytorch.
@tohtana I wonder why I should try setting stage3_param_persistence_threshold to zero. As I understand it, setting stage3_param_persistence_threshold > param.size lets the param persist, so shouldn't stage3_param_persistence_threshold be large enough instead?
As I understand it, self.persistent_parameters in DeepSpeedZeroOptimizer_Stage3 are all-gathered when the step() function executes. But self.persistent_parameters are still partitioned at init and still go through the pre-fwd/bwd all-gather and post-fwd/bwd partition, right? If I am describing this incorrectly, please point it out.
If my understanding above is correct, then stage3_param_persistence_threshold won't help, because the error occurs when compiled autograd tries to trace the backward graph at the first iteration. At that point the params are still partitioned, since compiled autograd does not execute _pre_backward_module_hook to all-gather them.
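For reference, this is where the knob under discussion sits in the ZeRO-3 config (the threshold value below is only illustrative, not from the reproducer):

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Per the DeepSpeed docs, parameters with fewer elements than this
        # threshold are not partitioned (kept "persistent").
        "stage3_param_persistence_threshold": 1_000_000,
    },
}
```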