[BUG] [Fix-Suggested] KeyError in stage_1_and_2.py Due to Optimizer-Model Parameter Mismatch #6770

traincheck-team opened this issue Nov 20, 2024

Describe the bug

Related to #3718.

A KeyError is thrown inside `deepspeed.initialize`, at `runtime/zero/stage_1_and_2.py`, line 574, in `_create_param_mapping`, due to inconsistent usage of model parameters and parameters managed by the optimizer.

Full Traceback:
```log
Traceback (most recent call last):
  File "/home/yuxuan/gitrepos/machine-learning-issues/DS-3718/bug.py", line 53, in 
    expose_bug()
  File "/home/yuxuan/gitrepos/machine-learning-issues/DS-3718/bug.py", line 50, in expose_bug
    model_engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/__init__.py", line 193, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 313, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1302, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1560, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 553, in __init__
    self._param_slice_mappings = self._create_param_mapping()
  File "/home/yuxuan/miniconda3/envs/ml-daikon/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 574, in _create_param_mapping
    lp_name = self.param_names[lp]
KeyError: tensor([-0.1179,  0.0190, -0.2361, -0.0330, -0.1097,  0.0685, -0.2788, -0.0424,
         0.1805, -0.1327,  0.2295, -0.1572,  0.2908, -0.2656, -0.2262,  0.2296,
        -0.0720, -0.1561, -0.0781, -0.2091,  0.1213,  0.2913,  0.0560, -0.1072,
         0.0990, -0.2922, -0.1741, -0.0852,  0.1493,  0.0510, -0.3086,  0.0510,
         0.0163,  0.1622, -0.1566,  0.2947, -0.1843,  0.2590,  0.1528, -0.0735,
        -0.1443, -0.1165, -0.3152,  0.1250, -0.0102,  0.1953,  0.2947,  0.2566,
         0.2942,  0.1985, -0.2688, -0.1689,  0.2786, -0.0039,  0.0989,  0.2617,
         0.0778, -0.1381,  0.0307, -0.1530, -0.0360,  0.1461, -0.0746,  0.1142,
        -0.2235, -0.2544, -0.1772, -0.1136, -0.0287,  0.3057,  0.1761,  0.0782,
         0.1699,  0.0997, -0.1385, -0.0923, -0.1219, -0.2313, -0.0925,  0.1703,
         0.0032,  0.3071, -0.0467,  0.0065, -0.2251,  0.2949, -0.2507,  0.2847,
        -0.0638, -0.1945,  0.1630,  0.2463, -0.0249, -0.0586, -0.1923, -0.0291,
         0.0769,  0.2839,  0.0655, -0.3152,  0.2118, -0.0918,  0.0764,  0.2585,
        -0.0240,  0.1981, -0.1708, -0.2991,  0.1741,  0.0652,  0.2600, -0.2397,
         0.0431,  0.0839,  0.1979,  0.0051, -0.2389,  0.0657,  0.1400,  0.3115,
         0.2091, -0.2200,  0.2610, -0.0994, -0.2996, -0.2710, -0.0466,  0.0237,
        -0.0053,  0.2881,  0.0077,  0.1194,  0.0026,  0.1493,  0.2361,  0.2915,
         0.1206,  0.2198, -0.2668, -0.2032, -0.2406,  0.2112,  0.1519,  0.1636,
        -0.1826, -0.2490,  0.2637,  0.2380, -0.2703, -0.2249,  0.0025,  0.1674,
        -0.2751,  0.0442, -0.1255,  0.0972, -0.2766,  0.0444,  0.0058,  0.0765,
         0.2798,  0.0579,  0.2499, -0.2095,  0.0173,  0.0000], device='cuda:0',
       requires_grad=True)
[2024-11-20 13:00:07,916] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3765479
```

Suspected Root Cause

```python
deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)
```

This issue can be triggered whenever, in the arguments to `deepspeed.initialize`, the parameters in `optimizer.param_groups` are not a subset of `model.parameters()`.

At the fault location, the code looks up parameter names stored in `self.param_names` using the tensors in `self.bit16_groups`.

(screenshot: location of the bug in `_create_param_mapping`)

`self.bit16_groups` is populated from `optimizer.param_groups`:

(screenshot: `bit16_groups` params come from the optimizer)

while `self.param_names` is populated from the model itself:

(screenshot: `param_names` is sourced from the model)

Thus, if the optimizer's parameters are not a subset of the model's parameters, a KeyError is thrown.
This situation is quite common, because techniques such as parameter grouping and ZeRO optimization can replace or fuse the tensors held by the optimizer.
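To make the divergence concrete, here is a minimal standalone sketch (not DeepSpeed's actual code) of how a name map built from the model and a parameter list taken from the optimizer can disagree, reproducing the same failing lookup pattern as `lp_name = self.param_names[lp]`:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Analogue of self.param_names: keyed by the model's own parameter tensors.
param_names = {param: name for name, param in model.named_parameters()}

# Analogue of self.bit16_groups: taken from the optimizer's param_groups.
# Here the optimizer holds a fused copy of the model's parameters instead of
# the original tensors, similar to what a previous ZeRO initialization leaves behind.
fused = nn.Parameter(torch.cat([p.detach().flatten() for p in model.parameters()]))
optimizer = torch.optim.Adam([fused], lr=1e-3)

for group in optimizer.param_groups:
    for lp in group["params"]:
        lp_name = param_names[lp]  # KeyError: `fused` is not a model parameter
```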

To Reproduce

We prepared a simple script to reproduce this error. In it, `deepspeed.initialize` is accidentally called twice. The first call consolidates `optimizer.param_groups` into a single fused parameter, which then causes the KeyError in the second call.

  1. Install DeepSpeed 0.15.4.

  2. Run `bug.py` using `deepspeed --num_gpus=1 bug.py`:

```python
# bug.py
import torch
import deepspeed
import torch.nn as nn

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 5)
    def forward(self, x):
        x = self.fc1(x)
        return self.fc2(x)

# Main function to expose the bug
def expose_bug():
    # Initialize model
    model = SimpleModel()
    # Initialize DeepSpeed configurations for fp16
    ds_config_fp16 = {
        "train_micro_batch_size_per_gpu": 1,
        "fp16": {"enabled": True,},
        "zero_optimization": {"stage": 2}
    }

    optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
    # the optimizer holds 4 parameter tensors at this point
    print(optimizer.param_groups)
    # Initialize DeepSpeed engine
    model_engine, optim, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)
    # after initialize, the optimizer holds a single fused parameter tensor
    print(optimizer.param_groups)
    # EXCEPTION: the second call raises the KeyError
    model_engine, optim, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config_fp16)

if __name__ == "__main__":
    expose_bug()
```

  3. Notice that the second `deepspeed.initialize` throws the KeyError exception.

  4. Also notice that the first print of `optimizer.param_groups` shows 4 params, while the second print shows only one param (that single param is the merge of the original 4).

Prior to `deepspeed.initialize`:

(screenshot: params not merged)

After `deepspeed.initialize`:

(screenshot: merged params)

Since the merged param passed to the second `deepspeed.initialize` does not exist in the model, a KeyError is thrown.
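A quick way to confirm the mismatch before the second call (a diagnostic snippet of ours, not part of the script above) is to check whether every tensor in `optimizer.param_groups` is still one of the model's parameters:

```python
# Diagnostic: count optimizer tensors that are no longer model parameters.
model_param_ids = {id(p) for p in model.parameters()}
optimizer_params = [p for group in optimizer.param_groups for p in group["params"]]
missing = [p for p in optimizer_params if id(p) not in model_param_ids]
print(f"{len(missing)} optimizer tensor(s) are not model parameters")
# 0 before the first deepspeed.initialize; 1 afterwards (the fused tensor),
# which is exactly the tensor that triggers the KeyError in the second call.
```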

Expected behavior / Suggested Fix

We expect two behaviors from DeepSpeed here:

  1. Forbid `deepspeed.initialize` on models / optimizers that have already been used in another `deepspeed.initialize` call.
  2. Explicitly check that the parameters in `optimizer.param_groups` are a subset of `model.parameters()` and throw a more user-friendly exception or warning (see the sketch below).
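A minimal sketch of what suggested fix 2 could look like; `_validate_optimizer_params` is a hypothetical helper name, and where exactly it would be called inside `deepspeed.initialize` is an open design question:

```python
def _validate_optimizer_params(model, optimizer):
    """Hypothetical check: fail fast if the optimizer references tensors
    that are not parameters of the model being initialized."""
    model_param_ids = {id(p) for p in model.parameters()}
    for group in optimizer.param_groups:
        for p in group["params"]:
            if id(p) not in model_param_ids:
                raise ValueError(
                    "optimizer.param_groups contains a tensor that is not in "
                    "model.parameters(); this typically means the optimizer was "
                    "already consumed by a previous deepspeed.initialize call."
                )
```

The same id-based bookkeeping could also support suggested fix 1, for example by marking an optimizer as consumed once it has been handed to a DeepSpeed engine.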

ds_report output

```
collect2: error: ld returned 1 exit status
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/xxx/python3.10/site-packages/torch']
torch version .................... 2.2.2+cu121
deepspeed install path ........... ['/home/xxx/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.15.4, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.3
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 31.24 GB
```

I will be more than happy to contribute the two suggested fixes; let me know what you think!
