[BUG]: why duplicate PID appears on rank 0 #6111

ericxsun · 2024-11-03T01:52:45Z

Is there an existing issue for this bug?

I have searched the existing issues

🐛 Describe the bug

When using the GeminiPlugin to train a model, training runs normally at the start. However, once a checkpoint (sharded) is saved, a duplicate PID appears on rank 0.

Start:

After saving the checkpoint:

Why does this happen, and how can I avoid it? Thanks a lot.
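To compare the state at the start with the state after saving, one option is to parse the per-process listing from nvidia-smi and flag any PID that holds a compute context on more than one GPU. This is a minimal sketch, assuming nvidia-smi is on PATH. The helper names (parse_compute_apps, pids_on_multiple_gpus, snapshot) are made up for illustration and are not part of ColossalAI.

```python
import subprocess
from collections import defaultdict

# One "pid, gpu_uuid" row per compute process per GPU.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,gpu_uuid",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Map each PID to the set of GPU UUIDs it has a compute context on."""
    gpus_by_pid = defaultdict(set)
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        # Skip blank or malformed rows.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        gpus_by_pid[int(fields[0])].add(fields[1])
    return dict(gpus_by_pid)


def pids_on_multiple_gpus(gpus_by_pid):
    """Keep only the PIDs that show up on more than one GPU."""
    return {
        pid: sorted(gpus)
        for pid, gpus in gpus_by_pid.items()
        if len(gpus) > 1
    }


def snapshot():
    """Run nvidia-smi once and report PIDs present on several GPUs."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return pids_on_multiple_gpus(parse_compute_apps(result.stdout))
```

Calling snapshot() on the node once before and once after booster.save_optimizer(...) should show whether the extra PIDs appear only after the save. An empty dict means every process is attached to a single GPU.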

Environment

Torch: 2.1.2
ColossalAI: 0.4.2
Python: 3.8
CUDA: 12.1.0

ericxsun · 2024-11-03T07:04:20Z

While digging into this, I found that when saving the optimizer, the PIDs from other ranks appear on rank 0.

torch.cuda.empty_cache()
booster.save_optimizer(optimizer, path_optimizer, shard=True, size_per_shard=2048)

ericxsun · 2024-11-04T04:01:49Z

I observed that, following this line:

ColossalAI/colossalai/zero/gemini/gemini_optimizer.py

Line 525 in 2f583c1

    
           compacted_states = self.pack_optimizer_states_to_tensor(param_id, state_names) if own_param else None

the PIDs of other ranks start appearing on rank 0.

Furthermore, after reaching this line:

ColossalAI/colossalai/zero/gemini/gemini_optimizer.py

Line 593 in 2f583c1

    
           compacted_states = torch.zeros(compacted_size, dtype=dtype, device=device, requires_grad=False)

If device is replaced with torch.device(f"cuda:{torch.cuda.current_device()}"), each rank retains only one PID, just as at the start.

compacted_states = torch.zeros(
    compacted_size,
    dtype=dtype,
    device=torch.device(f"cuda:{torch.cuda.current_device()}"),
    requires_grad=False
)

And after reaching this line:

ColossalAI/colossalai/zero/gemini/gemini_optimizer.py

Line 532 in 2f583c1

    
           dist.all_gather_object(gathered_state_shards, [compacted_states, shard_offset, shard_size], group=zero_group)

the PIDs of other ranks still start appearing on each rank.
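One possible explanation, not confirmed in this thread: dist.all_gather_object pickles its input, and the PyTorch documentation warns that using it with GPU tensors is not well supported. A CUDA tensor that is unpickled is restored by default on the device it was saved from, so a receiving rank may end up creating a CUDA context on the sender's GPU. Under that assumption, moving the payload to CPU before the gather should avoid the extra contexts. The helper below is a hypothetical sketch, not part of ColossalAI. It is duck-typed on a .cpu() method (which torch.Tensor provides) so that it does not need torch imported.

```python
def to_cpu_payload(obj):
    """Return a copy of obj in which anything exposing .cpu() is moved to CPU.

    Handles plain lists, tuples, and dicts recursively; every other value
    (ints, None, ...) is passed through unchanged.
    """
    cpu_method = getattr(obj, "cpu", None)
    if callable(cpu_method):
        return cpu_method()
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_cpu_payload(item) for item in obj)
    if isinstance(obj, dict):
        return {key: to_cpu_payload(value) for key, value in obj.items()}
    return obj


# Assumed usage at the call site quoted above (untested against ColossalAI):
#   payload = to_cpu_payload([compacted_states, shard_offset, shard_size])
#   dist.all_gather_object(gathered_state_shards, payload, group=zero_group)
```

With this change the gathered tensors arrive on CPU, so any later code that expects them on the GPU would have to move them back explicitly.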

ericxsun · 2024-11-04T04:02:26Z

Could any ColossalAI-ers help me? Thanks a lot.

ericxsun added the bug label Nov 3, 2024

ericxsun changed the title ~~[BUG]: why the PID duplicate to other gpus~~ [BUG]: why duplicate PID appears on rank 0 Nov 3, 2024

ericxsun mentioned this issue Nov 4, 2024: [BUG]: weird stuck while training #6095 (Open)
