packed_data issue #208

Open
WhyDwelledOnAi opened this issue Dec 13, 2024 · 1 comment

Comments

@WhyDwelledOnAi

When training on packed_data produced by generate_packed_dataset.py, training hangs at metric_logger.synchronize_between_processes() in accessory/engine_pretrain.py and then DDP times out and exits.
When using *.parquet files instead, there is no problem.

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=495104, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1805715 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'

The environment follows the requirement.txt from the documentation exactly.
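
For reference, a watchdog timeout on _ALLGATHER_BASE usually means the ranks stopped issuing collectives in lockstep, for example because the packed shards give each rank a different number of batches, so one rank finishes its epoch (or skips the metric sync) while the others block in the collective until the NCCL watchdog fires. The following standalone sketch is hypothetical, not code from this repo (launch with torchrun --nproc_per_node=2); it only reproduces that hang pattern:

import torch
import torch.distributed as dist

def main():
    # torchrun --nproc_per_node=2 repro.py
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Pretend the packed dataset gave rank 0 one fewer batch than the other ranks.
    num_batches = 4 if rank == 0 else 5

    for _ in range(num_batches):
        # A per-step collective (gradient sync, metric logging, ...).
        t = torch.ones(1, device="cuda")
        # The 5th call on ranks != 0 has no partner on rank 0, so it blocks
        # until the NCCL watchdog kills the job (Timeout(ms)=1800000).
        dist.all_reduce(t)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()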


WhyDwelledOnAi commented Dec 13, 2024

I modified the synchronize_between_processes function of the SmoothedValue class in accessory/util/misc.py:

rank = torch.distributed.get_rank(group=None)
t = torch.tensor([self.count, self.total], dtype=torch.float64, device=f'cuda:{rank}')

This doesn't work.
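
For context, here is a sketch of what the full method typically looks like in DETR-derived misc.py utilities, with the two modified lines substituted in; this is an assumption and has not been checked against the actual accessory/util/misc.py:

import torch
import torch.distributed as dist

class SmoothedValue:
    # Only the fields the sync needs are shown here.
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def synchronize_between_processes(self):
        """Sum count/total across ranks. Warning: does not sync the deque."""
        if not (dist.is_available() and dist.is_initialized()):
            return
        rank = dist.get_rank(group=None)
        # Caveat: the global rank is not always a valid local CUDA device index
        # (multi-node jobs, CUDA_VISIBLE_DEVICES); torch.cuda.current_device()
        # or the LOCAL_RANK environment variable is the safer choice.
        t = torch.tensor([self.count, self.total],
                         dtype=torch.float64, device=f'cuda:{rank}')
        dist.barrier()
        dist.all_reduce(t)
        t = t.tolist()
        self.count = int(t[0])
        self.total = t[1]

Note that if the root cause is a rank never reaching this call at all (e.g. uneven packed shards), changing the device inside the function cannot help; every rank has to enter synchronize_between_processes() the same number of times for the collective to complete.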
