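With a relatively large dataset, the loss decreases early in training and then slowly rises again, even though max_grad_norm is enabled. What should I check to track down where the problem is? The training machines are A800s. Training configuration:
Training code:
Partial logs: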
train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 180431597
})
[2025-01-18 23:24:31,286] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
[WARNING] using untested triton version (3.1.0), only 1.0.0 is known to be compatible
/mnt/LM_disk12/weichenchuang/env/conda_for_hg.3.10py/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, weight, bias=None):
/mnt/LM_disk12/weichenchuang/env/conda_for_hg.3.10py/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
NCCL version 2.21.5+cuda12.4
[2025-01-18 23:24:34,048] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[2025-01-18 23:24:34,719] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
[WARNING] using untested triton version (3.1.0), only 1.0.0 is known to be compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
/mnt/LM_disk12/weichenchuang/env/conda_for_hg.3.10py/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, weight, bias=None):
/mnt/LM_disk12/weichenchuang/env/conda_for_hg.3.10py/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
[2025-01-18 23:24:35,040] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
[WARNING] using untested triton version (3.1.0), only 1.0.0 is known to be compatible
/mnt/LM_disk12/weichenchuang/env/conda_for_hg.3.10py/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, weight, bias=None):
/mnt/LM_disk12/weichenchuang/env/conda_for_hg.3.10py/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
[WARNING] using untested triton version (3.1.0), only 1.0.0 is known to be compatible
/mnt/LM_disk12/weichenchuang/env/conda_for_hg.3.10py/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
def forward(ctx, input, weight, bias=None):
/mnt/LM_disk12/weichenchuang/env/conda_for_hg.3.10py/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
def backward(ctx, grad_output):
[2025-01-18 23:24:36,284] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-18 23:24:36,299] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
Loss curve:
Log entries where grad_norm is relatively large:
{'loss': 0.094, 'grad_norm': 3.6441407203674316, 'learning_rate': 0.0004997769340792326, 'epoch': 0.03}
{'loss': 0.1019, 'grad_norm': 3.8415417671203613, 'learning_rate': 0.0004997768399428072, 'epoch': 0.03}
{'loss': 0.0922, 'grad_norm': 3.0973892211914062, 'learning_rate': 0.0004997767457865314, 'epoch': 0.03}
{'loss': 0.0894, 'grad_norm': 2.5256614685058594, 'learning_rate': 0.0004997766516104055, 'epoch': 0.03}
{'loss': 0.0928, 'grad_norm': 2.3791558742523193, 'learning_rate': 0.000499776557414429, 'epoch': 0.03}
{'loss': 0.0949, 'grad_norm': 2.533939838409424, 'learning_rate': 0.0004997764631986026, 'epoch': 0.03}
{'loss': 0.0946, 'grad_norm': 5.6503705978393555, 'learning_rate': 0.0004997763689629258, 'epoch': 0.03}
{'loss': 0.1045, 'grad_norm': 237.627197265625, 'learning_rate': 0.0004997762747073987, 'epoch': 0.03}
{'loss': 0.1246, 'grad_norm': 4.959110260009766, 'learning_rate': 0.0004997761804320215, 'epoch': 0.03}
{'loss': 0.0988, 'grad_norm': 5.715487003326416, 'learning_rate': 0.0004997760861367939, 'epoch': 0.03}
{'loss': 0.1025, 'grad_norm': 3.7426841259002686, 'learning_rate': 0.0004997759918217163, 'epoch': 0.03}
{'loss': 0.0932, 'grad_norm': 3.6973986625671387, 'learning_rate': 0.0004997758974867883, 'epoch': 0.03}
{'loss': 0.0963, 'grad_norm': 8.751940727233887, 'learning_rate': 0.0004997758031320102, 'epoch': 0.03}
{'loss': 0.0924, 'grad_norm': 4.516246795654297, 'learning_rate': 0.0004997757087573818, 'epoch': 0.03}
{'loss': 0.1054, 'grad_norm': 2.862150192260742, 'learning_rate': 0.0004997756143629034, 'epoch': 0.03}
{'loss': 0.0935, 'grad_norm': 3.533675193786621, 'learning_rate': 0.0004997755199485747, 'epoch': 0.03}
{'loss': 0.094, 'grad_norm': 4.6160688400268555, 'learning_rate': 0.0004997754255143959, 'epoch': 0.03}
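One possible first step in narrowing this down is to locate the steps where grad_norm jumps far above its typical value, so the corresponding batches can be inspected. The sketch below parses log entries in the dict format shown above and flags outliers relative to the median. The helper name `find_grad_norm_spikes` and the threshold `factor=10.0` are assumptions chosen for illustration; they are not part of the training code.

```python
import ast
from statistics import median


def find_grad_norm_spikes(log_lines, factor=10.0):
    """Return (index, record) pairs whose grad_norm exceeds factor x the median.

    Hypothetical helper: the name and the default factor are illustrative.
    """
    # Keep only lines that look like the logged dicts and parse them safely.
    records = [
        ast.literal_eval(line.strip())
        for line in log_lines
        if line.strip().startswith("{")
    ]
    if not records:
        return []
    baseline = median(r["grad_norm"] for r in records)
    return [
        (i, r) for i, r in enumerate(records)
        if r["grad_norm"] > factor * baseline
    ]


# Three consecutive entries copied from the log above.
sample = [
    "{'loss': 0.0946, 'grad_norm': 5.6503705978393555, 'learning_rate': 0.0004997763689629258, 'epoch': 0.03}",
    "{'loss': 0.1045, 'grad_norm': 237.627197265625, 'learning_rate': 0.0004997762747073987, 'epoch': 0.03}",
    "{'loss': 0.1246, 'grad_norm': 4.959110260009766, 'learning_rate': 0.0004997761804320215, 'epoch': 0.03}",
]

spikes = find_grad_norm_spikes(sample)
for index, record in spikes:
    print(index, record["loss"], record["grad_norm"])
```

On the three sample entries this flags only the step with grad_norm 237.627197265625. The median is used as the baseline because a single large spike barely moves it, whereas it would inflate a mean.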