DeepSpeed MPI error #5288
Sabiha1225
started this conversation in
General
Replies: 0 comments
ds_config = {
    "fp16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/mnt/nvme",
            "pin_memory": True,
            "ratio": 0.3,
            "buffer_count": 4,
            "fast_init": False
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "tensorboard": {
        "enabled": True,
        "output_path": "output/ds_logs_125/",
        "job_name": "train_bert"
    },
    "wandb": {
        "enabled": True,
        "group": "my_group",
        "team": "sabiha12",
        "project": "deepspeed"
    },
    "csv_monitor": {
        "enabled": True,
        "output_path": "output/ds_logs_125/",
        "job_name": "train_bert"
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False,
    "dump_state": True
}
This is my configuration. When I run an LLM with DeepSpeed, I get the following error:
[2024-03-16 12:09:15,782] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
['labels', 'input_ids', 'attention_mask']
[2024-03-16 12:09:17,968] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-16 12:09:17,968] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
Abort(1090191) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(189)........:
MPID_Init(1561)..............:
MPIDI_OFI_mpi_init_hook(1546):
(unknown)(): Unknown error class
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1090191
:
system msg for write_line failure : Bad file descriptor
Abort(1090191) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(189)........:
MPID_Init(1561)..............:
MPIDI_OFI_mpi_init_hook(1546):
(unknown)(): Unknown error class
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1090191
:
system msg for write_line failure : Bad file descriptor
Segmentation fault
Kindly suggest some solution.
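One thing that may be worth checking, going by the log line "Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...": as far as I understand, DeepSpeed only falls back to MPI discovery when the usual torch.distributed environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) are not all set. The sketch below is a possible single-process workaround, not a confirmed fix for the PMPI_Init_thread abort itself. The helper name set_single_process_env and the address/port defaults are my own assumptions for illustration.

```python
import os

# Environment variables that, as far as I know, DeepSpeed checks for before
# it attempts MPI discovery. Treat this list as an assumption and verify it
# against the DeepSpeed version you have installed.
REQUIRED_ENV = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def set_single_process_env(env=None, master_addr="127.0.0.1", master_port="29500"):
    """Fill in single-process defaults without overwriting existing values.

    `env` defaults to os.environ; any mutable mapping can be passed instead.
    The address and port are placeholder defaults (hypothetical values).
    """
    env = os.environ if env is None else env
    defaults = {
        "RANK": "0",
        "LOCAL_RANK": "0",
        "WORLD_SIZE": "1",
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": master_port,
    }
    for key, value in defaults.items():
        # setdefault keeps whatever a launcher may already have exported.
        env.setdefault(key, value)
    return env


# Intended usage, before any DeepSpeed initialization in the training script:
#   set_single_process_env()
#   import deepspeed
#   deepspeed.init_distributed()
```

The other option would be to start the script through the deepspeed launcher (for example `deepspeed train.py`, where train.py stands in for the actual script name), since the launcher sets these variables itself. If the run is supposed to go through MPI, then the MPI installation and mpi4py would presumably need to be checked separately.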