
LoRA LLAMA-70B finetuning fails on multi GPU. #10069

Closed
sriraman2020 opened this issue Feb 1, 2024 · 16 comments

@sriraman2020

https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/LLM-Finetuning/LoRA

lora_finetune_llama2_7b_pvc_1550_4_card.sh works fine with the 7B model.

Replacing the model with Llama-2-70B (meta-llama/Llama-2-70b-hf) makes it fail.
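For reference, the change amounts to pointing the launch at the 70B checkpoint. A rough sketch only (the mpirun rank count and flags are assumptions pieced together from the command line and rank numbers that appear later in this thread, not the actual script contents):

# hypothetical sketch of the launch; the key change is pointing --base_model at the 70B checkpoint
mpirun -n 8 \
  python -u ./alpaca_qlora_finetuning.py \
    --base_model meta-llama/Llama-2-70b-hf \
    --data_path yahma/alpaca-cleaned \
    --output_dir ./bigdl-qlora-alpaca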

System config

(bigdl_31J) sdp@aia-sdp-pvc-135536:/localdisk/sdp/sudarsh2/rsrirama/BigDL/python/llm/example/GPU/LLM-Finetuning/LoRA$ clinfo | grep "compute"
Max compute units 224
Max compute units 224
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
(bigdl_31J) sdp@aia-sdp-pvc-135536:/localdisk/sdp/sudarsh2/rsrirama/BigDL/python/llm/example/GPU/LLM-Finetuning/LoRA$

Error log below

RuntimeError: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 2019202 RUNNING AT aia-sdp-pvc-135536
= KILLED BY SIGNAL: 9 (Killed)

@sriraman2020
Author

seems to be stuck here for 15 mins
[screenshot of console output]

@plusbang
Contributor

plusbang commented Feb 2, 2024

seems to be stuck here for 15 mins [screenshot of console output]

According to the log ("AMX state allocation in the OS failed!"), it seems you still need to bypass AMX as we discussed in the previous issue.
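For reference, the bypass is the environment variable the reporter sets in the next comment, exported in the launching shell before the run:

# disable AMX for the BigDL-LLM run (variable name taken from the next comment)
export BIGDL_LLM_AMX_DISABLED=1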

@sriraman2020
Author

We are still seeing this error after disabling AMX with export BIGDL_LLM_AMX_DISABLED=1:

Uptime: 461.239374 s
2024:02:01-23:06:50:(3084264) |CCL_ERROR| exchange_utils.cpp:220 recvmsg_fd: condition !check_msg_retval("recvmsg", recv_bytes, iov, msg, sizeof(u.cntr_buf), sock, *fd) failed
errno: No such file or directory
2024:02:01-23:06:50:(3084264) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: exchange_utils.cpp:220 recvmsg_fd: EXCEPTION: errno: No such file or directory
terminate called after throwing an instance of 'ccl::v1::exception'
what(): oneCCL: exchange_utils.cpp:220 recvmsg_fd: EXCEPTION: errno: No such file or directory

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM WARNING: AMX state allocation in the OS failed!

LIBXSMM_TARGET: clx [Intel(R) Xeon(R) Platinum 8480+]
Registry and code: 13 MB
Command: python -u ./alpaca_qlora_finetuning.py --base_model meta-llama/Llama-2-70b-hf --data_path yahma/alpaca-cleaned --output_dir ./bigdl-qlora-alpaca --gradient_checkpointing True --micro_batch_size 8 --batch_size 128 --deepspeed ./deepspeed_zero2.json --saved_low_bit_model ./llama-2-70b-hf-nf4
Uptime: 461.489362 s
Terminated
(bigdl_31J) sdp@aia-sdp-pvc-135536:/localdisk/sdp/sudarsh2/rsrirama/BigDL/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora$

@plusbang
Contributor

plusbang commented Feb 2, 2024

Could you please provide more details about your environment (dependency version list)?
Please make sure you've prepared your environment following the installation instructions at https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora#1-install
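For reference, a minimal environment sketch in the spirit of that README (the exact pip command and wheel index here are assumptions, and the version pins are taken from the pip list below; follow the linked instructions for the authoritative steps):

conda create -n llm python=3.9
conda activate llm
# BigDL-LLM with Intel GPU (XPU) support
pip install --pre --upgrade "bigdl-llm[xpu]" -f https://developer.intel.com/ipex-whl-stable-xpu
# version pins matching the reporter's pip list
pip install transformers==4.34.0 peft==0.5.0 accelerate==0.23.0 datasets fire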

@sriraman2020
Author

/BigDL/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora$ pip list
Package Version


accelerate 0.23.0
aiohttp 3.9.3
aiosignal 1.3.1
annotated-types 0.6.0
async-timeout 4.0.3
attrs 23.2.0
bigdl-core-xe-21 2.5.0b20240201
bigdl-core-xe-esimd-21 2.5.0b20240201
bigdl-llm 2.5.0b20240201
bitsandbytes 0.42.0
certifi 2024.2.2
charset-normalizer 3.3.2
datasets 2.14.7
deepspeed 0.11.2+78c518ed
dill 0.3.7
filelock 3.13.1
fire 0.5.0
frozenlist 1.4.1
fsspec 2023.10.0
hjson 3.1.0
huggingface-hub 0.17.3
idna 3.6
intel-extension-for-deepspeed 0.9.4+ec33277
intel-extension-for-pytorch 2.1.10+xpu
intel-openmp 2024.0.2
Jinja2 3.1.3
MarkupSafe 2.1.4
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.15
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.3
oneccl-bind-pt 2.1.100+xpu
packaging 23.2
pandas 2.2.0
peft 0.5.0
pillow 10.2.0
pip 23.3.1
protobuf 5.26.0rc1
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 15.0.0
pyarrow-hotfix 0.6
pydantic 2.6.0
pydantic_core 2.16.1
python-dateutil 2.8.2
pytz 2024.1
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
safetensors 0.4.2
scipy 1.12.0
sentencepiece 0.1.99
setuptools 68.2.2
six 1.16.0
sympy 1.12
tabulate 0.9.0
termcolor 2.4.0
tokenizers 0.14.1
torch 2.1.0a0+cxx11.abi
torchvision 0.16.0a0+cxx11.abi
tqdm 4.66.1
transformers 4.34.0
typing_extensions 4.9.0
tzdata 2023.4
urllib3 2.2.0
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.4

@sriraman2020
Author

sriraman2020 commented Feb 2, 2024

@plusbang Looks like a oneCCL issue? Let me know if any more information is required.
[screenshot of the error log]

@plusbang
Contributor

plusbang commented Feb 4, 2024

@plusbang Looks like a oneCCL issue? Let me know if any more information is required. [screenshot of the error log]

Yeah, it seems like a oneCCL-related issue. We previously encountered another oneCCL-related bug and solved it with sudo apt install level-zero-dev (https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora#7-troubleshooting). Maybe you could also try it.
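For reference, a quick way to check which Level Zero packages are present before reinstalling (the dpkg query is generic; the only package name taken from this thread is level-zero-dev):

# list installed Level Zero packages
dpkg -l | grep -i level-zero
# (re)install the dev package suggested in the troubleshooting section
sudo apt install level-zero-dev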

@sriraman2020
Author

The driver is already installed and present:
[screenshot: installed Level Zero packages]

@plusbang
Contributor

plusbang commented Feb 6, 2024

The driver is already installed and present: [screenshot: installed Level Zero packages]

Maybe you could try export CCL_LOG_LEVEL=debug to obtain more detailed error messages from oneCCL.
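For reference, both debug switches that show up in the next comment can be exported in the launching shell before re-running:

# verbose oneCCL and torch-ccl logging for the next run
export CCL_LOG_LEVEL=debug
export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1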

@sriraman2020
Author

Below is the log with export CCL_LOG_LEVEL=debug and export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1 set:

[screenshot: CCL_LOG_LEVEL debug output]

@plusbang
Contributor

plusbang commented Feb 6, 2024

Below is the log with export CCL_LOG_LEVEL=debug and export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1 set: [screenshot: CCL_LOG_LEVEL debug output]

According to the log (Too many open files), maybe you could try to raise the system open file limit with ulimit -n 1048576.
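For reference, the limit is per shell session, so it needs to be raised in the same shell (on every node) before launching the training job; a minimal sketch:

# check the current soft limit on open file descriptors
ulimit -n
# raise it for this session, then launch the finetuning script
ulimit -n 1048576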

@sriraman2020
Author

It is running with the above fix (ulimit -n 1048576), but after making the code changes below (the added lines are marked with # added), it hits an error. Error message details and CCL debug logs are attached here.

with torch.autograd.profiler_legacy.profile(enabled=True, use_xpu=True, record_shapes=True) as prof:  # added
    trainer = transformers.Trainer(
        model=model,
        train_dataset=train_data,
        eval_dataset=val_data,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=micro_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            # warmup_ratio=0.03,
            # warmup_steps=100,
            max_grad_norm=0.3,
            # num_train_epochs=num_epochs,
            learning_rate=learning_rate,
            lr_scheduler_type="cosine",
            bf16=True,  # ensure training more stable
            logging_steps=1,
            optim="adamw_torch",
            evaluation_strategy="steps" if val_set_size > 0 else "no",
            save_strategy="steps",
            eval_steps=1 if val_set_size > 0 else None,
            save_steps=1,
            max_steps = 1,
            output_dir=output_dir,
            save_total_limit=1,
            load_best_model_at_end=True if val_set_size > 0 else False,
            ddp_find_unused_parameters=False if ddp else None,
            group_by_length=group_by_length,
            report_to="wandb" if use_wandb else None,
            run_name=wandb_run_name if use_wandb else None,
            gradient_checkpointing=gradient_checkpointing,
            ddp_backend="ccl",
            deepspeed=deepspeed,
            save_safetensors=False,
        ),
        data_collator=transformers.DataCollatorForSeq2Seq(
            tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
        ),
    )
    model.config.use_cache = False

    trainer.train(resume_from_checkpoint=resume_from_checkpoint)

    model.save_pretrained(output_dir)

    print(
        "\n If there's a warning about missing keys above, please disregard :)"
    )
torch.save(prof.table(sort_by="id", row_limit=-1), "./qlora_llama7b_finetuning_profile_id.pt")  # added
torch.save(prof.key_averages(group_by_input_shape=True).table(row_limit=-1), "./qlora_llama7b_finetuning_profile_detail.pt")  # added
prof.export_chrome_trace("./qlora_llama7b_finetuning_trace.json")  # added

if __name__ == "__main__":
    fire.Fire(train)
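As a side note, prof.table() returns a plain string, so the .pt files saved above are just pickled strings; a minimal sketch (file names as in the snippet) for inspecting them afterwards:

import torch

# the profiler tables were pickled with torch.save, so torch.load gives back the original strings
print(torch.load("./qlora_llama7b_finetuning_profile_id.pt"))
print(torch.load("./qlora_llama7b_finetuning_profile_detail.pt"))
# the Chrome trace JSON can be opened directly in chrome://tracing or Perfetto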

@jason-dai
Contributor

It is running with the above fix (ulimit -n 1048576), but after making the code changes below, it hits an error. Error message details and CCL debug logs are attached here.

It seems there are no error messages or logs here?

@sriraman2020
Author

Actually, it's working fine. The error was due to the shared system. We are able to successfully collect performance stats.
Thanks!

@hkvision
Contributor

hkvision commented Feb 9, 2024

Thanks for your response. Since the issue is resolved, we are closing it. Feel free to raise new issues in the future :)

hkvision closed this as completed Feb 9, 2024