
LoRA LLAMA-70B finetuning fails on multi GPU. #10069

Closed
sriraman2020 opened this issue Feb 1, 2024 · 16 comments

@sriraman2020

https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/LLM-Finetuning/LoRA

lora_finetune_llama2_7b_pvc_1550_4_card.sh works fine with the 7B model.

Replacing the model with Llama-2-70B (meta-llama/Llama-2-70b-hf) makes it fail.
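For reference, the change amounts to pointing the launch at the 70B checkpoint. A rough sketch only (the mpirun rank count and flags are assumptions pieced together from the command line and rank numbers that appear later in this thread, not the actual script contents):

# hypothetical sketch of the launch; the key change is pointing --base_model at the 70B checkpoint
mpirun -n 8 \
  python -u ./alpaca_qlora_finetuning.py \
    --base_model meta-llama/Llama-2-70b-hf \
    --data_path yahma/alpaca-cleaned \
    --output_dir ./bigdl-qlora-alpaca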

System config

(bigdl_31J) sdp@aia-sdp-pvc-135536:/localdisk/sdp/sudarsh2/rsrirama/BigDL/python/llm/example/GPU/LLM-Finetuning/LoRA$ clinfo | grep "compute"
Max compute units 224
Max compute units 224
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
(bigdl_31J) sdp@aia-sdp-pvc-135536:/localdisk/sdp/sudarsh2/rsrirama/BigDL/python/llm/example/GPU/LLM-Finetuning/LoRA$

Error log below

RuntimeError: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 2019202 RUNNING AT aia-sdp-pvc-135536
= KILLED BY SIGNAL: 9 (Killed)

@sriraman2020
Author

seems to be stuck here for 15 mins
[screenshot of console output]

@plusbang
Contributor

plusbang commented Feb 2, 2024

seems to be stuck here for 15 mins [screenshot of console output]

According to the log ("AMX state allocation in the OS failed!"), it seems you still need to bypass AMX as we discussed in the previous issue.
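For reference, the bypass is the environment variable the reporter sets in the next comment, exported in the launching shell before the run:

# disable AMX for the BigDL-LLM run (variable name taken from the next comment)
export BIGDL_LLM_AMX_DISABLED=1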

@sriraman2020
Author

We are still seeing this error after disabling AMX with export BIGDL_LLM_AMX_DISABLED=1:

Uptime: 461.239374 s
2024:02:01-23:06:50:(3084264) |CCL_ERROR| exchange_utils.cpp:220 recvmsg_fd: condition !check_msg_retval("recvmsg", recv_bytes, iov, msg, sizeof(u.cntr_buf), sock, *fd) failed
errno: No such file or directory
2024:02:01-23:06:50:(3084264) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: exchange_utils.cpp:220 recvmsg_fd: EXCEPTION: errno: No such file or directory
terminate called after throwing an instance of 'ccl::v1::exception'
what(): oneCCL: exchange_utils.cpp:220 recvmsg_fd: EXCEPTION: errno: No such file or directory

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM WARNING: AMX state allocation in the OS failed!

LIBXSMM_TARGET: clx [Intel(R) Xeon(R) Platinum 8480+]
Registry and code: 13 MB
Command: python -u ./alpaca_qlora_finetuning.py --base_model meta-llama/Llama-2-70b-hf --data_path yahma/alpaca-cleaned --output_dir ./bigdl-qlora-alpaca --gradient_checkpointing True --micro_batch_size 8 --batch_size 128 --deepspeed ./deepspeed_zero2.json --saved_low_bit_model ./llama-2-70b-hf-nf4
Uptime: 461.489362 s
Terminated
(bigdl_31J) sdp@aia-sdp-pvc-135536:/localdisk/sdp/sudarsh2/rsrirama/BigDL/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora$

@plusbang
Contributor

plusbang commented Feb 2, 2024

Could you please provide more details about your environment (dependency version list)?
Please make sure you've prepared your environment following the installation instructions at https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora#1-install
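For reference, a minimal environment sketch in the spirit of that README (the exact pip command and wheel index here are assumptions, and the version pins are taken from the pip list below; follow the linked instructions for the authoritative steps):

conda create -n llm python=3.9
conda activate llm
# BigDL-LLM with Intel GPU (XPU) support
pip install --pre --upgrade "bigdl-llm[xpu]" -f https://developer.intel.com/ipex-whl-stable-xpu
# version pins matching the reporter's pip list
pip install transformers==4.34.0 peft==0.5.0 accelerate==0.23.0 datasets fire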

@sriraman2020
Author

/BigDL/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora$ pip list
Package Version


accelerate 0.23.0
aiohttp 3.9.3
aiosignal 1.3.1
annotated-types 0.6.0
async-timeout 4.0.3
attrs 23.2.0
bigdl-core-xe-21 2.5.0b20240201
bigdl-core-xe-esimd-21 2.5.0b20240201
bigdl-llm 2.5.0b20240201
bitsandbytes 0.42.0
certifi 2024.2.2
charset-normalizer 3.3.2
datasets 2.14.7
deepspeed 0.11.2+78c518ed
dill 0.3.7
filelock 3.13.1
fire 0.5.0
frozenlist 1.4.1
fsspec 2023.10.0
hjson 3.1.0
huggingface-hub 0.17.3
idna 3.6
intel-extension-for-deepspeed 0.9.4+ec33277
intel-extension-for-pytorch 2.1.10+xpu
intel-openmp 2024.0.2
Jinja2 3.1.3
MarkupSafe 2.1.4
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.15
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.3
oneccl-bind-pt 2.1.100+xpu
packaging 23.2
pandas 2.2.0
peft 0.5.0
pillow 10.2.0
pip 23.3.1
protobuf 5.26.0rc1
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 15.0.0
pyarrow-hotfix 0.6
pydantic 2.6.0
pydantic_core 2.16.1
python-dateutil 2.8.2
pytz 2024.1
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
safetensors 0.4.2
scipy 1.12.0
sentencepiece 0.1.99
setuptools 68.2.2
six 1.16.0
sympy 1.12
tabulate 0.9.0
termcolor 2.4.0
tokenizers 0.14.1
torch 2.1.0a0+cxx11.abi
torchvision 0.16.0a0+cxx11.abi
tqdm 4.66.1
transformers 4.34.0
typing_extensions 4.9.0
tzdata 2023.4
urllib3 2.2.0
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.4

@sriraman2020
Author

sriraman2020 commented Feb 2, 2024

@plusbang Looks like a oneCCL issue? Let me know if any more information is required.
[screenshot of the error log]

@plusbang
Contributor

plusbang commented Feb 4, 2024

@plusbang Looks like a oneCCL issue? Let me know if any more information is required. [screenshot of the error log]

Yeah, it seems like a oneCCL-related issue. We previously encountered another oneCCL-related bug and solved it with sudo apt install level-zero-dev (https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora#7-troubleshooting). Maybe you could also try it.
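For reference, a quick way to check which Level Zero packages are present before reinstalling (the dpkg query is generic; the only package name taken from this thread is level-zero-dev):

# list installed Level Zero packages
dpkg -l | grep -i level-zero
# (re)install the dev package suggested in the troubleshooting section
sudo apt install level-zero-dev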

@sriraman2020
Author

The driver is already installed and present:
[screenshot: installed Level Zero packages]

@plusbang
Contributor

plusbang commented Feb 6, 2024

The driver is already installed and present: [screenshot: installed Level Zero packages]

Maybe you could try export CCL_LOG_LEVEL=debug to obtain more detailed error messages from oneCCL.
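For reference, both debug switches that show up in the next comment can be exported in the launching shell before re-running:

# verbose oneCCL and torch-ccl logging for the next run
export CCL_LOG_LEVEL=debug
export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1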

@sriraman2020
Author

Below is the log with export CCL_LOG_LEVEL=debug and export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1 set:

[screenshot: CCL_LOG_LEVEL debug output]

@plusbang
Contributor

plusbang commented Feb 6, 2024

Below is the log with export CCL_LOG_LEVEL=debug and export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1 set: [screenshot: CCL_LOG_LEVEL debug output]

According to the log (Too many open files), maybe you could try to raise the system open file limit with ulimit -n 1048576.
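For reference, the limit is per shell session, so it needs to be raised in the same shell (on every node) before launching the training job; a minimal sketch:

# check the current soft limit on open file descriptors
ulimit -n
# raise it for this session, then launch the finetuning script
ulimit -n 1048576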

@sriraman2020
Author

It is running with the above fix (ulimit -n 1048576), but after making the code changes below (the added lines are marked with # added), it hits an error. Error message details and CCL debug logs are attached here.

with torch.autograd.profiler_legacy.profile(enabled=True, use_xpu=True, record_shapes=True) as prof:  # added
    trainer = transformers.Trainer(
        model=model,
        train_dataset=train_data,
        eval_dataset=val_data,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=micro_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            # warmup_ratio=0.03,
            # warmup_steps=100,
            max_grad_norm=0.3,
            # num_train_epochs=num_epochs,
            learning_rate=learning_rate,
            lr_scheduler_type="cosine",
            bf16=True,  # ensure training more stable
            logging_steps=1,
            optim="adamw_torch",
            evaluation_strategy="steps" if val_set_size > 0 else "no",
            save_strategy="steps",
            eval_steps=1 if val_set_size > 0 else None,
            save_steps=1,
            max_steps = 1,
            output_dir=output_dir,
            save_total_limit=1,
            load_best_model_at_end=True if val_set_size > 0 else False,
            ddp_find_unused_parameters=False if ddp else None,
            group_by_length=group_by_length,
            report_to="wandb" if use_wandb else None,
            run_name=wandb_run_name if use_wandb else None,
            gradient_checkpointing=gradient_checkpointing,
            ddp_backend="ccl",
            deepspeed=deepspeed,
            save_safetensors=False,
        ),
        data_collator=transformers.DataCollatorForSeq2Seq(
            tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
        ),
    )
    model.config.use_cache = False

    trainer.train(resume_from_checkpoint=resume_from_checkpoint)

    model.save_pretrained(output_dir)

    print(
        "\n If there's a warning about missing keys above, please disregard :)"
    )
torch.save(prof.table(sort_by="id", row_limit=-1), "./qlora_llama7b_finetuning_profile_id.pt")  # added
torch.save(prof.key_averages(group_by_input_shape=True).table(row_limit=-1), "./qlora_llama7b_finetuning_profile_detail.pt")  # added
prof.export_chrome_trace("./qlora_llama7b_finetuning_trace.json")  # added

if __name__ == "__main__":
    fire.Fire(train)
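As a side note, prof.table() returns a plain string, so the .pt files saved above are just pickled strings; a minimal sketch (file names as in the snippet) for inspecting them afterwards:

import torch

# the profiler tables were pickled with torch.save, so torch.load gives back the original strings
print(torch.load("./qlora_llama7b_finetuning_profile_id.pt"))
print(torch.load("./qlora_llama7b_finetuning_profile_detail.pt"))
# the Chrome trace JSON can be opened directly in chrome://tracing or Perfetto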

@jason-dai
Contributor

It is running with the above fix (ulimit -n 1048576), but after making the code changes below, it hits an error. Error message details and CCL debug logs are attached here.

It seems there are no error messages or logs here?

@sriraman2020
Author

Actually, it's working fine. The error was due to the shared system. We are able to successfully collect performance stats.
Thanks!

@hkvision
Contributor

hkvision commented Feb 9, 2024

Thanks for your response. Since the issue is resolved, we are closing it. Feel free to raise new issues in the future :)

hkvision closed this as completed Feb 9, 2024