LoRA LLAMA-70B finetuning fails on multi GPU. #10069
According to the log,
We are still seeing this error after disabling AMX with `export BIGDL_LLM_AMX_DISABLED=1`:

```
Uptime: 461.239374 s
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM WARNING: AMX state allocation in the OS failed!
LIBXSMM_TARGET: clx [Intel(R) Xeon(R) Platinum 8480+]
```
Could you please provide more details about your environment (dependency version list)?
```
/BigDL/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora$ pip list
accelerate  0.23.0
```
@plusbang Looks like a oneCCL issue? Do let me know if any more information is required.
Yeah, it seems like a oneCCL-related issue. We previously encountered another oneCCL-related bug and solved it by raising the open-file limit (`ulimit -n 1048576`).
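A minimal sketch of that kind of fix, raising the per-process open-file-descriptor limit before launching the distributed job. The value `1048576` comes from the follow-up comment below; whether this limit is the root cause in any given run is an assumption worth verifying first:

```shell
# Show the current soft limit on open file descriptors
ulimit -n

# Raise it for this shell session before launching mpirun; oneCCL opens many
# sockets/FDs across ranks, and a low limit can surface as opaque errors.
# If the hard limit is lower, this fails and must be raised system-wide.
ulimit -n 1048576 2>/dev/null || echo "hard limit too low; raise it in /etc/security/limits.conf"
```

Note this only affects the current shell session; for a persistent change the limit has to be set at the system level.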
It runs with the above fix (`ulimit -n 1048576`), but after making the code changes below (**) it hits an error:
```python
if __name__ == "__main__":
```
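The line above was mangled by markdown in the original comment (the double underscores were eaten). For context, a minimal self-contained sketch of why the `__main__` guard matters when a script spawns worker processes, as distributed finetuning launchers do; the `worker` function and its doubling logic are illustrative, not from the BigDL repo:

```python
import multiprocessing as mp

def worker(rank):
    # Each spawned child re-imports this module; without the __main__ guard
    # below, the Pool creation would re-run recursively in every child.
    return rank * 2

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    with mp.Pool(2) as pool:
        results = pool.map(worker, range(4))
    print(results)  # [0, 2, 4, 6]
```

Moving top-level execution under this guard is a common fix when a multi-process launch loops or crashes at startup.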
It seems there are no error messages or logs here?
Actually, it's working fine now. The error was due to the shared system. We were able to successfully collect performance stats.
Thanks for your response. Since the issue is resolved, we are closing it. Feel free to raise new issues in the future :)
https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/LLM-Finetuning/LoRA
lora_finetune_llama2_7b_pvc_1550_4_card.sh works fine with the 7B model, but replacing the workload with Llama-70B (meta-llama/Llama-2-70b-hf) fails.
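One plausible reason the 70B run fails where 7B succeeds: the `PI_ERROR_OUT_OF_RESOURCES` in the log below points at resource (likely memory) exhaustion, and 70B base weights alone are roughly an order of magnitude larger than 7B. A back-of-the-envelope sketch, assuming 2-byte bf16/fp16 weights and deliberately ignoring optimizer state, gradients, and activations:

```python
# Rough weight-memory estimate; illustrative numbers only.
def weight_gib(n_params, bytes_per_param=2):
    """GiB needed to hold n_params parameters at bytes_per_param each."""
    return n_params * bytes_per_param / 1024**3

llama2_7b = weight_gib(7e9)     # ~13 GiB, fits comfortably per card
llama2_70b = weight_gib(70e9)   # ~130 GiB, before any training overhead
print(round(llama2_7b, 1), round(llama2_70b, 1))  # 13.0 130.4
```

Even before optimizer and activation memory, the 70B weights must be sharded or quantized to fit per-device memory budgets that easily hold 7B.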
System config
```
(bigdl_31J) sdp@aia-sdp-pvc-135536:/localdisk/sdp/sudarsh2/rsrirama/BigDL/python/llm/example/GPU/LLM-Finetuning/LoRA$ clinfo | grep "compute"
  Max compute units    224
  Max compute units    224
  Max compute units    512
  Max compute units    512
  Max compute units    512
  Max compute units    512
  Max compute units    512
  Max compute units    512
  Max compute units    512
  Max compute units    512
(bigdl_31J) sdp@aia-sdp-pvc-135536:/localdisk/sdp/sudarsh2/rsrirama/BigDL/python/llm/example/GPU/LLM-Finetuning/LoRA$
```
Error log below
```
RuntimeError: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 7 PID 2019202 RUNNING AT aia-sdp-pvc-135536
=   KILLED BY SIGNAL: 9 (Killed)
```
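`KILLED BY SIGNAL: 9` alongside `PI_ERROR_OUT_OF_RESOURCES` often indicates memory exhaustion, either on the device or on the host (where the kernel OOM killer sends SIGKILL). A hedged inspection sketch; commands assume a Linux host, and `dmesg` may require root:

```shell
# Look for host OOM-killer activity around the crash (may require root;
# an empty result means the host OOM killer is likely not the culprit)
dmesg -T 2>/dev/null | grep -iE 'out of memory|oom-kill' | tail -n 5

# Snapshot host memory; run alongside the job to watch per-rank growth
free -h
```

If the host is healthy, the next suspect is device memory, which for a 70B model typically means sharding the weights across more devices or quantizing them.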