[BUG] Issue with Zero Optimization for Llama-2-7b Fine-Tuning on Intel GPUs #6713
Comments
@delock, can you please help? Thanks!
@molang66 Hi, I reran the cmd that you pasted in this issue and no such error appeared, so I suspect a version mismatch or an outdated component. I verified the cmd with the following versions: Ubuntu 22.04.2 LTS. Can you provide more details about your development environment? Or you can try using my verified versions :)
Hi @tjruwase, @Liangliang-Ma will follow up with this issue. Thanks!
Thanks so much for the help. I have updated my CCL version, and now I am encountering this issue:
I was running on the Stampede3 cluster, and my environment is as follows:
GPU driver:
Do you have the latest version of deepspeed? I have seen a similar issue with an outdated deepspeed.
@Liangliang-Ma My deepspeed version is 0.15.3. I think this is the latest version.
Could it be my GPU driver version? I don't know what the latest version of the driver is.
@Liangliang-Ma Thank you for your response. I'd like to know which command checks the GPU driver version; I didn't see any indication of the rolling stable version.
Is this normal? Compilation worked fine with version 24.2.1.
@molang66 You can check with dpkg -l | grep -P "intel|level-zero|libigc|libigd|libigf|opencl" to see the installed components. If you install the same GPU driver version as mine, you will see output like this: ii intel-fw-gpu 2024.24.5-337-22.04. And we suggest you keep using oneAPI 2024.2.1 with ipex 2.3.110, because these versions currently match each other.
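Alongside the dpkg check for the driver stack, a quick way to report the Python-side versions being compared in this thread is via `importlib.metadata`. This is a minimal sketch; the package names are assumptions based on the usual pip distribution names for DeepSpeed, PyTorch, IPEX, and the oneCCL bindings:

```python
# Print installed versions of the relevant Python packages, if present.
# Package names below are assumed pip distribution names, not confirmed by this thread.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("deepspeed", "torch", "intel-extension-for-pytorch", "oneccl-bind-pt"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```

Pasting this output into the issue makes it easier to spot a mismatch like the outdated CCL version found earlier.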
Describe the bug
I’m experiencing an issue when fine-tuning the Llama-2-7b model from Hugging Face with Zero optimization enabled. I am running on 8 Intel Max 1550 GPUs using the code from the examples provided in Intel Extension for DeepSpeed.
The model loads and runs successfully without Zero optimization, but when I enable Zero optimization (particularly with stage 3), I encounter the following errors:
[rank0]: RuntimeError: could not create an engine
2024:11:05-02:39:09:(678567) |CCL_INFO| finalizing level-zero
2024:11:05-02:39:09:(678567) |CCL_INFO| finalized level-zero
0%| | 0/50 [00:00<?, ?it/s]
2024:11:05-02:39:09:(678572) |CCL_INFO| finalizing level-zero
2024:11:05-02:39:09:(678566) |CCL_INFO| finalizing level-zero
...
[2024-11-05 02:39:10,447] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 678572
**System info**
Model: Llama-2-7b from Hugging Face
GPUs: 8x Intel Max 1550 GPUs
Software:
• Intel Extension for PyTorch
• DeepSpeed with Zero Optimization (Stage 3)
• oneCCL for communication backend
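For reference, a minimal ZeRO stage 3 configuration in the spirit of the `ds_config_zero3.json` passed to the launcher below might look like this. This is a sketch, not the exact file from the transformers repo; the specific field values are assumptions:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

The `"auto"` values let the HuggingFace Trainer fill in batch-size settings from its own command-line arguments.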
Launcher context
```shell
cd transformers
deepspeed --num_gpus=8 examples/pytorch/language-modeling/run_clm.py \
  --deepspeed tests/deepspeed/ds_config_zero3.json \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --dataloader_num_workers 0 \
  --per_device_train_batch_size 1 \
  --warmup_steps 10 \
  --max_steps 50 \
  --bf16 \
  --do_train \
  --output_dir /tmp/test-clm \
  --overwrite_output_dir
```