LLM: update FAQ about too many open files (#10119)
plusbang authored Feb 7, 2024
1 parent 3cf601a commit 91800fd
Showing 6 changed files with 22 additions and 20 deletions.
8 changes: 8 additions & 0 deletions docs/readthedocs/source/doc/LLM/Overview/FAQ/resolve_error.md
Original file line number Diff line number Diff line change
@@ -53,3 +53,11 @@ This error is caused by out of GPU memory. Some possible solutions to decrease G
### failed to enable AMX

You could use `export BIGDL_LLM_AMX_DISABLED=1` to disable AMX manually and solve this error.

### oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized

You may encounter this error during finetuning on multiple GPUs. Please try `sudo apt install level-zero-dev` to fix it.
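
Before launching a multi-GPU run, it can help to check whether the Level Zero loader is visible to the dynamic linker. The following is a minimal sketch, not a BigDL-LLM or oneCCL API: the helper name `level_zero_loader_available` is hypothetical, and the library name `libze_loader.so` is an assumption about what the missing package provides, so a `False` result is only a hint that installing `level-zero-dev` may be needed.

```python
import ctypes


def level_zero_loader_available(lib_name: str = "libze_loader.so") -> bool:
    """Return True if the dynamic linker can load `lib_name`.

    Hypothetical helper; the default library name is an assumption.
    """
    try:
        # ctypes.CDLL raises OSError when the shared library cannot be found or loaded.
        ctypes.CDLL(lib_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    if level_zero_loader_available():
        print("Level Zero loader found.")
    else:
        print("Level Zero loader not found; consider `sudo apt install level-zero-dev`.")
```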

### Too many open files

You may encounter this error during finetuning, especially when running a 70B model. Please raise the system open file limit using `ulimit -n 1048576`.
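
When finetuning is launched from a Python script, the open file limit can also be inspected and raised from inside the process. The following is a minimal sketch, assuming a Unix-like system where Python's standard `resource` module is available; the helper name `raise_open_file_limit` is hypothetical and not part of BigDL-LLM. It only affects the current process and its children, and an unprivileged process cannot go above the hard limit, so the value actually applied may be lower than the requested one.

```python
import resource


def raise_open_file_limit(target: int = 1048576) -> int:
    """Raise the soft RLIMIT_NOFILE toward `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards. Hypothetical helper.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process may only raise its soft limit up to the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if soft == resource.RLIM_INFINITY or new_soft <= soft:
        return soft  # the current limit is already high enough
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        # Some systems reject values above a kernel-level maximum; keep the old limit.
        return soft
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("open file soft limit:", raise_open_file_limit())
```

If the returned value is still too low, the hard limit itself has to be raised by an administrator, or `ulimit -n` has to be run in the launching shell as described above.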
6 changes: 1 addition & 5 deletions python/llm/example/GPU/LLM-Finetuning/LoRA/README.md
@@ -83,8 +83,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.

### 7. Troubleshooting
- If you fail to finetune on multiple cards because of the following error message:
```bash
RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
```
Please try `sudo apt install level-zero-dev` to fix it.
Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
6 changes: 1 addition & 5 deletions python/llm/example/GPU/LLM-Finetuning/QA-LoRA/README.md
@@ -77,8 +77,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.

### 7. Troubleshooting
- If you fail to finetune on multiple cards because of the following error message:
```bash
RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
```
Please try `sudo apt install level-zero-dev` to fix it.
Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
@@ -160,8 +160,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.

### 7. Troubleshooting
- If you fail to finetune on multiple cards because of the following error message:
```bash
RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
```
Please try `sudo apt install level-zero-dev` to fix it.
Please refer to [here](../../README.md#troubleshooting) for solutions to common issues during finetuning.
10 changes: 10 additions & 0 deletions python/llm/example/GPU/LLM-Finetuning/README.md
@@ -8,3 +8,13 @@ This folder contains examples of running different training mode with BigDL-LLM
- [ReLora](ReLora): examples of running ReLora finetuning
- [DPO](DPO): examples of running DPO finetuning
- [common](common): common templates and utility classes in finetuning examples


## Troubleshooting
- If you fail to finetune on multiple cards because of the following error message:
```bash
RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
```
Please try `sudo apt install level-zero-dev` to fix it.

- Please raise the system open file limit using `ulimit -n 1048576`. Otherwise, you may encounter the error `Too many open files`.
6 changes: 1 addition & 5 deletions python/llm/example/GPU/LLM-Finetuning/ReLora/README.md
@@ -83,8 +83,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.

### 7. Troubleshooting
- If you fail to finetune on multiple cards because of the following error message:
```bash
RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
```
Please try `sudo apt install level-zero-dev` to fix it.
Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
