NIM_IMAGE: nvcr.io/nim/meta/llama-3.1-70b-instruct:1.3.0
page: nim-deploy/kserve at main · NVIDIA/nim-deploy · GitHub
Steps:
3. export NGC_API_KEY=<NGC_API_KEY>
4. export HF_TOKEN=<HF_TOKEN>
5. export NODE_NAME=<NODE_NAME>
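A minimal sketch of a pre-flight check for the three variables exported above, so that a missing value is caught before the later steps run. The helper name require_env is hypothetical and is not part of nim-deploy; it only checks that each variable is exported and non-empty.

```shell
# Hypothetical helper (not part of nim-deploy): report every required
# variable that is not exported or is empty, and return non-zero if any is.
require_env() {
  missing=0
  for name in "$@"; do
    # printenv only sees exported variables, which is what the steps set.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (assumed placement: right after the export steps):
#   require_env NGC_API_KEY HF_TOKEN NODE_NAME || exit 1
```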
Issue 1: Permission error when downloading the model to the cache folder
Observed after step 8:
INFO 2024-11-26 10:19:04.812 pre_download.py:87] Fetching contents for profile tensorrt_llm-h100_nvl-fp8-tp2-pp1-throughput
INFO 2024-11-26 10:19:04.812 pre_download.py:93] {
  "feat_lora": "false",
  "gpu": "H100_NVL",
  "gpu_device": "2321:10de",
  "llm_engine": "tensorrt_llm",
  "pp": "1",
  "precision": "fp8",
  "profile": "throughput",
  "tp": "2"
}
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:128] One or more errors fetching files:
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
Traceback (most recent call last):
  File "/opt/nim/llm/.venv/bin/download-to-cache", line 6, in <module>
    sys.exit(download_to_cache())
  File "/opt/nim/llm/nim_llm_sdk/hub/pre_download.py", line 97, in download_to_cache
    cached_files = repo.get_all()
Exception: I/O error Permission denied (os error 13)
Analyze:
After giving the folder /raid/nvidia-nim 777 permissions (sudo chmod -R 777 /raid/nvidia-nim), the NIM model downloads successfully.
Suggest: add sudo chmod -R 777 /raid/nvidia-nim to the setup.sh script, or remind users to give 777 permissions to the model cache folder.
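As a rough sketch of the reminder suggested above, a script could check whether the cache folder is writable by "others" (the state that chmod 777 produces) before starting the download. The helper name is_world_writable is hypothetical, and the check assumes GNU stat on Linux; it does not inspect subdirectories or the user the container actually runs as.

```shell
# Hypothetical helper (not part of setup.sh): succeed if "others" have
# write permission on the given directory. Assumes GNU stat (Linux).
is_world_writable() {
  mode=$(stat -c '%a' "$1") || return 1
  # The last octal digit holds the permissions for "others".
  other=${mode#"${mode%?}"}
  case "$other" in
    2|3|6|7) return 0 ;;   # write bit is set
    *) return 1 ;;
  esac
}

# Usage (path taken from the report above):
#   is_world_writable /raid/nvidia-nim \
#     || echo "run: sudo chmod -R 777 /raid/nvidia-nim" >&2
```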
Issue 2: OOM and crash after step 9
Observe:
(attachment: 6b8cd071-2f06-4025-8c51-1fd5b17e6ee3)
Analyze:
When the CPU memory in the runtime YAML file is set below 77G (the default value is 32G), the deployment always hits OOM and crashes.
When the CPU memory is set to 77G or more in the runtime YAML file, NIM deploys with no issue.
(attachment: 8cc731b2-3e45-42b0-878d-09659707d59d)
Suggest: tell users to increase the CPU and GPU memory if they encounter this issue.
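Before raising the memory value in the runtime YAML, it may help to confirm that the node has that much memory in the first place. The sketch below reads MemTotal from /proc/meminfo and compares it with a threshold in GiB. The helper name has_min_memory_gib is hypothetical, and treating the 77G figure above as 77 GiB is an assumption; the memory Kubernetes can actually allocate on the node will be somewhat lower than MemTotal.

```shell
# Hypothetical helper: succeed if total host memory is at least the given
# number of GiB. The optional second argument overrides the meminfo path.
has_min_memory_gib() {
  required_gib="$1"
  meminfo="${2:-/proc/meminfo}"
  # MemTotal is reported in kB (1024-byte units) on Linux.
  total_kib=$(awk '/^MemTotal:/ {print $2}' "$meminfo")
  required_kib=$((required_gib * 1024 * 1024))
  [ "${total_kib:-0}" -ge "$required_kib" ]
}

# Usage (threshold taken from the report above, assumed to mean GiB):
#   has_min_memory_gib 77 || echo "node has less than 77 GiB of memory" >&2
```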