Cannot deploy Llama 3.1 70B Instruct #109

Open
DougAtNvidia opened this issue Dec 3, 2024 · 0 comments

NIM_IMAGE: nvcr.io/nim/meta/llama-3.1-70b-instruct:1.3.0
page: nim-deploy/kserve at main · NVIDIA/nim-deploy · GitHub
Steps:

  1. install k8s and kserve (cloud-native-stack/playbooks at master · NVIDIA/cloud-native-stack (github.com))
  2. git clone nim-deploy/kserve at main · NVIDIA/nim-deploy · GitHub
  3. export NGC_API_KEY=<NGC_API_KEY>
  4. export HF_TOKEN=<HF_TOKEN>
  5. export NODE_NAME=<NODE_NAME>
  6. cd ~/nim-deploy/kserve
  7. bash scripts/setup.sh
  8. download model: kubectl create -f download-profile.yaml
  9. kubectl apply -f llama-3.1-70b-instruct_2xgpu_1.1.0.yaml

Issue 1: permission error when downloading the model to the cache folder
After step 8, observe:
INFO 2024-11-26 10:19:04.812 pre_download.py:87] Fetching contents for profile tensorrt_llm-h100_nvl-fp8-tp2-pp1-throughput
INFO 2024-11-26 10:19:04.812 pre_download.py:93] {
  "feat_lora": "false",
  "gpu": "H100_NVL",
  "gpu_device": "2321:10de",
  "llm_engine": "tensorrt_llm",
  "pp": "1",
  "precision": "fp8",
  "profile": "throughput",
  "tp": "2"
}
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:128] One or more errors fetching files:
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
[11-26 10:19:17.953 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:130] I/O error Permission denied (os error 13)
Traceback (most recent call last):
  File "/opt/nim/llm/.venv/bin/download-to-cache", line 6, in <module>
    sys.exit(download_to_cache())
  File "/opt/nim/llm/nim_llm_sdk/hub/pre_download.py", line 97, in download_to_cache
    cached_files = repo.get_all()
Exception: I/O error Permission denied (os error 13)

Analysis:
After giving the folder /raid/nvidia-nim 777 permissions (sudo chmod -R 777 /raid/nvidia-nim), the NIM model downloads successfully.
Suggestion: add sudo chmod -R 777 /raid/nvidia-nim to the setup.sh script, or remind the user to give 777 permissions to the model cache folder.
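
A pre-flight check could make this failure easier to spot before the download job runs. The sketch below only reports whether a directory is world-writable, which is the state the 777 workaround produces. The function name is_world_writable is hypothetical and not part of nim-deploy, and the check says nothing about which user the container runs as.

```shell
# Hypothetical helper, not part of nim-deploy.
# Returns 0 if the directory is world-writable, 1 if it is not,
# and 2 if the path is not a directory.
is_world_writable() {
  [ -d "$1" ] || return 2
  # -prune keeps find from descending; -perm -0002 matches the "other" write bit.
  [ -n "$(find "$1" -prune -perm -0002 -print)" ]
}

# Example use with the cache path from this issue (assumed to exist on the node):
# is_world_writable /raid/nvidia-nim || echo "cache folder is not world-writable"
```

If the check fails, the chmod above is the workaround that was verified here; a tighter fix such as chown to the container's user may also work but was not tested in this issue.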

Issue 2: OOM and crash after step 9
Observe:
[image: 6b8cd071-2f06-4025-8c51-1fd5b17e6ee3]

Analysis:
When the CPU memory in the runtime yaml file is set below 77G (the default value is 32G), the deployment always hits OOM and crashes.
When the CPU memory in the runtime yaml file is set to 77G or more, the NIM deploys with no issue.
[image: 8cc731b2-3e45-42b0-878d-09659707d59d]
Suggestion: tell the user to increase CPU and GPU memory if they encounter this issue.
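
Raising the limit in the runtime yaml only helps if the node actually has that much memory, so a quick check on the node may be worth doing first. This is a minimal sketch, assuming a Linux node with /proc/meminfo and treating the 77G figure from this issue as GiB. The function names are hypothetical.

```shell
# Hypothetical helpers, not part of nim-deploy.
# Print total memory in whole GiB, read from a meminfo-style file
# (defaults to /proc/meminfo, where MemTotal is reported in kB).
mem_total_gib() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Succeed if total memory is at least MIN GiB.
# Usage: has_min_memory_gib MIN [MEMINFO_FILE]
has_min_memory_gib() {
  min="$1"
  total="$(mem_total_gib "$2")"
  [ -n "$total" ] && [ "$total" -ge "$min" ]
}

# Example: warn if the node is below the 77G threshold seen in this issue.
# has_min_memory_gib 77 || echo "node has less than 77 GiB of memory"
```

The threshold is passed as an argument because 77G is only what was observed for this model and profile; other models will likely need a different value.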
