I followed the documentation to deploy the Llama 3 8B Instruct model with multiple LoRA adapters, as described in this NVIDIA blog post: https://developer.nvidia.com/zh-cn/blog/deploy-multilingual-llms-with-nvidia-nim/
My machine has a V100 GPU, and my command is (the environment variables it references are listed right after it):
docker run -it --rm --name=$CONTAINER_NAME \
    --runtime=nvidia \
    --gpus "device=5" \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_PEFT_SOURCE \
    -e NIM_PEFT_REFRESH_INTERVAL \
    -e NIM_MAX_LORA_RANK \
    -v $NIM_CACHE_PATH:/opt/nim/.cache \
    -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
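For completeness, the variables the command references were exported beforehand. The values below are placeholders for illustration (my actual values follow the blog post above), not an exact copy of my setup:

export CONTAINER_NAME=llama3-8b-instruct-lora      # placeholder container name
export NGC_API_KEY=<your NGC API key>
export NIM_CACHE_PATH=~/.cache/nim                 # host-side model cache, mounted to /opt/nim/.cache
export LOCAL_PEFT_DIRECTORY=/path/to/loras         # host directory containing the LoRA adapters
export NIM_PEFT_SOURCE=/home/nvs/loras             # adapter path as seen inside the container
export NIM_PEFT_REFRESH_INTERVAL=3600              # seconds between re-scans for new adapters
export NIM_MAX_LORA_RANK=32                        # matches feat_lora_max_rank in the selected profile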
Running this, I encountered the following:
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
2024-08-02 06:29:52,569 [INFO] PyTorch version 2.2.2 available.
2024-08-02 06:29:53,660 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-08-02 06:29:53,661 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-08-02 06:29:53,704 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 08-02 06:29:55.426 api_server.py:489] NIM LLM API version 1.0.0
INFO 08-02 06:29:55.430 ngc_profile.py:215] Running NIM with LoRA enabled. Only looking for compatible profiles that support LoRA.
INFO 08-02 06:29:55.430 ngc_profile.py:219] Detected 1 compatible profile(s).
INFO 08-02 06:29:55.431 ngc_injector.py:106] Valid profile: 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora) on GPUs [0]
INFO 08-02 06:29:55.431 ngc_injector.py:141] Selected profile: 8d3824f766182a754159e88ad5a0bd465b1b4cf69ecf80bd6d6833753e945740 (vllm-fp16-tp1-lora)
INFO 08-02 06:29:56.93 ngc_injector.py:146] Profile metadata: feat_lora_max_rank: 32
INFO 08-02 06:29:56.93 ngc_injector.py:146] Profile metadata: feat_lora: true
INFO 08-02 06:29:56.94 ngc_injector.py:146] Profile metadata: llm_engine: vllm
INFO 08-02 06:29:56.94 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 08-02 06:29:56.94 ngc_injector.py:146] Profile metadata: tp: 1
INFO 08-02 06:29:56.94 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 08-02 06:30:00.30 ngc_injector.py:172] Model workspace is now ready. It took 3.936 seconds
INFO 08-02 06:30:00.38 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-uegwv0dx', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-uegwv0dx', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 08-02 06:30:00.786 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 08-02 06:30:00.827 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
INFO 08-02 06:30:05 selector.py:65] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 08-02 06:30:05 selector.py:33] Using XFormers backend.
INFO 08-02 06:30:14 model_runner.py:173] Loading model weights took 14.9771 GB
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in
engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 365, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 323, in init
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 160, in init
self._initialize_kv_caches()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 237, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 111, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 135, in determine_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 924, in profile_run
self.execute_model(seqs, kv_caches)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 845, in execute_model
hidden_states = model_executable(**execute_model_kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 361, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 286, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 224, in forward
hidden_states = self.input_layernorm(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/layernorm.py", line 59, in forward
out = torch.empty_like(x)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
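My suspicion is that the PyTorch/vLLM build inside the container ships no kernels for Volta (the V100 is compute capability 7.0, i.e. sm_70), which is what "no kernel image is available for execution on the device" usually means. A quick check, as a sketch to run inside the container (assuming nvidia-smi and the bundled Python are on PATH):

# Print the GPU's compute capability; a V100 should report 7.0.
nvidia-smi --query-gpu=name,compute_cap --format=csv

# List the CUDA architectures the bundled PyTorch was compiled for.
# If 'sm_70' is missing from this list, torch has no kernel image for
# the V100, matching the RuntimeError above.
python3 -c "import torch; print(torch.cuda.get_arch_list())"

If sm_70 is absent there, is the V100 simply not a supported target for this NIM image?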