Error in MetricCollector when starting pytorch/torchserve:0.12.0-gpu container
🐛 Describe the bug
I'm encountering an issue when trying to start a container using the pytorch/torchserve:0.12.0-gpu image. The container starts but then fails to collect system metrics, specifically related to GPU utilization.
As a result, actual model inference can only run on the CPU rather than the GPU.
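For context, a minimal sanity check of GPU visibility from PyTorch inside the container (a hypothetical diagnostic sketch, not part of the original report; run with the container's /home/venv/bin/python) would look like:

# Quick check of whether PyTorch can see the GPUs inside the container (illustrative only).
import torch

print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))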
Error logs
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-10-19T06:43:12,922 [DEBUG] main org.pytorch.serve.util.ConfigManager - xpu-smi not available or failed: Cannot run program "xpu-smi": error=2, No such file or directory
2024-10-19T06:43:12,928 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-10-19T06:43:12,941 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-10-19T06:43:12,999 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
2024-10-19T06:43:13,194 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.12.0
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 2
Number of CPUs: 48
Max heap size: 30688 M
Python executable: /home/venv/bin/python
Config file: /home/model-server/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://0.0.0.0:8082
Model Store: /home/model-server/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 32
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.|http(s)?://.]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /home/model-server/wf-store
CPP log config: N/A
Model config: N/A
System metrics command: default
Model API enabled: true
2024-10-19T06:43:13,209 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2024-10-19T06:43:13,233 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-10-19T06:43:13,293 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2024-10-19T06:43:13,294 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-10-19T06:43:13,295 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2024-10-19T06:43:13,295 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-10-19T06:43:13,296 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2024-10-19T06:43:14,224 [ERROR] Thread-1 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
_nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
File "/usr/lib/python3.9/ctypes/init.py", line 387, in getattr
func = self.getitem(name)
File "/usr/lib/python3.9/ctypes/init.py", line 392, in getitem
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/venv/lib/python3.9/site-packages/ts/metrics/metric_collector.py", line 27, in
system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 119, in collect_all
value(num_of_gpu)
File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 90, in gpu_utilization
statuses = list_gpus.device_statuses()
File "/home/venv/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 75, in device_statuses
return [device_status(device_index) for device_index in range(device_count)]
File "/home/venv/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 75, in
return [device_status(device_index) for device_index in range(device_count)]
File "/home/venv/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 19, in device_status
nv_procs = nv.nvmlDeviceGetComputeRunningProcesses(handle)
File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 2608, in nvmlDeviceGetComputeRunningProcesses
return nvmlDeviceGetComputeRunningProcesses_v3(handle);
File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 2576, in nvmlDeviceGetComputeRunningProcesses_v3
fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
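For reference, the failing NVML call can be reproduced outside of TorchServe with a minimal pynvml snippet (a sketch, assuming pynvml is importable from the container's venv). The _v3 entry point is only exported by newer NVIDIA drivers, so an older host driver mapped into the container would be consistent with the undefined-symbol error above:

# Minimal NVML reproduction (illustrative; run inside the container with /home/venv/bin/python).
import pynvml

pynvml.nvmlInit()
print("Driver version:", pynvml.nvmlSystemGetDriverVersion())
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    # pynvml dispatches this to nvmlDeviceGetComputeRunningProcesses_v3, which raises
    # NVMLError_FunctionNotFound when libnvidia-ml.so.1 does not export the _v3 symbol.
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    print("Running compute processes:", len(procs))
except pynvml.NVMLError as err:
    print("NVML error:", err)
pynvml.nvmlShutdown()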
Installation instructions
docker pull pytorch/torchserve:0.12.0-gpu
Model Packaging
No Packaging
config.properties
disable_token_authorization=true
enable_model_api=true
service_envelope=body
install_py_dep_per_model=true
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
grpc_inference_address=0.0.0.0
grpc_management_address=0.0.0.0
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store
Versions
Docker version 26.1.3
Repro instructions
docker run --rm -it --gpus all -d -p 28380:8080 -p 28381:8081 --name torch-server-g -v ./config.properties:/home/model-server/config.properties pytorch/torchserve:0.12.0-gpu
Possible Solution
No response