
Error in MetricCollector when starting pytorch/torchserve:0.12.0-gpu container #3349

Hspix commented Oct 19, 2024

🐛 Describe the bug

I'm encountering an issue when starting a container from the pytorch/torchserve:0.12.0-gpu image. The container starts, but the metric collector then fails to collect system metrics, specifically GPU utilization.
During actual model inference, only the CPU is used rather than the GPU.
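For reference, the same failure can be triggered outside of TorchServe by calling the code path the metric collector uses (a minimal sketch, run inside the container; it assumes the nvgpu/pynvml packages bundled with the image):

```python
# Minimal reproduction of the metric-collector call path (sketch, run inside
# the pytorch/torchserve:0.12.0-gpu container).
from nvgpu import list_gpus

# Mirrors ts/metrics/system_metrics.py -> gpu_utilization(); on this image it
# raises pynvml.nvml.NVMLError_FunctionNotFound as in the traceback below.
print(list_gpus.device_statuses())
```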

Error logs

WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-10-19T06:43:12,922 [DEBUG] main org.pytorch.serve.util.ConfigManager - xpu-smi not available or failed: Cannot run program "xpu-smi": error=2, No such file or directory
2024-10-19T06:43:12,928 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-10-19T06:43:12,941 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-10-19T06:43:12,999 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
2024-10-19T06:43:13,194 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.12.0
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 2
Number of CPUs: 48
Max heap size: 30688 M
Python executable: /home/venv/bin/python
Config file: /home/model-server/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://0.0.0.0:8082
Model Store: /home/model-server/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 32
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.|http(s)?://.]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /home/model-server/wf-store
CPP log config: N/A
Model config: N/A
System metrics command: default
Model API enabled: true
2024-10-19T06:43:13,209 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2024-10-19T06:43:13,233 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-10-19T06:43:13,293 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2024-10-19T06:43:13,294 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-10-19T06:43:13,295 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2024-10-19T06:43:13,295 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-10-19T06:43:13,296 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2024-10-19T06:43:14,224 [ERROR] Thread-1 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
_nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
File "/usr/lib/python3.9/ctypes/init.py", line 387, in getattr
func = self.getitem(name)
File "/usr/lib/python3.9/ctypes/init.py", line 392, in getitem
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/venv/lib/python3.9/site-packages/ts/metrics/metric_collector.py", line 27, in
system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 119, in collect_all
value(num_of_gpu)
File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 90, in gpu_utilization
statuses = list_gpus.device_statuses()
File "/home/venv/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 75, in device_statuses
return [device_status(device_index) for device_index in range(device_count)]
File "/home/venv/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 75, in
return [device_status(device_index) for device_index in range(device_count)]
File "/home/venv/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 19, in device_status
nv_procs = nv.nvmlDeviceGetComputeRunningProcesses(handle)
File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 2608, in nvmlDeviceGetComputeRunningProcesses
return nvmlDeviceGetComputeRunningProcesses_v3(handle);
File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 2576, in nvmlDeviceGetComputeRunningProcesses_v3
fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
File "/home/venv/lib/python3.9/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
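For context, the missing entry point can be checked against the driver library directly (a sketch; the library path is taken from the traceback above, and the interpretation that the host driver predates the _v3 API is an assumption):

```python
# Probe the host NVML library for the symbol pynvml is trying to resolve
# (sketch; path taken from the AttributeError above).
import ctypes

lib = ctypes.CDLL("/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1")
if hasattr(lib, "nvmlDeviceGetComputeRunningProcesses_v3"):
    print("nvmlDeviceGetComputeRunningProcesses_v3 is exported")
else:
    # An older host driver would expose only earlier revisions of this call,
    # which would produce the NVMLError_FunctionNotFound above (assumption).
    print("nvmlDeviceGetComputeRunningProcesses_v3 is missing")
```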

Installation instructions

docker pull pytorch/torchserve:0.12.0-gpu

Model Packaging

No Packaging

config.properties

disable_token_authorization=true
enable_model_api=true
service_envelope=body
install_py_dep_per_model=true
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
grpc_inference_address=0.0.0.0
grpc_management_address=0.0.0.0
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store

Versions

Docker version 26.1.3

Repro instructions

docker run --rm -it --gpus all -d -p 28380:8080 -p 28381:8081 --name torch-server-g -v ./config.properties:/home/model-server/config.properties pytorch/torchserve:0.12.0-gpu
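To confirm the GPUs are otherwise visible inside the container (relevant to the CPU-only inference noted above), a quick check can be run in it (sketch; assumes the PyTorch build shipped in the image):

```python
# Quick GPU visibility check inside the running container (sketch).
import torch

print("CUDA available:", torch.cuda.is_available())
print("Visible devices:", torch.cuda.device_count())
```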

Possible Solution

No response
