
"TorchServe Fails to Reload Models from S3 After k8s Pod Restart" #3324

Open
koscevicb opened this issue Sep 20, 2024 · 1 comment


koscevicb commented Sep 20, 2024

🐛 Describe the bug

I am using TorchServe for model inference, deployed in a Docker container orchestrated by Kubernetes. The models are stored externally in an S3 bucket and are loaded into the model store via S3 URLs at runtime.
When the pod running TorchServe restarts, TorchServe no longer loads the models from their S3 URLs; instead it falls back to the local copies, so modelUrl points to the .mar file in the model store rather than to S3.
I can unregister a model, but the corresponding .mar file in the model store is not deleted (as it is when modelUrl points to S3), and I cannot register the model from S3 again until I manually remove the .mar file from the local model store. The .mar file is apparently never overwritten, contrary to what one would expect.
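The manual workaround described above can be sketched against the TorchServe management API. This is only a sketch: it assumes the management address from config.properties (port 8081), and the model name and S3 URL are placeholders.

```python
import os
import urllib.parse
import urllib.request

MANAGEMENT = "http://localhost:8081"            # assumed management address
MODEL_STORE = "/home/model-server/model-store"  # from config.properties


def register_url(mar_url: str) -> str:
    """Build the management-API URL that registers a model from an S3 URL."""
    return f"{MANAGEMENT}/models?" + urllib.parse.urlencode({"url": mar_url})


def reload_from_s3(model_name: str, mar_url: str) -> None:
    """Unregister the model, remove the stale local .mar, re-register from S3."""
    # 1. Unregister the currently loaded copy.
    req = urllib.request.Request(f"{MANAGEMENT}/models/{model_name}", method="DELETE")
    urllib.request.urlopen(req)

    # 2. Delete the stale .mar that unregistering leaves behind (the bug).
    stale = os.path.join(MODEL_STORE, f"{model_name}.mar")
    if os.path.exists(stale):
        os.remove(stale)

    # 3. Register again from the S3 URL.
    urllib.request.urlopen(urllib.request.Request(register_url(mar_url), method="POST"))
```

Without step 2, the final registration fails with the 500 error shown below.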

Error logs

```json
{
  "code": 500,
  "type": "InternalServerException",
  "message": "Model file already exists product_classifier_4995.mar"
}
```

Installation instructions

TorchServe is run with Docker on Kubernetes, latest version (0.11.1).
It runs on CPU and is used for vision models.
Models are loaded via URLs to an S3 bucket.

Model Packaging

using the official example instructions
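The packaging step can be sketched as a torch-model-archiver invocation, as in the official examples. This is a sketch only: the file names and handler are placeholders, not the actual files used here.

```python
import subprocess


def archive_command(model_name: str, version: str,
                    serialized_file: str, handler: str) -> list:
    """Build a torch-model-archiver command line (official-example style).

    All file names passed in are placeholders for illustration.
    """
    return [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", serialized_file,
        "--handler", handler,
        "--export-path", "/home/model-server/model-store",
    ]


# Example with hypothetical names:
cmd = archive_command("product_classifier_4995", "1.0",
                      "model.pt", "image_classifier")
# subprocess.run(cmd, check=True)  # requires torch-model-archiver installed
```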

config.properties

```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store
install_py_dep_per_model=true
load_models=all
default_workers_per_model=2
disable_token_authorization=true
enable_model_api=true
```

Versions

0.11.1

Repro instructions

Mentioned above
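Spelled out, the reproduction is a short sequence of management-API calls around a pod restart. A sketch, assuming the management address from config.properties; the model name and S3 URL are placeholders:

```python
import urllib.parse

MANAGEMENT = "http://localhost:8081"  # assumed management address


def repro_steps(model_name: str, s3_url: str) -> list:
    """The call sequence that reproduces the failure (placeholder names)."""
    register = f"{MANAGEMENT}/models?" + urllib.parse.urlencode({"url": s3_url})
    return [
        f"POST {register}",                          # 1. register from S3 -> 200
        "(Kubernetes restarts the TorchServe pod)",  # 2. pod restart
        f"DELETE {MANAGEMENT}/models/{model_name}",  # 3. unregister; .mar remains
        f"POST {register}",                          # 4. re-register -> 500
    ]
```

After step 3 the .mar file is still present in the model store, so step 4 returns the "Model file already exists" error above.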

Possible Solution

No response

@matej14086

Having the same issue.
@lxning could this be something related to model-snapshot not saving the model URL for models loaded from S3?
