
"TorchServe Fails to Reload Models from S3 After k8s Pod Restart" #3324

Open
koscevicb opened this issue Sep 20, 2024 · 1 comment


koscevicb commented Sep 20, 2024

🐛 Describe the bug

I am using TorchServe for model inference, deployed in a Docker container orchestrated by Kubernetes. The models are stored externally in an S3 bucket and are loaded into the model store via S3 URLs at runtime.
When the pod running TorchServe restarts, TorchServe no longer loads the models from their S3 URLs; instead it falls back to the local copies, so modelUrl points to the .mar file in the model store rather than to S3.
I can unregister a model, but the corresponding .mar file in the model store is not deleted (as it is when modelUrl points to S3), and I cannot register the model from S3 again until I manually remove the .mar file from the local model store. The .mar file is apparently never overwritten, contrary to what one would expect.
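The manual workaround described above can be sketched against the TorchServe management API. This is only a sketch: it assumes the management address from config.properties (port 8081), and the model name and S3 URL are placeholders.

```python
import os
import urllib.parse
import urllib.request

MANAGEMENT = "http://localhost:8081"            # assumed management address
MODEL_STORE = "/home/model-server/model-store"  # from config.properties


def register_url(mar_url: str) -> str:
    """Build the management-API URL that registers a model from an S3 URL."""
    return f"{MANAGEMENT}/models?" + urllib.parse.urlencode({"url": mar_url})


def reload_from_s3(model_name: str, mar_url: str) -> None:
    """Unregister the model, remove the stale local .mar, re-register from S3."""
    # 1. Unregister the currently loaded copy.
    req = urllib.request.Request(f"{MANAGEMENT}/models/{model_name}", method="DELETE")
    urllib.request.urlopen(req)

    # 2. Delete the stale .mar that unregistering leaves behind (the bug).
    stale = os.path.join(MODEL_STORE, f"{model_name}.mar")
    if os.path.exists(stale):
        os.remove(stale)

    # 3. Register again from the S3 URL.
    urllib.request.urlopen(urllib.request.Request(register_url(mar_url), method="POST"))
```

Without step 2, the final registration fails with the 500 error shown below.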

Error logs

```json
{
  "code": 500,
  "type": "InternalServerException",
  "message": "Model file already exists product_classifier_4995.mar"
}
```

Installation instructions

TorchServe is run with Docker on Kubernetes, latest version (0.11.1).
It runs on CPU and is used for vision models.
Models are loaded via URLs to an S3 bucket.

Model Packaging

using the official example instructions
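The packaging step can be sketched as a torch-model-archiver invocation, as in the official examples. This is a sketch only: the file names and handler are placeholders, not the actual files used here.

```python
import subprocess


def archive_command(model_name: str, version: str,
                    serialized_file: str, handler: str) -> list:
    """Build a torch-model-archiver command line (official-example style).

    All file names passed in are placeholders for illustration.
    """
    return [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", serialized_file,
        "--handler", handler,
        "--export-path", "/home/model-server/model-store",
    ]


# Example with hypothetical names:
cmd = archive_command("product_classifier_4995", "1.0",
                      "model.pt", "image_classifier")
# subprocess.run(cmd, check=True)  # requires torch-model-archiver installed
```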

config.properties

```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store
install_py_dep_per_model=true
load_models=all
default_workers_per_model=2
disable_token_authorization=true
enable_model_api=true
```

Versions

0.11.1

Repro instructions

Mentioned above
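Spelled out, the reproduction is a short sequence of management-API calls around a pod restart. A sketch, assuming the management address from config.properties; the model name and S3 URL are placeholders:

```python
import urllib.parse

MANAGEMENT = "http://localhost:8081"  # assumed management address


def repro_steps(model_name: str, s3_url: str) -> list:
    """The call sequence that reproduces the failure (placeholder names)."""
    register = f"{MANAGEMENT}/models?" + urllib.parse.urlencode({"url": s3_url})
    return [
        f"POST {register}",                          # 1. register from S3 -> 200
        "(Kubernetes restarts the TorchServe pod)",  # 2. pod restart
        f"DELETE {MANAGEMENT}/models/{model_name}",  # 3. unregister; .mar remains
        f"POST {register}",                          # 4. re-register -> 500
    ]
```

After step 3 the .mar file is still present in the model store, so step 4 returns the "Model file already exists" error above.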

Possible Solution

No response

@matej14086

Having the same issue.
@lxning could this be something related to model-snapshot not saving the model URL for models loaded from S3?
