[backend] Metadata writer pod always restarting #8200
Comments
Hey @zijianjoy. I completely missed this message, sorry. This happens with all of our clusters: as soon as we start Kubeflow Pipelines, the metadata-writer pod starts restarting with this issue. It still happens today, with k8s 1.24. I'm not sure I can give you more information, but I have some more logs:
I'm not sure of the reason or the details, but the issue disappeared after I rebooted one of the control-plane nodes.
I am getting the same error:
manifest
sa - role manifest
pod - log
I am getting the same error with Kubeflow 1.18 & K8s 1.27.6. Will this be fixed in the next Kubeflow release?
We have a possible solution described in a previous comment. Other than that, we need more info on when it happens, along with the KFP backend, KFP SDK, and k8s versions.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen
This still happens currently. The details are below; I will be glad to share other details if needed. The Kubeflow Pipelines version is 2.2.0, platform-agnostic, installed on GKE using the following command:
@OutSorcerer: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@kubeflow/pipelines I think this issue is still present, and it's related to a very old bug in the `kubernetes` Python client's watch implementation. I see that there are many other projects which have had the same issue:
Fix - Part 1
We need to update to a newer version of the `kubernetes` Python client used by the metadata writer.
We also probably want to make sure our other pinned dependencies are kept reasonably up to date.
Fix - Part 2
We probably need to implement a retry on ProtocolError (which includes `InvalidChunkLength` errors) so that a dropped watch connection does not crash the pod. Here are some people talking about that fix:

For reference, here is our code that does the watch: pipelines/backend/metadata_writer/src/metadata_writer.py, lines 157 to 162 at commit 4467df5.
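For illustration, a retry wrapper around that watch could look roughly like the sketch below. This is only a sketch under assumptions: the function and argument names (`watch_pods_forever`, `handle_event`, `k8s_api`) are placeholders rather than the actual metadata writer code, and the timeout values are arbitrary.

```python
import time

import kubernetes
import urllib3


def watch_pods_forever(k8s_api, namespace, label_selector, handle_event):
    """Keep a pod watch alive by reconnecting when the stream breaks."""
    resource_version = None
    while True:
        try:
            k8s_watch = kubernetes.watch.Watch()
            for event in k8s_watch.stream(
                k8s_api.list_namespaced_pod,
                namespace=namespace,
                label_selector=label_selector,
                resource_version=resource_version,
                timeout_seconds=1800,    # server-side watch timeout
                _request_timeout=2000,   # client-side socket timeout
            ):
                # Remember where we got to so the next watch can resume.
                resource_version = event['object'].metadata.resource_version
                handle_event(event)
        except urllib3.exceptions.ProtocolError as e:
            # The connection was dropped mid-stream (for example by an
            # intermediate proxy). Back off briefly and reconnect instead
            # of letting the exception crash the pod.
            print(f'Pod watch connection broken, reconnecting: {e}')
            time.sleep(5)
        # Note: a production version would also handle ApiException 410
        # ("resource version too old") by resetting resource_version to None.
```

The key point is that the `Watch` stream is re-created inside the loop, so a single broken connection only costs a reconnect instead of a pod restart.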
/reopen
@thesuperzapper: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@thesuperzapper, I made a PR that fixes the exception on my installation of Kubeflow Pipelines on GKE, please have a look: #11361. When I significantly reduced the timeouts for an experiment like this (see the sketch below),
the errors disappeared (after testing for multiple hours; originally the errors were happening twice per hour), so it seems that on GKE there is a proxy somewhere that breaks long-lived connections. But the fix I propose in the PR is to keep
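A timeout-reduction experiment of that kind might look roughly like the following sketch. The namespace, label selector, and timeout values here are assumptions for illustration, not the commenter's actual code or numbers:

```python
import kubernetes

kubernetes.config.load_incluster_config()   # or load_kube_config() outside the cluster
k8s_api = kubernetes.client.CoreV1Api()

# Hypothetical experiment: timeouts far shorter than any proxy idle limit,
# so the watch is re-established before the connection can be dropped.
k8s_watch = kubernetes.watch.Watch()
for event in k8s_watch.stream(
    k8s_api.list_namespaced_pod,
    namespace='kubeflow',                              # assumed namespace
    label_selector='workflows.argoproj.io/workflow',   # assumed Argo pod label
    timeout_seconds=60,      # ask the API server to end the watch after 60s
    _request_timeout=90,     # client-side socket timeout for the request
):
    print(event['type'], event['object'].metadata.name)
```

Shorter `timeout_seconds` makes the API server end each watch well before an intermediate proxy's idle limit is reached, at the cost of more frequent re-watch cycles.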
Environment
- Deployment: Manifests in k8s
- Kubernetes version: 1.21
- KFP version: 1.8.1 / 1.8.2 / 1.8.3 / 1.8.4
Steps to reproduce
Hi.
Since release 1.8.1 (I can't be sure about older versions) our metadata-writer pod has been restarting infinitely with the following error message:
We have already tried the most recent 1.8.x versions (we did not try version 2.0.0).
The pipelines are working very well, and so far we have not had any problems because of this, but it only happens with this pod.
This happens in our multiple clusters with multiple installations, so it does not look like an issue with a specific cluster.
Expected result
The pod should stop restarting.
Impacted by this bug? Give it a 👍.