
[backend] Metadata writer pod always restarting #8200

Open
andre-lx opened this issue Aug 26, 2022 · 14 comments · May be fixed by #11361

@andre-lx

andre-lx commented Aug 26, 2022

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
    Manifests in k8s
  • K8S Version:
    1.21
  • KFP version:
    1.8.1/1.8.2/1.8.3/1.8.4

Steps to reproduce

Hi.

Since release 1.8.1 (we can't be sure about older versions), our metadata-writer pod has been restarting endlessly with the following error message:

metadata-writer-78fc7d5bb8-ph9kj                         2/2     Running   299        78d
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 697, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 438, in _error_catcher
    yield
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 764, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 701, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 157, in <module>
    for event in pod_stream:
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 793, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

We have already tried the most recent 1.8.x releases (we did not try version 2.0.0).
The pipelines themselves work very well, and this has not caused us any problems so far, but it only happens with this pod.

This happens in multiple clusters with multiple installations, so it does not look like an issue with a specific cluster.

Expected result

The pod should stop restarting.


Impacted by this bug? Give it a 👍.

@zijianjoy
Collaborator

Hello @andre-lx, does this issue happen during start-up, or does it happen when you are running a specific pipeline? If there is more information about how to reproduce this issue, it would help us investigate the problem.

/assign @chensun

@andre-lx
Author

Hey @zijianjoy.

I completely missed this message, sorry.

This happens with all our clusters: as soon as we start Kubeflow Pipelines, the metadata-writer starts restarting with this error.

It is still happening today, with k8s 1.24.

I'm not sure I can give you more information, but I do have some more logs:

bash-5.1# kubectl logs metadata-writer-76675f9f9-tjr7j -n kubeflow
Connected to the metadata store
Start watching Kubernetes Pods created by Argo
bash-5.1# kubectl logs metadata-writer-76675f9f9-tjr7j -n kubeflow --previous
Connected to the metadata store
Start watching Kubernetes Pods created by Argo
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 697, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 438, in _error_catcher
    yield
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 764, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 701, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 157, in <module>
    for event in pod_stream:
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 793, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

@shaotingcheng

I'm not sure of the reason or the details, but the issue disappeared after I rebooted one of the control-plane nodes.

@Rohithzr

Rohithzr commented Feb 4, 2024

I am getting the same error:

> kubectl version
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.3

Deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: metadata-writer
    application-crd-id: kubeflow-pipelines
  name: metadata-writer
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: metadata-writer
      application-crd-id: kubeflow-pipelines
  template:
    metadata:
      labels:
        app: metadata-writer
        application-crd-id: kubeflow-pipelines
    spec:
      containers:
      - env:
        - name: NAMESPACE_TO_WATCH
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: gcr.io/ml-pipeline/metadata-writer:2.0.5
        name: main
      serviceAccountName: kubeflow-pipelines-metadata-writer

Service account Role manifest:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app: kubeflow-pipelines-metadata-writer-role
    application-crd-id: kubeflow-pipelines
  name: kubeflow-pipelines-metadata-writer-role
  namespace: kubeflow
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - watch
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - argoproj.io
  resources:
  - workflows
  verbs:
  - get
  - list
  - watch
  - update
  - patch

Pod log:

Connected to the metadata store
Start watching Kubernetes Pods created by Argo
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 761, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 828, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 765, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 163, in <module>
    for event in pod_stream:
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 857, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 461, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

@akash-gautam

I am getting the same error with kubeflow 1.18 & K8s 1.27.6. Will this be fixed in the next kubeflow release?

@rimolive
Member

We have a possible solution described in a previous comment. Other than that, we need more information on when it happens, along with the KFP backend, KFP SDK, and k8s versions.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label May 24, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@OutSorcerer

/reopen

This still happens. The details are below; I will be glad to share other details if needed.

Kubeflow Pipelines version is 2.2.0, platform agnostic, installed on GKE using the following command:

export PIPELINE_VERSION=2.2.0 && \
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=${PIPELINE_VERSION}" && \
kubectl wait crd/applications.app.k8s.io --for=condition=established --timeout=60s && \
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=${PIPELINE_VERSION}"
NAME                                              READY   STATUS      RESTARTS       AGE
cache-deployer-deployment-65fb47dd94-z7hsd        1/1     Running     0              43h
cache-server-657b5f8474-dk84n                     1/1     Running     0              43h
metadata-envoy-deployment-758c78ccb9-5gxxk        1/1     Running     0              43h
metadata-grpc-deployment-68d6f447cc-8zxpw         1/1     Running     5 (41h ago)    43h
metadata-writer-6bf88bb8c4-x7j6g                  1/1     Running     67 (48m ago)   43h
kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.3-gke.1225000
kubectl -n kubeflow logs -p metadata-writer-6bf88bb8c4-x7j6g
Connected to the metadata store
Start watching Kubernetes Pods created by Argo
Kubernetes Pod event:  ADDED train-gjx4m-system-container-driver-3238221148 15044767
...
Kubernetes Pod event:  ADDED train-zlbh9-system-dag-driver-3975871060 15029533
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 828, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 758, in _update_chunk_length
    line = self._fp.fp.readline()
  File "/usr/local/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.8/ssl.py", line 1274, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.8/ssl.py", line 1132, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 163, in <module>
    for event in pod_stream:
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 857, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 449, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='34.118.224.1', port=443): Read timed out.


@OutSorcerer: You can't reopen an issue/PR unless you authored it or you are a collaborator.


@OutSorcerer

@andre-lx, @chensun, @rimolive, please consider reopening this.

@thesuperzapper
Member

thesuperzapper commented Nov 5, 2024

@kubeflow/pipelines I think this issue is still present, and it's related to a very old bug in the kubernetes Python package, which doesn't handle timeouts on WATCH API calls.

I see that there are many other projects which had the same issue:

Fix - Part 1

We need to update to a newer version of the kubernetes Python package, as we are using version 10.1.0 (from March 2022), which predates the PR that fixed this issue (kubernetes-client/python-base#133):

kubernetes>=8.0.0,<11.0.0
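
For illustration only, a relaxed pin could look like the line below; the exact floor version that actually contains the fix is an assumption here and would need to be verified before changing the real requirements file:

kubernetes>=12.0.0,<31.0.0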

We also probably want to make sure our urllib3 version is updated to 2+, which might require updating Python because we currently use 3.9:

Fix - Part 2

We probably need to implement a retry on ProtocolError (which includes InvalidChunkLength), but I am not sure the best way to do this.

Here are some people talking about that fix:

For reference, here is our code that does the watch:

pod_stream = k8s_watch.stream(
    k8s_api.list_pod_for_all_namespaces,
    label_selector=ARGO_WORKFLOW_LABEL_KEY,
    timeout_seconds=1800,    # Sometimes watch gets stuck
    _request_timeout=2000,   # Sometimes HTTP GET gets stuck
)
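
To make the "retry on ProtocolError" idea concrete, here is a minimal sketch of what a self-healing watch loop could look like. This is only an illustration of the idea, not the actual metadata-writer code: handle_pod_event is a hypothetical placeholder for the writer's existing event handling, and the label key value is an assumption.

import urllib3
from kubernetes import client, config, watch

ARGO_WORKFLOW_LABEL_KEY = 'workflows.argoproj.io/workflow'  # assumption: the Argo label the writer watches

def handle_pod_event(event):
    # Hypothetical placeholder for the writer's existing event-handling logic.
    print(event['type'], event['object'].metadata.name)

config.load_incluster_config()
k8s_api = client.CoreV1Api()
k8s_watch = watch.Watch()

while True:
    try:
        pod_stream = k8s_watch.stream(
            k8s_api.list_pod_for_all_namespaces,
            label_selector=ARGO_WORKFLOW_LABEL_KEY,
            timeout_seconds=1800,    # Sometimes watch gets stuck
            _request_timeout=2000,   # Sometimes HTTP GET gets stuck
        )
        for event in pod_stream:
            handle_pod_event(event)
    except urllib3.exceptions.ProtocolError:
        # Covers InvalidChunkLength: the API server (or an intermediate proxy) dropped the
        # chunked watch response, so re-establish the watch instead of crashing the pod.
        continue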

/reopen

@google-oss-prow google-oss-prow bot reopened this Nov 5, 2024

@thesuperzapper: Reopened this issue.


@github-project-automation github-project-automation bot moved this from Closed to Needs triage in KFP Runtime Triage Nov 5, 2024
@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Nov 5, 2024
@OutSorcerer

OutSorcerer commented Nov 6, 2024

@thesuperzapper, I made a PR that fixes the exception on my installation of Kubeflow Pipelines on GKE; please have a look: #11361

When I significantly reduced the timeouts as an experiment, like this:

pod_stream = k8s_watch.stream(
    k8s_api.list_pod_for_all_namespaces,
    label_selector=ARGO_WORKFLOW_LABEL_KEY,
    timeout_seconds=18,   # Sometimes watch gets stuck
    _request_timeout=20,  # Sometimes HTTP GET gets stuck
)

the errors disappeared (after testing for multiple hours; originally the errors were happening twice per hour), so it seems there is a proxy somewhere on GKE that breaks long-lived connections.

But the fix I propose in the PR is to keep timeout_seconds=1800, reduce _request_timeout to 60, and catch the ReadTimeoutError that occurs on client-side timeouts, because those values are recommended in the documentation of the Kubernetes Python client.
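
A minimal sketch of that approach (reusing k8s_api, ARGO_WORKFLOW_LABEL_KEY, and handle_pod_event from the earlier sketch; this is illustrative only and not necessarily the exact code in #11361):

import urllib3
from kubernetes import watch

k8s_watch = watch.Watch()

while True:
    try:
        pod_stream = k8s_watch.stream(
            k8s_api.list_pod_for_all_namespaces,
            label_selector=ARGO_WORKFLOW_LABEL_KEY,
            timeout_seconds=1800,  # server-side watch timeout
            _request_timeout=60,   # client-side read timeout, as recommended by the client docs
        )
        for event in pod_stream:
            handle_pod_event(event)
    except urllib3.exceptions.ReadTimeoutError:
        # Expected when no data arrives within _request_timeout; simply start a new watch.
        continue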
