Pods stuck in `CrashLoopBackoff` when restarting custom EKS node. #2852

ddl-pjohnson · 2024-03-18T22:44:03Z

What happened:

We have a custom AMI that we deploy to EC2 and connect to an existing EKS cluster. We start and stop this node as needed to save costs. In addition the instance has state that we want to maintain across restarts i.e. we don't want to get a new node every restart.

Over the past 2 months we've noticed an issue where k8s doesn't restart properly. some pods get stuck in a CrashLoopBackoff when they try to connect to other pods or services. DNS resolves to the correct IP address, however packets aren't routed to the other pod correctly. It seems like a race condition, where pods start before the network is set up correctly.

The only reliable fix we've found is to delete all pods and let k8s recreate them, this seems to set up the correct iptables rules.

Is there a better way to fix this? It kind of looks like projectcalico/calico#5135, but not sure if the problem is in Calico or AWS.

Environment:

Kubernetes version:
v1.27.10-eks-508b6b3
CNI Version
amazon-k8s-cni:v1.15.1-eksbuild.1
OS (e.g: cat /etc/os-release):

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"

Kernel (e.g. uname -a):

Linux pauljo41250 5.10.197-186.748.amzn2.x86_64 #1 SMP Tue Oct 10 00:30:07 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

The text was updated successfully, but these errors were encountered:

orsenthil · 2024-03-19T15:37:13Z

Is there a better way to fix this? It kind of looks like projectcalico/calico#5135, but not sure if the problem is in Calico or AWS.

Do you have both Calico and VPC CNI? Do you where the specific error message is coming from?
Could you share the status of your pods aws-node and other pods in kube-system namespace?

What does the pod log of the CrashLoopBackOff Containers say?
What does the IPAMD log say ?

ddl-pjohnson · 2024-03-19T18:26:30Z

Yep, we have both installed.

The errors are in pod logs and it's somewhat random what pods have errors. Usually they are connection refused errors connecting to the Kube API or other pods, e.g.:
Invalid Kubernetes API v1 endpoint https://172.20.0.1:443/api: Timed out connecting to server

Or connecting to another pod:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='nucleus-frontend', port=80)

The exact errors vary by things like the language used and what they're connecting to. In all cases DNS works correctly, but the packets aren't routed to the other pod/service.

Is there a secure way to send you logs and pod statuses?

orsenthil · 2024-04-04T19:24:06Z

Invalid Kubernetes API v1 endpoint https://172.20.0.1:443/api: Timed out connecting to server

This is strange error message.
Can you confirm the the API server endpoint match?

kubectl get endpoints kubernetes -o jsonpath='{.subsets[].addresses[].ip}'

I would expect the API path to be /api/v1 in the error message. I am not sure why it tried to connect at /api

You can follow this troubleshooting doc - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md and send the logs to '[email protected]' for us to investigate.

I am suspecting that kube-proxy isn't running when this error occurred, but the description of the error itself isn't typical either.

ddl-pjohnson · 2024-04-04T19:45:44Z

It's not just the kubernetes api, it's basically random what services and pods can be connected and which can't e.g. a pod won't be able to connect to our rabbitmq service, or another one will be able to connect to rabbitmq, but won't connect to vault etc.

We've fixed this by draining/cordoning the node on startup. I'll try tracking down the bundle of logs and sending them through.

orsenthil · 2024-04-04T20:17:58Z

We've fixed this by draining/cordoning the node on startup.

Was this node specific behavior? If yes, perhaps there is some thing it is running on the node that changing iptables. Yes, logs will help.

orsenthil · 2024-04-22T21:12:17Z

Closing this as Cx were able to resolve this at the node level.

github-actions · 2024-04-22T21:12:34Z

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.

ddl-pjohnson added needs investigation question labels Mar 18, 2024

orsenthil closed this as completed Apr 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pods stuck in `CrashLoopBackoff` when restarting custom EKS node. #2852

Pods stuck in `CrashLoopBackoff` when restarting custom EKS node. #2852

ddl-pjohnson commented Mar 18, 2024

orsenthil commented Mar 19, 2024

ddl-pjohnson commented Mar 19, 2024

orsenthil commented Apr 4, 2024

ddl-pjohnson commented Apr 4, 2024

orsenthil commented Apr 4, 2024

orsenthil commented Apr 22, 2024

github-actions bot commented Apr 22, 2024

Pods stuck in CrashLoopBackoff when restarting custom EKS node. #2852

Pods stuck in CrashLoopBackoff when restarting custom EKS node. #2852

Comments

ddl-pjohnson commented Mar 18, 2024

orsenthil commented Mar 19, 2024

ddl-pjohnson commented Mar 19, 2024

orsenthil commented Apr 4, 2024

ddl-pjohnson commented Apr 4, 2024

orsenthil commented Apr 4, 2024

orsenthil commented Apr 22, 2024

github-actions bot commented Apr 22, 2024

Pods stuck in `CrashLoopBackoff` when restarting custom EKS node. #2852

Pods stuck in `CrashLoopBackoff` when restarting custom EKS node. #2852