Pods stuck in CrashLoopBackoff when restarting custom EKS node. #2852

Closed
ddl-pjohnson opened this issue Mar 18, 2024 · 7 comments

@ddl-pjohnson

What happened:

We have a custom AMI that we deploy to EC2 and connect to an existing EKS cluster. We start and stop this node as needed to save costs. In addition, the instance has state that we want to maintain across restarts, i.e. we don't want a new node on every restart.

Over the past 2 months we've noticed an issue where k8s doesn't restart properly: some pods get stuck in CrashLoopBackOff when they try to connect to other pods or services. DNS resolves to the correct IP address, but packets aren't routed to the other pod correctly. It seems like a race condition where pods start before the network is set up correctly.

The only reliable fix we've found is to delete all pods and let k8s recreate them; this seems to set up the correct iptables rules.
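
To be concrete, the workaround is roughly the following (a sketch; the node name is a placeholder):

# delete every pod scheduled on the affected node; DaemonSet pods (aws-node, kube-proxy)
# are recreated in place, which appears to restore the iptables rules
kubectl delete pods --all-namespaces --field-selector spec.nodeName=<node-name>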

Is there a better way to fix this? It kind of looks like projectcalico/calico#5135, but not sure if the problem is in Calico or AWS.

Environment:

  • Kubernetes version:
    v1.27.10-eks-508b6b3

  • CNI Version
    amazon-k8s-cni:v1.15.1-eksbuild.1

  • OS (e.g: cat /etc/os-release):

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
  • Kernel (e.g. uname -a):
Linux pauljo41250 5.10.197-186.748.amzn2.x86_64 #1 SMP Tue Oct 10 00:30:07 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
@orsenthil
Member

Is there a better way to fix this? It kind of looks like projectcalico/calico#5135, but not sure if the problem is in Calico or AWS.

Do you have both Calico and the VPC CNI installed? Do you know where the specific error message is coming from?
Could you share the status of your aws-node pods and the other pods in the kube-system namespace?

  1. What does the pod log of the CrashLoopBackOff containers say?
  2. What does the IPAMD log say? (See the commands sketched below.)
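
Roughly, something like this should capture what we need (the aws-node pod name is a placeholder; the IPAMD log path assumes the default VPC CNI install):

# pod status across kube-system (aws-node, kube-proxy, coredns, ...)
kubectl get pods -n kube-system -o wide
# VPC CNI pod log for the affected node
kubectl logs -n kube-system <aws-node-pod-name>
# IPAMD log, read on the node itself
sudo tail -n 200 /var/log/aws-routed-eni/ipamd.log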

@ddl-pjohnson
Author

Yep, we have both installed.

The errors are in the pod logs, and it's somewhat random which pods have errors. Usually they are connection refused or timeout errors connecting to the Kube API or other pods, e.g.:
Invalid Kubernetes API v1 endpoint https://172.20.0.1:443/api: Timed out connecting to server

Or connecting to another pod:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='nucleus-frontend', port=80)

The exact errors vary with things like the language used and what they're connecting to. In all cases DNS resolves correctly, but the packets aren't routed to the other pod/service.
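
For what it's worth, this is roughly how we confirm the DNS-works-but-routing-fails symptom from inside an affected pod (pod/service names are just examples from the errors above, and this assumes the image ships nslookup and curl):

kubectl exec <affected-pod> -- nslookup nucleus-frontend           # resolves to the service ClusterIP
kubectl exec <affected-pod> -- curl -m 5 http://nucleus-frontend   # hangs or is refused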

Is there a secure way to send you logs and pod statuses?

@orsenthil
Member

Invalid Kubernetes API v1 endpoint https://172.20.0.1:443/api: Timed out connecting to server

This is a strange error message.
Can you confirm that the API server endpoint matches?

kubectl get endpoints kubernetes -o jsonpath='{.subsets[].addresses[].ip}'

I would expect the API path to be /api/v1 in the error message; I am not sure why it tried to connect at /api.
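
For context, the 172.20.0.1 in that error is normally the ClusterIP of the kubernetes service on EKS, which kube-proxy translates to the real API server addresses via iptables. A quick way to compare (a sketch):

kubectl get svc kubernetes -o jsonpath='{.spec.clusterIP}'    # should print the 172.20.0.1 address seen in the error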

You can follow this troubleshooting doc - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md and send the logs to '[email protected]' for us to investigate.
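
If it is easier, the bundle that doc asks for can usually be generated directly on the node with the support script shipped with the CNI (this assumes the script is still installed at its default location):

sudo bash /opt/cni/bin/aws-cni-support.sh    # collects CNI, IPAMD, iptables and kubelet info into a tarball on the node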

I suspect that kube-proxy wasn't running when this error occurred, but the description of the error itself isn't typical either.
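
A quick way to check that theory the next time it happens (the label selector assumes the default EKS kube-proxy DaemonSet; run the iptables command on the affected node):

kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide    # is kube-proxy Running on that node?
sudo iptables -t nat -L KUBE-SERVICES -n | grep 172.20.0.1       # is there a rule for the kubernetes service ClusterIP?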

@ddl-pjohnson
Author

It's not just the Kubernetes API; it's basically random which services and pods can be reached and which can't. For example, one pod won't be able to connect to our rabbitmq service, while another can reach rabbitmq but can't connect to vault, and so on.

We've worked around this by cordoning and draining the node on startup (sketched below). I'll try tracking down the bundle of logs and sending them through.
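
Concretely, the startup hook is roughly this (a sketch; the node name and timeout are placeholders):

kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=120s
# once aws-node and kube-proxy report Ready again
kubectl uncordon <node-name>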

@orsenthil
Member

We've fixed this by draining/cordoning the node on startup.

Was this node-specific behavior? If yes, perhaps there is something running on the node that is changing iptables. Yes, logs will help.

@orsenthil
Member

Closing this, as the customer was able to resolve this at the node level.

