
Upgrading from v1.16.0-eksbuild.1 to v1.17 or v1.18 results in failure to assign IP address to container #2872

Closed
jdinsel-xealth opened this issue Apr 11, 2024 · 9 comments

@jdinsel-xealth

What happened:

Upgrading from v1.16.0 to v1.17 or higher results in scheduled pods that cannot obtain an IP address. Downgrading back to v1.16.0 restores functionality. While in this state, the EKS cluster also fails to scale out.

```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3a820c12790041f3e7e75e6a969a6c3e9fad7f9398fcc10b349ee17690e53b89": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
```

Attach logs

What you expected to happen:

Pods should be assigned IP addresses, or the cluster should scale out and then be able to assign IP addresses to pods.

How to reproduce it (as minimally and precisely as possible):

On an EKS cluster running EKS 1.28, upgrade the VPC CNI add-on from v1.16 to v1.17 or v1.18. It may be necessary to add additional pods, but at some point, a pod will be assigned to an existing node and will sit in a pending state because aws-cni could not assign an IP address to the container.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.28
  • CNI Version: v1.17.1 or v1.18.0
  • OS (e.g: cat /etc/os-release): Bottlerocket
  • Kernel (e.g. uname -a): bottlerocket-aws-k8s-1.28-x86_64-v1.19.2-29cc92cc
@jdinsel-xealth jdinsel-xealth changed the title Upgrading from v1.16.0-eksbuild.1 to v1.17 or v1.18 results in Upgrading from v1.16.0-eksbuild.1 to v1.17 or v1.18 results in failure to assign IP address to container Apr 11, 2024
@orsenthil
Member

Do you see any errors in the ipamd.log that stand out?

Can you run the node log collector (/opt/cni/bin/aws-cni-support.sh, or the script at https://github.com/awslabs/amazon-eks-ami/tree/master/log-collector-script/linux) against the nodes and share the logs with us?

@jdinsel-xealth
Author

I'm having trouble gathering the logs. I walked through some steps when we were on v1.16.0 to get a feel for what was available. I found that I could access the ipamd.log in the aws-node when v1.16.0 was running, but could not get a shell when v1.18.0 was running. We're also using bottlerocket as the AMI and the steps in the linked collector resulted in more errors of missing commands than useful output. I was connected to the node where the IP could not be allocated and was unable to find information with the script or manually. Do you have any guidance on what I could do differently?

@orsenthil
Member

I tried to reproduce this issue using a new cluster and a Bottlerocket image, but I could not.

  1. Setup 1.29 cluster with bottlerocket using https://github.com/eksctl-io/eksctl/blob/main/examples/20-bottlerocket.yaml
  2. Tested with CNI 1.16 - Scaled up pods, works
  3. Updated CNI to 1.17 - Scaled up pods, created new pods. Works.

To look into your Bottlerocket logs:

Log in to your instance, run sudo sheltie, and inspect the logs under /var/log/aws-route-eni/. You can also connect to the instance with SSM and follow the prompts.

```
To permit more intrusive troubleshooting, including actions that mutate the
running state of the Bottlerocket host, we provide a tool called "sheltie"
(`sudo sheltie`).  When run, this tool drops you into a root shell in the
Bottlerocket host's root filesystem.
[ec2-user@admin]$ sudo sheltie
```

@orsenthil
Member

This failure (add cmd: failed to assign an IP address to container) can happen if no free IP addresses are available on your instance. Ensure that you have sufficient IPs. The ipamd.log also reports how many addresses are available and how many are assigned.
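The per-node capacity behind this error is bounded by the instance type's ENI and IPv4 limits. As a rough illustration (this is not the CNI's actual code; the numbers are the published limits for an m5.large), the standard EKS max-pods calculation looks like:

```python
# Sketch of the standard EKS max-pods formula:
#   max_pods = ENIs * (IPv4 addresses per ENI - 1) + 2
# One IP per ENI is the ENI's own primary address and is never handed to a
# pod; the +2 accounts for host-networking pods such as aws-node and kube-proxy.
def max_pods(max_enis: int, ipv4_per_eni: int) -> int:
    return max_enis * (ipv4_per_eni - 1) + 2

# m5.large supports 3 ENIs with 10 IPv4 addresses each:
print(max_pods(3, 10))  # -> 29, matching eni-max-pods.txt for m5.large
```

Once every secondary IP across every attached ENI is assigned, any further pod scheduled to that node fails with the error above.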

@jdinsel-xealth
Author

Thanks, @orsenthil, for your guidance. I have reproduced the issue and submitted the ipamd.log to the triage email. There are messages logged that the ENI on the node "does not have available addresses" and "IP address pool stats: total 18, assigned 18" as well as "IP pool is too low: available (0) < ENI target (1) * addrsPerENI (9)" ... yet the cluster did not scale to create another node.
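The "IP pool is too low" line reflects ipamd's warm-pool check. A minimal sketch of that comparison, using the numbers from the log above (illustrative logic only, not the actual ipamd source):

```python
# ipamd logs "IP pool is too low" when the count of free IPs falls below what
# one warm ENI would provide: available < WARM_ENI_TARGET * addrsPerENI.
def pool_too_low(available: int, warm_eni_target: int, addrs_per_eni: int) -> bool:
    return available < warm_eni_target * addrs_per_eni

# From the log: total 18, assigned 18 -> 0 available; ENI target 1; addrsPerENI 9
print(pool_too_low(18 - 18, 1, 9))  # -> True
```

When this check trips, ipamd tries to attach another ENI; if the instance is already at its ENI limit, no further IPs can be assigned on that node, and relieving the pressure requires the cluster autoscaler to add a node.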

@orsenthil
Member

Do you mean that an additional ENI wasn't created, or that the cluster didn't scale out another node? If the latter, that is auto-scaling functionality, not networking.

@jdinsel-xealth
Author

All that changed is the VPC CNI driver. On v1.16, the cluster scales out when nodes reach their limits; on later versions it does not, and we see the error:

```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3a820c12790041f3e7e75e6a969a6c3e9fad7f9398fcc10b349ee17690e53b89": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
```

@jdinsel-xealth
Author

I will close this, as I see no evidence that the VPC CNI is misbehaving. There was an inability to assign an IP, but I believe that is because the nodes were oversubscribed and the cluster autoscaler did not add a new node. I also found a discrepancy between the number of EC2 instances in the node group and those visible in Kubernetes.


This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
