IPAMD fails to start #1847
Comments
We found that loading the ip_tables, iptable_nat, and iptable_mangle kernel modules fixes the issue. Still trying to figure out why these modules were loaded by default in 8.4 but not in 8.5.
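For reference, a minimal sketch of that workaround on a systemd-based distro (the modules-load.d file name is just an example):
# load the iptables kernel modules immediately
sudo modprobe -a ip_tables iptable_nat iptable_mangle
# make them load on every boot
printf 'ip_tables\niptable_nat\niptable_mangle\n' | sudo tee /etc/modules-load.d/iptables.conf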
We do install
@grumpymatt I have been getting the same issue while setting up EKS on RHEL 8.5, and after loading the kernel modules it works fine. The strange thing is I had tried the same on RHEL 8.0 worker nodes and was still getting the same issue. It works fine on RHEL 7.x, though.
@grumpymatt Since the issue is clearly tied to missing
@vishal0nhce Yeah,
We found an alternative way of fixing it by updating iptables inside the CNI container image.
My concern is that the direction of RHEL and downstream distros seems to be away from iptables-legacy and toward iptables-nft. Are there any plans to address this in the CNI container image?
Interesting. So, RHEL 8 doesn't support
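If it helps anyone hitting this, a quick way to check which backend a node's iptables binary uses (iptables 1.8+ prints it in the version string):
# prints e.g. "iptables v1.8.4 (nf_tables)" or "(legacy)"
iptables --version
# see which of the related kernel modules are currently loaded
lsmod | grep -E 'ip_tables|nf_tables'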
We are seeing a similar situation where
@bilby91 - Can you please check whether kube-proxy is taking time to start? Kube-proxy should set up the rules for aws-node to reach the API server on startup.
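A quick way to check that (labels assume the standard EKS kube-proxy daemonset):
# confirm kube-proxy pods are up on the affected nodes and look for startup errors
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50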
Similar error of IPAMD failing to start with the latest version, v1.11.0. Kube-proxy is already running successfully; the only change was the VPC CNI image update from 1.9.0 to 1.11.0. Any clue what's wrong with the latest version? TIA
{"level":"info","ts":"2022-04-21T19:44:43.569Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
Similar error of IPAMD failing to start with latest version v1.11.0. Kube-proxy is already running successfully.
I was seeing this error. In my case, a developer had manually created VPC endpoints for a few services, including STS, resulting in traffic to the services being blackholed. So
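If you suspect misconfigured VPC endpoints, a rough connectivity check from a worker node (the region in the hostnames is a placeholder):
# verify the regional STS and EC2 endpoints are reachable from the node
curl -sS -m 5 https://sts.us-east-1.amazonaws.com >/dev/null && echo "sts reachable" || echo "sts NOT reachable"
curl -sS -m 5 https://ec2.us-east-1.amazonaws.com >/dev/null && echo "ec2 reachable" || echo "ec2 NOT reachable"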
I am also facing the same issue while trying to upgrade the cluster from 1.19 to 1.20 in EKS. I can't pinpoint the exact problem.
@dhavaln-able and @kathy-lee - So with v1.11.0, is aws-node continuously crashing, or is it coming up after a few restarts? @sahil100122 - You mean that while upgrading, kube-proxy is up and running but ipamd is not starting at all?
I found that FlatCar CoreOS also encounters a related issue: the iptables command in FlatCar CoreOS version 3033.2.0 uses the nftables kernel backend instead of the legacy iptables backend, which means pods belonging to a secondary ENI cannot reach the Kubernetes internal ClusterIP. Thanks for @grumpymatt's workaround; after following the same approach to build a customized amazon-k8s-cni container image, AWS VPC CNI now works on FlatCar CoreOS versions greater than 3033.2.0.
Had the same issue while upgrading, but it was resolved after looking at the troubleshooting guide and patching the daemonset with the following (a sketch of applying the patch with kubectl follows the snippet below):
# New env vars introduced with 1.10.x
- op: add
path: "/spec/template/spec/initContainers/0/env/-"
value: {"name": "ENABLE_IPv6", "value": "false"}
- op: add
path: "/spec/template/spec/containers/0/env/-"
value: {"name": "ENABLE_IPv4", "value": "true"}
- op: add
path: "/spec/template/spec/containers/0/env/-"
value: {"name": "ENABLE_IPv6", "value": "false"} |
I also face the above issue, but in my case I am using a custom kube-proxy image. Why is aws-node ipamd not giving any error related to communication if the issue is with kube-proxy? 🤔
Had a similar issue yesterday. AWS Systems Manager applied a patch to all of our nodes, and this patch required a reboot of the instances. All instances came up healthy, but on three out of five the network was not working, basically making the cluster unusable. Investigation led me to issues like this one and this AWS Knowledge Center entry. Recycling all nodes resolved the issue; I did not try just terminating the aws-node pods. Interestingly, only one out of three clusters was affected, so it's probably difficult to reproduce. What I also noticed: why is aws-node mounting
Hey all 👋🏼 please be aware that this failure mode also happens when the IPs in a subnet are exhausted. I just faced this and noticed I had misconfigured my worker groups to use a small subnet (/26) instead of the bigger one I intended to use (/18).
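A quick way to check for subnet exhaustion (the subnet ID is a placeholder):
# show how many free IPs remain in the subnets used by the worker nodes
aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[].{ID:SubnetId,CIDR:CidrBlock,FreeIPs:AvailableIpAddressCount}' --output table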
Also: check that you have the right security group attached to your nodes.
For those coming here after upgrading EKS, try re-applying the VPC CNI manifest file, for example:
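Something along these lines (the version tag in the URL is an assumption; use the tag matching the CNI version you want):
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.4/config/master/aws-k8s-cni.yaml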
For me, the issue was the IAM policy used by the CNI. My original policy was:
{
"Statement": [
{
"Action": [
"ec2:DescribeTags",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeInstances",
"ec2:DescribeInstanceTypes",
"ec2:AssignIpv6Addresses"
],
"Effect": "Allow",
"Resource": "*",
"Sid": "IPV6"
},
{
"Action": "ec2:CreateTags",
"Effect": "Allow",
"Resource": "arn:aws:ec2:*:*:network-interface/*",
"Sid": "CreateTags"
}
],
"Version": "2012-10-17"
}
I changed the policy like below:
{
"Statement": [
{
"Action": [
"ec2:UnassignPrivateIpAddresses",
"ec2:ModifyNetworkInterfaceAttribute",
"ec2:DetachNetworkInterface",
"ec2:DescribeTags",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeInstances",
"ec2:DescribeInstanceTypes",
"ec2:DeleteNetworkInterface",
"ec2:CreateNetworkInterface",
"ec2:AttachNetworkInterface",
"ec2:AssignPrivateIpAddresses"
],
"Effect": "Allow",
"Resource": "*",
"Sid": "IPV4"
},
{
"Action": [
"ec2:DescribeTags",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeInstances",
"ec2:DescribeInstanceTypes",
"ec2:AssignIpv6Addresses"
],
"Effect": "Allow",
"Resource": "*",
"Sid": "IPV6"
},
{
"Action": "ec2:CreateTags",
"Effect": "Allow",
"Resource": "arn:aws:ec2:*:*:network-interface/*",
"Sid": "CreateTags"
}
],
"Version": "2012-10-17"
}
and it works! 😅
I've had the same problem for the past two weeks; has anyone found a solution?
Can you please share the last few lines of the ipamd logs from before aws-node restarts?
ipamd log: {"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.43.0"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.43.0/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.60.1"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.60.1/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.47.2"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.47.2/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.46.131"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.46.131/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.61.196"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.61.196/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.49.6"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.49.6/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.41.135"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.41.135/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.38.218"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.38.218/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.39.157"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.39.157/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.59.213"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.59.213/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:653","msg":"Reconcile existing ENI eni-00023922abf62516c IP prefixes"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1351","msg":"Found prefix pool count 0 for eni eni-00023922abf62516c\n"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:653","msg":"Successfully Reconciled ENI/IP pool"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1396","msg":"IP pool stats: Total IPs/Prefixes = 87/0, AssignedIPs/CooldownIPs: 31/0, c.maxIPsPerENI = 29"}
command terminated with exit code 137
aws-node:
# kubectl logs -f aws-node-zdp6x --tail 30 -n kube-system
{"level":"info","ts":"2022-10-02T14:56:07.820Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-10-02T14:56:07.821Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-10-02T14:56:07.833Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-10-02T14:56:07.834Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-10-02T14:56:09.841Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:11.847Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:13.853Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:15.860Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:17.866Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:19.872Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:21.878Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:23.884Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:25.890Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:27.897Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:29.903Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:31.909Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:33.916Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:35.922Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:37.928Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:39.934Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:41.940Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:43.947Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:45.953Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:47.959Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:49.966Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"} Event screenshots: I used cluster-autoscaler for auto-scaling, k8s version is 1.22, also following the troubleshooting guide https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#known-issues and applying the suggestion Interestingly, this failure usually only occurs on a certain node, and when I terminate the instance of that node and make it automatically expand again, it starts working. But after running for a while, it will restart again |
@koolwithk Helm chart 1.2.0 has CNI image version 1.12.0; do you mean that with 1.12.0 ipamd is failing to start? Can you please share ipamd.log? Also, are you using the EKS AMI?
@jayanthvn Sorry, I should have stated clearly earlier that in my automation script the image is passed as a hardcoded argument to helm using
After hardcoding the helm chart version to v1.1.21 and the image
No problem, thanks for confirming. Yes, helm chart 1.2.0 with CNI build 1.11.3 will have this issue since the dockershim socket was removed - #2122
Just encountered this error again, but only on a single node; on all other nodes it worked. I manually deleted the node and the new one no longer had the error. Kubernetes v1.22 and VPC CNI v1.11.4 using the official add-on.
@trallnag - Did you get a chance to collect ipamd.log on the impacted node? If so, did it show any errors?
@jayanthvn, nope, I missed doing that. I'll report back should the issue recur. I'm planning to upgrade to v1.12 tomorrow, so maybe I will recycle a few nodes before performing the upgrade to try to reproduce the issue.
It happened in my case because the aws-node daemonset was missing the permissions to manage the IP addresses of nodes and pods. The daemonset uses the K8s service account named aws-node. Solved it by creating an IAM role with AmazonEKS_CNI_Policy and attaching the role to the service account. To attach the role, add an annotation to the service account named aws-node and restart the daemonset.
As mentioned in some answers, it's not good security practice to attach the AmazonEKS_CNI_Policy to the nodes directly; refer to https://aws.github.io/aws-eks-best-practices/networking/vpc-cni/#use-separate-iam-role-for-cni to learn more.
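A minimal sketch of that setup with eksctl (assumes the cluster already has an IAM OIDC provider; the cluster name is a placeholder):
# create an IAM role for the aws-node service account and attach the managed CNI policy
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace kube-system \
  --name aws-node \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy \
  --override-existing-serviceaccounts \
  --approve
# restart the daemonset so the pods pick up the new credentials
kubectl -n kube-system rollout restart daemonset aws-node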
@muhd-aslm in my case all aws-node-* pods are running
but ping is not working
@nehatomar12 can you refer to this document to see whether you have the same issue?
@esidate Thanks! This fixed it for me as well.
This was the solution for me!
I'm also having the problem where ipamd is failing to connect, but I have a different (and reliable) way of reproducing it. EKS 1.23: all I have to do is reboot a worker, and aws-node fails to connect every time.
@joejulian this sounds like a different issue than the originally reported problem. Can you please open a new issue and include more information? What VPC CNI version? What does
I experienced the same issue. In our cluster, the
Just a quick heads up that I noticed the same error as reported here, but it was #2393 - disabling prom-adapter got vpc-cni back on track.
I get a similar error when updating AWS EKS from 1.24 to 1.25 via AWS CDK.
Using this version: amazon-k8s-cni-init:v1.10.1
Updating kube-proxy to a more recent version solved the problem.
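If kube-proxy is managed as an EKS add-on, the update looks roughly like this (cluster name and add-on version are placeholders; pick the version matching your control plane):
# list compatible kube-proxy add-on versions for your cluster version
aws eks describe-addon-versions --addon-name kube-proxy --kubernetes-version 1.25
# update the add-on
aws eks update-addon --cluster-name my-cluster --addon-name kube-proxy \
  --addon-version v1.25.9-eksbuild.1 --resolve-conflicts OVERWRITE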
I let kube-proxy get behind, and updating it solved the issue for me as well.
This also fixed our problem - thanks a million for this hint!
Same here; this is from upgrading from Kubernetes 1.21 -> 1.25 under AWS EKS, where kubectl logs of
had to log in to the aws-node pod manually (from https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md)
find the log file under
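The log lives under /var/log/aws-routed-eni/ on the node (as in the issue description below). A sketch of viewing it through the aws-node pod, where the host log directory is typically mounted under /host (the pod name is a placeholder, and this assumes the container image still ships a shell and tail):
kubectl -n kube-system exec -it aws-node-xxxxx -c aws-node -- tail -n 100 /host/var/log/aws-routed-eni/ipamd.log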
The only thing I did not upgrade in the cluster was
Make sure to go through each of these. I did not have add-ons for anything, so I had to go through the self-managed route, which sucks. I hope there is maybe a way to go from self-managed to add-ons?
@arianitu hi, FYI: there is a way. It is described in this documentation page, for example. I just did it on my clusters, which are deployed with Terraform, and now we can manage add-ons more easily going forward. This is the code I added to Terraform that migrated the self-managed add-ons to EKS add-ons. I had default configurations in these add-ons, but creation was still failing without OVERWRITE, so I added it and then it completed successfully.
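The Terraform itself isn't shown above, but the equivalent idea with the AWS CLI looks roughly like this (cluster and add-on names are placeholders; OVERWRITE lets EKS take over the existing self-managed resources):
# adopt self-managed components as EKS managed add-ons, overwriting the existing config
aws eks create-addon --cluster-name my-cluster --addon-name vpc-cni --resolve-conflicts OVERWRITE
aws eks create-addon --cluster-name my-cluster --addon-name kube-proxy --resolve-conflicts OVERWRITE
aws eks create-addon --cluster-name my-cluster --addon-name coredns --resolve-conflicts OVERWRITE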
It works for me! Thanks a lot!
I'm going to close this issue, as I think its length and story are too tough to follow. It will still be searchable by others in the future, and we can rely on new issues to triage IPAMD errors during startup.
This issue is now closed. Comments on closed issues are hard for our team to see.
What happened:
IPAMD fails to start with an iptables error. The aws-node pods fail to start and prevent worker nodes from becoming Ready.
This occurs after updating to Rocky Linux 8.5, which is based on RHEL 8.5.
/var/log/aws-routed-eni/ipamd.log
POD logs
kubectl logs -n kube-system aws-node-9tqb6
Attach logs
What you expected to happen:
Expect ipamd to start normally.
How to reproduce it (as minimally and precisely as possible):
Deploy an EKS cluster with an AMI based on Rocky 8.5. In theory, any RHEL 8.5-based AMI could have this problem.
Anything else we need to know?:
Running the iptables command from the ipamd log as root on the worker node works fine.
Environment:
OS (e.g. cat /etc/os-release):
NAME="Rocky Linux"
VERSION="8.5 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.5"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.5 (Green Obsidian)"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky Linux"
ROCKY_SUPPORT_PRODUCT_VERSION="8"
Kernel (e.g. uname -a):
Linux ip-10-2--xx-xxx.ec2.xxxxxxxx.com 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Wed Jan 19 17:53:40 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux