IPAMD fails to start #1847
Comments
We found that loading the ip_tables, iptable_nat, and iptable_mangle kernel modules fixes the issue. Still trying to figure out why these modules were loaded by default in 8.4 but not in 8.5.
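For reference, a minimal sketch of that workaround on a systemd-based distro (the modules-load.d file name is just an example):
# load the iptables kernel modules immediately
sudo modprobe -a ip_tables iptable_nat iptable_mangle
# make them load on every boot
printf 'ip_tables\niptable_nat\niptable_mangle\n' | sudo tee /etc/modules-load.d/iptables.conf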
We do install
@grumpymatt I have been getting the same issue while setting up EKS on RHEL 8.5, and after loading the kernel modules it works fine. The strange thing is I had tried the same on RHEL 8.0 worker nodes and was still getting the same issue. It works fine on RHEL 7.x, though.
@grumpymatt Since the issue is clearly tied to missing
@vishal0nhce Yeah,
We found an alternative way of fixing it by updating iptables inside the CNI container image.
My concern is that the direction of RHEL and downstream distros seems to be away from iptables-legacy and toward iptables-nft. Are there any plans to address this in the CNI container image?
Interesting. So, RHEL 8 doesn't support
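If it helps anyone hitting this, a quick way to check which backend a node's iptables binary uses (iptables 1.8+ prints it in the version string):
# prints e.g. "iptables v1.8.4 (nf_tables)" or "(legacy)"
iptables --version
# see which of the related kernel modules are currently loaded
lsmod | grep -E 'ip_tables|nf_tables'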
We are seeing a similar situation where
@bilby91 - Can you please check whether kube-proxy is taking time to start? Kube-proxy should set up the rules for aws-node to reach the API server on startup.
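A quick way to check that (labels assume the standard EKS kube-proxy daemonset):
# confirm kube-proxy pods are up on the affected nodes and look for startup errors
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50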
Similar error of IPAMD failing to start with the latest version, v1.11.0. Kube-proxy is already running successfully; the only change was the VPC CNI image update from 1.9.0 to 1.11.0. Any clue what's wrong with the latest version? TIA
{"level":"info","ts":"2022-04-21T19:44:43.569Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
Similar error of IPAMD failing to start with latest version v1.11.0. Kube-proxy is already running successfully.
I was seeing this error. In my case, a developer had manually created VPC endpoints for a few services, including STS, resulting in traffic to the services being blackholed. So
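If you suspect misconfigured VPC endpoints, a rough connectivity check from a worker node (the region in the hostnames is a placeholder):
# verify the regional STS and EC2 endpoints are reachable from the node
curl -sS -m 5 https://sts.us-east-1.amazonaws.com >/dev/null && echo "sts reachable" || echo "sts NOT reachable"
curl -sS -m 5 https://ec2.us-east-1.amazonaws.com >/dev/null && echo "ec2 reachable" || echo "ec2 NOT reachable"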
I am also facing the same issue while trying to upgrade the cluster from 1.19 to 1.20 in EKS. I can't pinpoint the exact problem.
@dhavaln-able and @kathy-lee - So with v1.11.0, is aws-node continuously crashing, or is it coming up after a few restarts? @sahil100122 - You mean that while upgrading, kube-proxy is up and running but ipamd is not starting at all?
I found that FlatCar CoreOS also encounters a related issue: the iptables command in FlatCar CoreOS version 3033.2.0 uses the nftables kernel backend instead of the legacy iptables backend, which means pods belonging to a secondary ENI cannot reach the Kubernetes internal ClusterIP. Thanks for @grumpymatt's workaround; after following the same approach to build a customized amazon-k8s-cni container image, AWS VPC CNI now works on FlatCar CoreOS versions greater than 3033.2.0.
Had the same issue while upgrading, but it was resolved after looking at the troubleshooting guide and patching the daemonset with the following (a sketch of applying the patch with kubectl follows the snippet below):
# New env vars introduced with 1.10.x
- op: add
path: "/spec/template/spec/initContainers/0/env/-"
value: {"name": "ENABLE_IPv6", "value": "false"}
- op: add
path: "/spec/template/spec/containers/0/env/-"
value: {"name": "ENABLE_IPv4", "value": "true"}
- op: add
path: "/spec/template/spec/containers/0/env/-"
value: {"name": "ENABLE_IPv6", "value": "false"} |
I also face the above issue, but in my case I am using a custom kube-proxy image. Why is aws-node ipamd not giving any error related to communication if the issue is with kube-proxy? 🤔
Had a similar issue yesterday. AWS Systems Manager applied a patch to all of our nodes, and this patch required a reboot of the instances. All instances came up healthy, but on three out of five the network was not working, basically making the cluster unusable. Investigation led me to issues like this one and this AWS Knowledge Center entry. Recycling all nodes resolved the issue; I did not try just terminating the aws-node pods. Interestingly, only one out of three clusters was affected, so it's probably difficult to reproduce. What I also noticed: why is aws-node mounting
Hey all 👋🏼 please be aware that this failure mode also happens when the IPs in a subnet are exhausted. I just faced this and noticed I had misconfigured my worker groups to use a small subnet (/26) instead of the bigger one I intended to use (/18).
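A quick way to check for subnet exhaustion (the subnet ID is a placeholder):
# show how many free IPs remain in the subnets used by the worker nodes
aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[].{ID:SubnetId,CIDR:CidrBlock,FreeIPs:AvailableIpAddressCount}' --output table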
Also: check that you have the right security group attached to your nodes.
For those coming here after upgrading EKS, try re-applying the VPC CNI manifest file, for example:
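Something along these lines (the version tag in the URL is an assumption; use the tag matching the CNI version you want):
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.4/config/master/aws-k8s-cni.yaml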
For me, the issue was the IAM policy used by the CNI. My original policy was:
{
"Statement": [
{
"Action": [
"ec2:DescribeTags",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeInstances",
"ec2:DescribeInstanceTypes",
"ec2:AssignIpv6Addresses"
],
"Effect": "Allow",
"Resource": "*",
"Sid": "IPV6"
},
{
"Action": "ec2:CreateTags",
"Effect": "Allow",
"Resource": "arn:aws:ec2:*:*:network-interface/*",
"Sid": "CreateTags"
}
],
"Version": "2012-10-17"
}
I changed the policy like below:
{
"Statement": [
{
"Action": [
"ec2:UnassignPrivateIpAddresses",
"ec2:ModifyNetworkInterfaceAttribute",
"ec2:DetachNetworkInterface",
"ec2:DescribeTags",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeInstances",
"ec2:DescribeInstanceTypes",
"ec2:DeleteNetworkInterface",
"ec2:CreateNetworkInterface",
"ec2:AttachNetworkInterface",
"ec2:AssignPrivateIpAddresses"
],
"Effect": "Allow",
"Resource": "*",
"Sid": "IPV4"
},
{
"Action": [
"ec2:DescribeTags",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeInstances",
"ec2:DescribeInstanceTypes",
"ec2:AssignIpv6Addresses"
],
"Effect": "Allow",
"Resource": "*",
"Sid": "IPV6"
},
{
"Action": "ec2:CreateTags",
"Effect": "Allow",
"Resource": "arn:aws:ec2:*:*:network-interface/*",
"Sid": "CreateTags"
}
],
"Version": "2012-10-17"
}
and it works! 😅
I've had the same problem for the past two weeks; has anyone found a solution?
Can you please share the last few lines of the ipamd logs from before aws-node restarts?
ipamd log: {"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.43.0"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.43.0/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.60.1"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.60.1/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.47.2"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.47.2/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.46.131"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.46.131/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.61.196"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.61.196/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.49.6"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.49.6/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.41.135"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.41.135/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.38.218"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.38.218/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.39.157"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.39.157/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.59.213"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.59.213/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:653","msg":"Reconcile existing ENI eni-00023922abf62516c IP prefixes"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1351","msg":"Found prefix pool count 0 for eni eni-00023922abf62516c\n"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:653","msg":"Successfully Reconciled ENI/IP pool"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1396","msg":"IP pool stats: Total IPs/Prefixes = 87/0, AssignedIPs/CooldownIPs: 31/0, c.maxIPsPerENI = 29"}
command terminated with exit code 137
aws-node:
# kubectl logs -f aws-node-zdp6x --tail 30 -n kube-system
{"level":"info","ts":"2022-10-02T14:56:07.820Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-10-02T14:56:07.821Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-10-02T14:56:07.833Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-10-02T14:56:07.834Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-10-02T14:56:09.841Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:11.847Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:13.853Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:15.860Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:17.866Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:19.872Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:21.878Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:23.884Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:25.890Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:27.897Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:29.903Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:31.909Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:33.916Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:35.922Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:37.928Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:39.934Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:41.940Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:43.947Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:45.953Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:47.959Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:49.966Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"} Event screenshots: I used cluster-autoscaler for auto-scaling, k8s version is 1.22, also following the troubleshooting guide https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#known-issues and applying the suggestion Interestingly, this failure usually only occurs on a certain node, and when I terminate the instance of that node and make it automatically expand again, it starts working. But after running for a while, it will restart again |
@koolwithk Helm chart 1.2.0 has CNI image version 1.12.0; do you mean that with 1.12.0 ipamd is failing to start? Can you please share ipamd.log? Also, are you using the EKS AMI?
@jayanthvn Sorry, I should have stated clearly earlier that in my automation script the image is passed as a hardcoded argument to helm using
After hardcoding the helm chart version to v1.1.21 and the image
No problem, thanks for confirming. Yes, helm chart 1.2.0 with CNI build 1.11.3 will have this issue since the dockershim socket was removed - #2122
Just encountered this error again, but only on a single node; on all other nodes it worked. I manually deleted the node and the new one no longer had the error. Kubernetes v1.22 and VPC CNI v1.11.4 using the official add-on.
@trallnag - Did you get a chance to collect ipamd.log on the impacted node? If so, did it show any errors?
@jayanthvn, nope, I missed doing that. I'll report back should the issue recur. I'm planning to upgrade to v1.12 tomorrow, so maybe I will recycle a few nodes before performing the upgrade to try to reproduce the issue.
It happened in my case because the aws-node daemonset was missing the permissions to manage the IP addresses of nodes and pods. The daemonset uses the K8s service account named aws-node. Solved it by creating an IAM role with AmazonEKS_CNI_Policy and attaching the role to the service account. To attach the role, add an annotation to the service account named aws-node and restart the daemonset.
As mentioned in some answers, it's not good security practice to attach the AmazonEKS_CNI_Policy to the nodes directly; refer to https://aws.github.io/aws-eks-best-practices/networking/vpc-cni/#use-separate-iam-role-for-cni to learn more.
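A minimal sketch of that setup with eksctl (assumes the cluster already has an IAM OIDC provider; the cluster name is a placeholder):
# create an IAM role for the aws-node service account and attach the managed CNI policy
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace kube-system \
  --name aws-node \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy \
  --override-existing-serviceaccounts \
  --approve
# restart the daemonset so the pods pick up the new credentials
kubectl -n kube-system rollout restart daemonset aws-node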
@muhd-aslm in my case all aws-node-* pods are running
but ping is not working
@nehatomar12 can you refer to this document to see whether you have the same issue?
@esidate Thanks! This fixed it for me as well.
This was the solution for me!
I'm also having the problem where ipamd is failing to connect, but I have a different (and reliable) way of reproducing it. EKS 1.23: all I have to do is reboot a worker, and aws-node fails to connect every time.
@joejulian this sounds like a different issue than the originally reported problem. Can you please open a new issue and include more information? What VPC CNI version? What does
I experienced the same issue. In our cluster, the
Just a quick heads up that I noticed the same error as reported here, but it was #2393 - disabling prom-adapter got vpc-cni back on track.
I get a similar error when updating AWS EKS from 1.24 to 1.25 via AWS CDK.
Using this version: amazon-k8s-cni-init:v1.10.1
Updating kube-proxy to a more recent version solved the problem.
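If kube-proxy is managed as an EKS add-on, the update looks roughly like this (cluster name and add-on version are placeholders; pick the version matching your control plane):
# list compatible kube-proxy add-on versions for your cluster version
aws eks describe-addon-versions --addon-name kube-proxy --kubernetes-version 1.25
# update the add-on
aws eks update-addon --cluster-name my-cluster --addon-name kube-proxy \
  --addon-version v1.25.9-eksbuild.1 --resolve-conflicts OVERWRITE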
I let kube-proxy get behind, and updating it solved the issue for me as well.
This also fixed our problem - thanks a million for this hint!
Same here; this is from upgrading from Kubernetes 1.21 -> 1.25 under AWS EKS, where kubectl logs of
had to log in to the aws-node pod manually (from https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md)
find the log file under
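The log lives under /var/log/aws-routed-eni/ on the node (as in the issue description below). A sketch of viewing it through the aws-node pod, where the host log directory is typically mounted under /host (the pod name is a placeholder, and this assumes the container image still ships a shell and tail):
kubectl -n kube-system exec -it aws-node-xxxxx -c aws-node -- tail -n 100 /host/var/log/aws-routed-eni/ipamd.log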
The only thing I did not upgrade in the cluster was
Make sure to go through each of these. I did not have add-ons for anything, so I had to go through the self-managed route, which sucks. I hope there is maybe a way to go from self-managed to add-ons?
@arianitu hi, FYI: there is a way. It is described in this documentation page, for example. I just did it on my clusters, which are deployed with Terraform, and now we can manage add-ons more easily going forward. This is the code I added to Terraform that migrated the self-managed add-ons to EKS add-ons. I had default configurations in these add-ons, but creation was still failing without OVERWRITE, so I added it and then it completed successfully.
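The Terraform itself isn't shown above, but the equivalent idea with the AWS CLI looks roughly like this (cluster and add-on names are placeholders; OVERWRITE lets EKS take over the existing self-managed resources):
# adopt self-managed components as EKS managed add-ons, overwriting the existing config
aws eks create-addon --cluster-name my-cluster --addon-name vpc-cni --resolve-conflicts OVERWRITE
aws eks create-addon --cluster-name my-cluster --addon-name kube-proxy --resolve-conflicts OVERWRITE
aws eks create-addon --cluster-name my-cluster --addon-name coredns --resolve-conflicts OVERWRITE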
It works for me! Thanks a lot!
I'm going to close this issue, as I think its length and story are too tough to follow. It will still be searchable by others in the future, and we can rely on new issues to triage IPAMD errors during startup.
This issue is now closed. Comments on closed issues are hard for our team to see.
What happened:
IPAMD fails to start with an iptables error. The aws-node pods fail to start and prevent worker nodes from becoming Ready.
This occurs after updating to Rocky Linux 8.5, which is based on RHEL 8.5.
/var/log/aws-routed-eni/ipamd.log
POD logs
kubectl logs -n kube-system aws-node-9tqb6
Attach logs
What you expected to happen:
Expect ipamd to start normally.
How to reproduce it (as minimally and precisely as possible):
Deploy an EKS cluster with an AMI based on Rocky 8.5. In theory, any RHEL 8.5-based AMI could have this problem.
Anything else we need to know?:
Running the iptables command from the ipamd log as root on the worker node works fine.
Environment:
OS (e.g. cat /etc/os-release):
NAME="Rocky Linux"
VERSION="8.5 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.5"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.5 (Green Obsidian)"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky Linux"
ROCKY_SUPPORT_PRODUCT_VERSION="8"
Kernel (e.g. uname -a):
Linux ip-10-2--xx-xxx.ec2.xxxxxxxx.com 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Wed Jan 19 17:53:40 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux