failed to assign an IP address to container #2749
Comments
@AlissonRS we will take a look at the node logs soon and update you with what we find
@jdn5126 I was taking a look at the logs, under
Could
Kubelet seems fine. From the latest IPAMD logs, I see that all IPs are in use:
So IPAM is trying to increase the IP pool through this chain:
Wait a minute, what do you have
I recently improved the logging in this area, but I think you are hitting the same case as #2650, where
@jdn5126 From the logs, does that mean the subnet is fragmented, thus it cannot allocate a prefix for the node?
Hmm ok, what WARM/MINIMUM env vars do you have set on the daemonset? This does seem to be implying that the subnet the node was launched in is fragmented to the point where we cannot carve out another prefix, but I am still digging.
@jdn5126 it turns out I have
I'll remove
This is the
Hmm... As a side note, our main focus right now is on massively improving this area so you never have to touch these environment variables again or worry about what subnet the node is launched in. #2714 is a start to that.
@jdn5126 I ran this command:
So I'm assuming it has 16 consecutive IPs available. I could also create CIDR reservations, though I'm not exactly sure how that works (e.g. do I just create it and AWS takes care of the rest?). For example, my subnet is 10.0.10.0/24, which means 254 - 5 (reserved for AWS) = 249 usable addresses. So I'd guess that allows 15 blocks of 16 consecutive IP addresses, starting at 10.0.10.4-10.0.10.19, then another one from 10.0.10.20-10.0.10.35, and so on. So I could in theory create 15 CIDR reservations within this subnet. Is my understanding correct? Sorry, I'm not very proficient with networking, but I'm trying to learn more about it.
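For what it's worth, a subnet CIDR reservation of type "prefix" can be created with a single AWS CLI call; the sketch below uses a placeholder subnet ID, and note that /28 blocks must start on a /28 boundary (10.0.10.0, 10.0.10.16, 10.0.10.32, ...), not at an arbitrary address like 10.0.10.4:

```bash
# Reserve an aligned /28 block in the subnet so prefix delegation always has an
# unfragmented block to carve prefixes from. Subnet ID and CIDR are placeholders.
aws ec2 create-subnet-cidr-reservation \
  --subnet-id subnet-0123456789abcdef0 \
  --cidr 10.0.10.16/28 \
  --reservation-type prefix \
  --description "reserved for VPC CNI prefix delegation"
```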
I removed
Right, the subnet does look like it has enough available IPs to carve a prefix from. To answer your question, you don't need to create a CIDR reservation, as that's what IPAMD is doing for you. From the best practices guide (https://aws.github.io/aws-eks-best-practices/networking/prefix-mode/index_linux/), I see it recommending
I cannot figure out why EC2 would be returning
The only thing I can think of to try, since this is a dev environment, is setting a higher
For the comment about nodes being replaced, I assume that is Karpenter packing pods when the
@jdn5126 thanks for helping with this. I readded
I'll keep investigating myself, but please let me know if there is anything I can provide to further help debug. I could also open a ticket on AWS, but I don't have premium support available, so I'm not sure if that's possible.
Hi @AlissonRS, I did some more digging, and here is what I found:
Here is my conclusion from this:
My questions for you are:
@jdn5126 thanks for the detailed output. I'll enable CIDR reservation for the subnets and delete all nodes so the IP addresses are released and not reassigned; Karpenter should recreate the nodes. If that doesn't work, then I'll just recreate the subnets with larger CIDR ranges plus reservations. I'll let you know if that works, otherwise I'll share new logs here. Is there a way for me to find the
As for your question, I'm using prefix delegation to increase the number of pods per EC2 node, as the nodes were being underutilized, e.g. reaching the max number of pods without requesting even half the CPU and memory available, let alone actual consumption, which is even lower than the pod requests for CPU and memory.
The
And got it, using Prefix Delegation to increase pod density makes sense. After #2714 merges, this will be much easier.
hesitant to pile on because i do not want to detract from the OP's issue, so let me know if you prefer a new issue. starting here because the error is the same ("failed to assign an IP address to container") and i was considering moving to prefix mode as a potential solve like the OP, but after reading this thread, other issues, docs, and cni code... not sure it is the right answer for our context (especially if features like #2714 aim to remove the need for it). when updating our node groups we sometimes get "failed to assign an IP address" errors (logs attached). we are nowhere near max pods (234 on our m5.8xlarge instances), subnet exhaustion, or fragmentation (/16 cluster subnet, /20s for each region, dedicated to EKS). from this and other threads i understand prefix mode is primarily about increasing pod density, but i originally looked at it based on various doc comments describing it as a way to improve scheduling delays (fewer EC2 API requests for IPs/ENIs). however, we also run security groups for pods, which based on the pod networking use cases we know is already the worst for density and launch time (a security tradeoff)... we are looking to optimize this as much as possible. the docs state that when running security groups for pods, you can schedule pods of both types (pod security group or node security group), and we have a mix of both on each node. looping across nodes currently i see:
so we are nowhere near max-pods but close enough to pod eni limits (if i understand correctly, this instance type can support 54 branch interfaces, which is to say you can only have 54 pods with security groups per node). what i don't yet understand is if/how pod-eni is used as a scheduling input similar to max-pods. will pods with security groups get scheduled based on max-pods or pod-enis? the concern i am trying to either validate or invalidate is whether we need to reduce max-pods to avoid scheduling workloads on nodes with no remaining branch enis. best practices for pod security groups mention:
which to me could mean we will lose some potential capacity if we don't bump max-pods (to max-pods + max branch enis?), which may imply max-pods is the only scheduling limit considered. but if that were true i would expect blurbs in all the pod security group docs about setting max-pods == max branch enis, or some other guidance. talk me off this ledge. 🙏 considered adjusting various
currently getting cni-metrics-helper going for easier visibility/trending. next step is to plan EKS (1.25) + CNI add-on (1.12.6) upgrades to eliminate some bugs and adjust config... but struggling to find the optimal settings when using security groups for pods. thank you for any guidance!
@deadlysyn so with Security Groups for Pods, each node is limited by the number of branch ENIs it can support, correct. Which instance type corresponds with the
The error in the logs would imply that no more branch ENIs can be allocated for this instance. Branch ENIs are allocated by the VPC Resource Controller, so we would need a support ticket to access the control plane logs. The branch ENI limits come from https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/aws/vpc/limits.go, in case you hadn't seen that yet. As you mentioned, Kubernetes network policy is more scalable when it comes to pod density per node. We still want to investigate which limits you are hitting here, though. For the scheduling decision, as I understand it, the Kubernetes scheduler will select a node to run the pod on, and then the VPC Resource Controller will try to allocate a branch ENI on that node: https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/provider/branch/trunk/trunk.go#L332. If the node has already reached its branch ENI limit, then the allocation will fail. It doesn't look like the Kubernetes scheduler can pick nodes based on which has the largest number of branch ENI slots available. If you can share the instance type and open a support case, I can involve some other teammates and do more digging.
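As a rough way to see how close each node is to its branch ENI limit, assuming Security Groups for Pods is enabled and the VPC Resource Controller is advertising the vpc.amazonaws.com/pod-eni extended resource on the nodes (a sketch, not an official diagnostic):

```bash
# Per-node view of the advertised branch ENI allocatable count next to max pods.
# Assumes nodes advertise the vpc.amazonaws.com/pod-eni extended resource.
kubectl get nodes -o custom-columns='NODE:.metadata.name,POD_ENI:.status.allocatable.vpc\.amazonaws\.com/pod-eni,MAX_PODS:.status.allocatable.pods'
```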
@jdn5126 thank you for your quick response and for obliging my pile-on! currently on basic support so i can't open a case, but the instance type for
thanks for the code pointers and sharing your scheduling knowledge. since we are over-provisioned it now makes sense to me that typical request/limit/affinity/etc scheduling would toss pods at random nodes regardless of branch eni limits and then stabilize once the new worker node becomes available (this always self-heals, it's just a few minutes of downtime for select pods). i suppose low-tech fixes could be:
i looked at cloudwatch this morning, mostly going through kube-scheduler logs. since volume is high, i used a log insights query like (with time period matching attached logs):
i just see a lot of "successfully bound pod to node" and a couple "added node in listed group to nodetree". nothing else. is there a better log to check? thanks again for the work you do!
No worries, happy to help! The Security Groups for Pods solution is great for migrating services that use EC2 security groups, and great for using the same EC2 concept across AWS services. Its main drawback is its impact on node scalability and its lack of integration with the Kubernetes scheduler. For CloudWatch, assuming you have already enabled "Control Plane Logging" from your EKS console, you'll be able to see VPC Resource Controller logs with a query like:
and then you can filter for errors allocating branch ENIs, which should be visible from just:
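As a rough sketch of running such a query from the CLI (the control-plane log group name follows the usual /aws/eks/&lt;cluster-name&gt;/cluster pattern, but the message filter below is an assumption, not the exact filter meant above):

```bash
# Query the EKS control plane log group for branch ENI related messages.
# <cluster-name>, the time window, and the message filter are placeholders/assumptions.
aws logs start-query \
  --log-group-name "/aws/eks/<cluster-name>/cluster" \
  --start-time "$(date -d '-1 hour' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'fields @timestamp, @logStream, @message | filter @message like /branch ENI/ | sort @timestamp desc | limit 100'

# Then fetch the results with the returned query ID:
# aws logs get-query-results --query-id <query-id>
```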
Closing this issue as there are no active threads. Please reopen if there is more information to add.
This issue is now closed. Comments on closed issues are hard for our team to see. |
What happened:
I enabled prefix delegation to get more pods per node following the steps here, but it seems like this is not working well. I'm using Karpenter (not sure if that is relevant); the error message in the pod is below:
I searched for this error and went through other posts (e.g. #1480, #2411), but it doesn't seem like that's my issue. This is a brand new instance, and my subnets are used exclusively for this EKS cluster, from which all nodes are provisioned by Karpenter using the CNI setup with max-pods, so I don't think there could be subnet IP fragmentation (but I'm also not sure how to debug this).
The relevant env vars are set:
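(The env var listing did not survive the copy; for reference, the prefix delegation docs enable the feature with settings along these lines, where the values shown are the documented ones rather than necessarily what this cluster had:)

```bash
# Enable prefix delegation on the aws-node daemonset and keep one spare /28
# prefix warm per ENI. Values are the documented defaults, not this cluster's.
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
kubectl set env daemonset aws-node -n kube-system WARM_PREFIX_TARGET=1
```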
My subnet has enough IPs:
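(One way to check that free IP count, with a placeholder subnet ID:)

```bash
# Show the remaining free IPv4 addresses in the subnet (subnet ID is a placeholder).
aws ec2 describe-subnets \
  --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[].{Id:SubnetId,Cidr:CidrBlock,Free:AvailableIpAddressCount}' \
  --output table
```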
I configured max pods:
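(The actual snippet was not captured; on nodes launched from the standard EKS AMI this is commonly done through the bootstrap script, roughly as sketched below, where the cluster name and value are placeholders and a Karpenter provisioner can pass the same kubelet setting through its own config:)

```bash
# In the node user data: skip the AMI's built-in max-pods calculation and pass
# an explicit value to kubelet. Cluster name and value are placeholders.
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=110'
```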
The max pods calculator suggests the instance type (c6a.large) should have 29 pods, or 110 with prefix delegation:
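(For reference, that calculator is the max-pods-calculator.sh script from the awslabs/amazon-eks-ami repository; an invocation along these lines compares the two modes, with the CNI version shown as a placeholder:)

```bash
# Recommended max pods for c6a.large without and with prefix delegation.
# Assumes max-pods-calculator.sh has been downloaded from the awslabs/amazon-eks-ami repo.
./max-pods-calculator.sh --instance-type c6a.large --cni-version 1.16.0
./max-pods-calculator.sh --instance-type c6a.large --cni-version 1.16.0 --cni-prefix-delegation-enabled
```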
Which is funny, as the node seems to have 32 IPs assigned rather than 29, so it's unclear whether the prefix delegation is working here or not.
I ran /opt/cni/bin/aws-cni-support.sh on the node and sent it via email to [email protected]. I also ran this:
Environment:
- Kubernetes version (use kubectl version):
- OS (e.g. cat /etc/os-release):
- Kernel (e.g. uname -a): Linux ip-10-0-10-34.ec2.internal 5.10.201-191.748.amzn2.x86_64 #1 SMP Mon Nov 27 18:28:14 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux