Official Canonical Ubuntu EKS ami-based nodes fail to join cluster #3278
Comments
You haven't provided any user data for the node to join the cluster.
@bryantbiggs thanks for the prompt response. So given the comment here, will
I don't know what Canonical's images require; they don't come from EKS. I suspect you'll have to supply the user data yourself. In general, use the EKS optimized AMIs - why does the host OS matter here?
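(For anyone landing here from search: a minimal sketch of what "supply the user data yourself" can look like with this module, assuming the custom AMI ships the standard /etc/eks/bootstrap.sh script the way the EKS-optimized AL2 images do; the AMI id below is a hypothetical placeholder.)
eks_managed_node_groups = {
  one = {
    ami_type       = "CUSTOM"
    ami_id         = "ami-0abcdef1234567890" # hypothetical placeholder
    instance_types = ["t3.small"]

    # With a custom AMI the module cannot know how to bootstrap the node,
    # so you opt in to having the module render bootstrap user data
    enable_bootstrap_user_data = true
  }
}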
I know, but if we're being practical here it doesn't get much more proximal than Canonical's EKS images. I mean, there's documentation on AWS here and documentation on Canonical here about building EKS with Ubuntu, but it doesn't discuss anything about terraform, hence opening my issue in this repo. And naturally I understand folks don't have time to go running around listing the user data approach for every AMI under the sun, but Canonical is probably what, 87% of custom use cases? Anyway, it seems that after many hours of looking into this, the work I needed to do to open this issue may get me to where I need to be, idk.
This is probably a valid point. I recently had some extremely difficult experiences running ML work on AL hosts and it's made me very sensitive to using AL at all, but theoretically what you're saying here is true in an ideal world: if I never need to do much on the host, then why should the host matter? Maybe. Right now, where I'm at is that in my opinion AL is not as well maintained as the Canonical and RHEL variants, so I'd prefer not to bring it into production, though I guess there are 1M customers using AL EKS without issue. But documentation is more scarce for AL in general, and especially in the ML space, so if anything needs to happen on the host (I get it, it's Kubernetes, so the host is "probably" less important), it can become a real pain. I would think this module should definitely call out that you can't expect to just drop in a custom AMI and have it work.
Actually, it is significantly less than 5% (disclosure: I work at AWS on EKS)
I also focus on ML workloads on EKS - we have:
If you are wanting to utilize NVIDIA GPUs - these come with the correct NVIDIA driver, the NVIDIA container toolkit, etc., as well as things like EFA pre-installed to work out of the box. For the AL2023 variant, all you need to provide are the device plugins (the NVIDIA device plugin, plus the EFA device plugin if using EFA). The Bottlerocket variants have the NVIDIA device plugin baked into their AMIs. We also have the equivalent for Neuron devices if wanting to use Inferentia or Trainium instances:
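(A minimal sketch of selecting one of those accelerated, EKS-optimized variants through this module; AL2023_x86_64_NVIDIA is the AMI type EKS publishes for the AL2023 NVIDIA variant, and the instance type is just an example.)
eks_managed_node_groups = {
  gpu = {
    # EKS-optimized accelerated AL2023: NVIDIA driver, container toolkit,
    # and EFA support come pre-installed; you still deploy the device plugins
    ami_type       = "AL2023_x86_64_NVIDIA"
    instance_types = ["g5.xlarge"]
  }
}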
For more info on the AL2023 accelerated variants, you can refer to the launch blog I authored when we first released those late last year https://aws.amazon.com/blogs/containers/amazon-eks-optimized-amazon-linux-2023-accelerated-amis-now-available/
I would refer you to our user data documentation where this is documented https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/user_data.md
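(From that document, the inputs most relevant to a custom AMI are roughly the following; pre_bootstrap_user_data, post_bootstrap_user_data, and user_data_template_path are module inputs, the AMI id and template path are hypothetical placeholders, and this is a sketch rather than a tested configuration.)
eks_managed_node_groups = {
  one = {
    ami_type = "CUSTOM"
    ami_id   = "ami-0abcdef1234567890" # hypothetical placeholder

    # Opt in to the module rendering bootstrap user data for the custom AMI
    enable_bootstrap_user_data = true

    # Shell snippets injected before/after the bootstrap invocation
    pre_bootstrap_user_data = <<-EOT
      echo "runs before /etc/eks/bootstrap.sh"
    EOT
    post_bootstrap_user_data = <<-EOT
      echo "runs after /etc/eks/bootstrap.sh"
    EOT

    # Or take over rendering entirely with your own template:
    # user_data_template_path = "${path.module}/templates/user-data.tpl"
  }
}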
What started as a "how to use CUSTOM_AMI" has now moved closer to the root issue (this is why details and motivational context are important) - what issues are you encountering with your ML workload or in general with Amazon Linux? In AWS, Amazon Linux dominates OS usage well above any Ubuntu or RHEL usage, and we have an entire team dedicated not only to the OS but also contributing to the upstream kernel and various projects. In short - it is very well maintained and will always be the recommendation within AWS (or Bottlerocket, which is built off what the AL team provides in terms of components). There is also the important detail that when you use what is provided by a managed service offering such as AWS and/or EKS, you get support when issues arise. When you go your own way and opt out of what's provided, the teams cannot support that (as is the case with an Ubuntu AMI - neither support nor service teams will be able to assist with issues there)
I mean custom use cases outside of AL. I'm willing to bet Ubuntu is in the lead. But whatever the case, it's a core OS, so it's extremely reasonable to expect someone to want to launch a custom AMI with it.
The issue I had was not with an AL AMI; it was with an AL container runtime for AWS Lambda Docker. I found that a common utility I needed to run is not offered as part of the
Suggestion: there should be a callout in this repo's main module page explicitly associating running a custom AMI with handling user data. I'm glad that the documentation exists, fundamentally, but it would be helpful if there was a sentence on the main module page that says "If you want to run a custom AMI you need to read this page." There is a comment about seeing the "AWS documentation" if you want to run a custom AMI, but I don't think that page would in any way make someone think "oh, I need to go read the user data page back in the module documentation" - though I could be wrong about that. If it does, then ignore me.
Description
The code example here, which instantiates the module developed in this repo, works fine as is. However, when I change the node group's ami_type to "CUSTOM" and add an ami_id to the custom node group (as in the reproduction code below), nodes are spun up, live for a while, and pass EC2 health checks, but they ultimately fail to join the EKS cluster. They do continue to persist as functional EC2 instances, my implication being that, as far as the VMs themselves go, they are fundamentally healthy, basically eliminating the non-Kubernetes potential gotchas.
Anyway, I have looked at the other issues and of course found this one, which was only created last week, and I have left a comment for OP as he, in some roundabout way, may be able to solve my issue/answer my question in this comment here. Regardless, it is pretty unclear how to use a custom AMI with this module. The internet seems to imply, via the parent aws_eks_cluster and aws_eks_managed_node_group resources, that one needs to use a custom launch template, etc., which is possibly why the gentleman from the issue I linked to just above was using a custom launch template in his terraform; that's certainly how I ended up there. I was also able to find this comment in this module's source code and it may lead me to a resolution, but I'm wary of continuing much further without just straight up asking for some clarification.
A couple things to get out of the way, as they will be the obvious process-of-elimination questions: the ami_id attribute given in the eks_managed_node_groups blocks is valid; it fully applies, and the EC2 instances themselves work.
I'm going to try launching a cluster with the approach discussed in the comment from this module's source code that I linked to above, but regardless, whatever the solution is, it would be great if a change could be published to the documentation explaining how to achieve this, because the module input documentation nowhere exposes an ami_id parameter by which one would supply a custom AMI id. That makes it seem like it may be impossible to launch a custom AMI, despite the AWS documentation saying to use the keyword "CUSTOM" and the aws_eks_managed_node_group parent resource exposing an ami_id attribute. Thanks!
Versions
Module version [Required]: 20.8.5
Terraform version: Terraform v1.10.4
Provider version(s):
Reproduction Code [Required]
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.8.1"
name = "education-vpc"
cidr = "10.0.0.0/16"
azs = slice(data.aws_availability_zones.available.names, 0, 3)
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]
enable_nat_gateway = true
single_nat_gateway = true
enable_dns_hostnames = true
public_subnet_tags = {
"kubernetes.io/role/elb" = 1
}
private_subnet_tags = {
"kubernetes.io/role/internal-elb" = 1
}
}
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "20.8.5"
cluster_name = local.cluster_name
cluster_version = "1.29"
cluster_endpoint_public_access = true
enable_cluster_creator_admin_permissions = true
cluster_addons = {
aws-ebs-csi-driver = {
service_account_role_arn = module.irsa-ebs-csi.iam_role_arn
}
}
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
eks_managed_node_group_defaults = {
ami_type = "CUSTOM"
}
eks_managed_node_groups = {
one = {
name = "node-group-1"
ami_id = "ami-065b49d435df033f6"
instance_types = ["t3.small"]
}
}
data "aws_iam_policy" "ebs_csi_policy" {
arn = "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
}
module "irsa-ebs-csi" {
source = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
version = "5.39.0"
create_role = true
role_name = "AmazonEKSTFEBSCSIRole-${module.eks.cluster_name}"
provider_url = module.eks.oidc_provider
role_policy_arns = [data.aws_iam_policy.ebs_csi_policy.arn]
oidc_fully_qualified_subjects = ["system:serviceaccount:kube-system:ebs-csi-controller-sa"]
}
Expected behavior
Node joins the cluster.
Actual behavior
Node fails to join the cluster.