
Official Canonical Ubuntu EKS ami-based nodes fail to join cluster #3278

Closed

jonassteinberg1 opened this issue Jan 17, 2025 · 6 comments

@jonassteinberg1

jonassteinberg1 commented Jan 17, 2025

Description

The code example here, which instantiates the module developed in this repo, works fine as is; however, when I change:

eks_managed_node_group_defaults = {
  ami_type = "AL2_x86_64"
}

to

eks_managed_node_group_defaults = {
  ami_type = "CUSTOM"
}

and add an ami_id to each node group like so:

eks_managed_node_groups = {
  one = {
    name           = "node-group-1"
    ami_id         = "ami-065b49d435df033f6"
    instance_types = ["t3.small"]

    min_size     = 1
    max_size     = 3
    desired_size = 2
  }

  two = {
    name           = "node-group-2"
    ami_id         = "ami-065b49d435df033f6"
    instance_types = ["t3.small"]

    min_size     = 1
    max_size     = 2
    desired_size = 1
  }
}

nodes are spun up, actually live for a while, and pass EC2 health checks, but they ultimately fail to join the EKS cluster. They do continue to persist as functional EC2 instances, which implies that, as far as the VMs themselves go, they are fundamentally healthy - basically eliminating the non-Kubernetes potential gotchas.

Anyway, I have looked through the other issues and of course found this one, which was only created last week, and I have left a comment for the OP, since he may in some roundabout way be able to solve my issue/answer my question in this comment here. Regardless, it is pretty unclear how to use a custom AMI with this module. The internet seems to imply, via the parent aws_eks_cluster and aws_eks_managed_node_group modules, that one needs to use a custom launch template, etc., which is possibly why the gentleman from the issue I linked to just above was using a custom launch template in his Terraform; that's certainly how I ended up there. I was also able to find this comment in this module's source code, and it may lead me to a resolution, but I'm wary of continuing much further without just straight up asking for clarification.

A couple of things to get out of the way, since they will be the obvious process-of-elimination questions:

  1. Yes, the Terraform fully applies with the ami_id attribute given in the eks_managed_node_groups blocks, and the EC2 instances themselves work.
  2. Yes, the Ubuntu AMI ID I have given is specifically from Canonical's EKS AMI offering. This is not a custom AMI I made, and I have done nothing to it except supply the AMI ID.
  3. Yes, the region in which Canonical has published the AMI matches the region in which I am instantiating my cluster.
  4. Yes, the version of the AMI matches the version of the EKS cluster given in the Terraform.

I'm going to try launching a cluster with the approach discussed in the source code comment I linked to above. Regardless, whatever the solution is, it would be great if the documentation could be updated to explain how to achieve this. The module input documentation nowhere exposes an ami_id parameter for supplying a custom AMI ID, which makes it seem like launching a custom AMI may be impossible, despite the AWS documentation saying to use the "CUSTOM" keyword and the aws_eks_managed_node_group parent module exposing an ami_id attribute. Thanks!

  • [✅] ✋ I have searched the open/closed issues and my issue is not listed.

Versions

  • Module version [Required]: 20.8.5

  • Terraform version: Terraform v1.10.4

  • Provider version(s):

  • provider registry.terraform.io/hashicorp/aws v5.47.0
  • provider registry.terraform.io/hashicorp/cloudinit v2.3.4
  • provider registry.terraform.io/hashicorp/null v3.2.2
  • provider registry.terraform.io/hashicorp/random v3.6.1
  • provider registry.terraform.io/hashicorp/time v0.11.1
  • provider registry.terraform.io/hashicorp/tls v4.0.5

Reproduction Code [Required]

module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.8.1"

name = "education-vpc"

cidr = "10.0.0.0/16"
azs = slice(data.aws_availability_zones.available.names, 0, 3)

private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]

enable_nat_gateway = true
single_nat_gateway = true
enable_dns_hostnames = true

public_subnet_tags = {
"kubernetes.io/role/elb" = 1
}

private_subnet_tags = {
"kubernetes.io/role/internal-elb" = 1
}
}

module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "20.8.5"

cluster_name = local.cluster_name
cluster_version = "1.29"

cluster_endpoint_public_access = true
enable_cluster_creator_admin_permissions = true

cluster_addons = {
aws-ebs-csi-driver = {
service_account_role_arn = module.irsa-ebs-csi.iam_role_arn
}
}

vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets

eks_managed_node_group_defaults = {
ami_type = "CUSTOM"

}

eks_managed_node_groups = {
one = {
name = "node-group-1"
ami_id = "ami-065b49d435df033f6"
instance_types = ["t3.small"]

  min_size     = 1
  max_size     = 3
  desired_size = 2
}

two = {
  name = "node-group-2"
  ami_id = "ami-065b49d435df033f6"
  instance_types = ["t3.small"]

  min_size     = 1
  max_size     = 2
  desired_size = 1
}

}
}

data "aws_iam_policy" "ebs_csi_policy" {
arn = "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
}

module "irsa-ebs-csi" {
source = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
version = "5.39.0"

create_role = true
role_name = "AmazonEKSTFEBSCSIRole-${module.eks.cluster_name}"
provider_url = module.eks.oidc_provider
role_policy_arns = [data.aws_iam_policy.ebs_csi_policy.arn]
oidc_fully_qualified_subjects = ["system:serviceaccount:kube-system:ebs-csi-controller-sa"]
}

Expected behavior

Node joins the cluster.

Actual behavior

Node fails to join the cluster.

@bryantbiggs
Member

You haven't provided any user data for the node to join the cluster.

@jonassteinberg1
Author

jonassteinberg1 commented Jan 17, 2025

@bryantbiggs thanks for the prompt response. So given the comment here, will enable_bootstrap_user_data = true be enough? Or does using a custom AMI always mean also providing some type of custom user data? This is simply the Canonical EKS-optimized AMI for 1.29 and us-east-1.

@bryantbiggs
Member

I don't know what Canonical's images require; they don't come from EKS. I suspect you'll have to supply the user data yourself.

In general, use the EKS-optimized AMIs - why does the host OS matter here?
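
That said, the module does expose knobs for this. Something along the lines of the following might be enough (untested, hypothetical sketch; it assumes the Canonical image ships an AL2-compatible /etc/eks/bootstrap.sh and that your module version supports these inputs - see docs/user_data.md for the authoritative behavior):

eks_managed_node_groups = {
  one = {
    name           = "node-group-1"
    ami_id         = "ami-065b49d435df033f6"
    instance_types = ["t3.small"]

    # With a custom ami_id the module does not render bootstrap user data by
    # default; this flag asks it to do so, injecting the cluster name,
    # endpoint, and certificate authority data into the bootstrap call.
    enable_bootstrap_user_data = true

    # Optional: shell that runs before the bootstrap script, for any
    # host-level prep the image needs (left empty here on purpose).
    pre_bootstrap_user_data = ""

    min_size     = 1
    max_size     = 3
    desired_size = 2
  }
}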

@jonassteinberg1
Author

jonassteinberg1 commented Jan 17, 2025

I don't know what Canonical's images require; they don't come from EKS.

I know, but if we're being practical here, it doesn't get much more proximal than Canonical's EKS images. There is documentation on AWS here and on Canonical here about building EKS with Ubuntu, but neither discusses Terraform, hence opening my issue in this repo. Naturally I understand folks don't have time to go around documenting the user data approach for every AMI under the sun, but Canonical is probably what, 87% of custom use-cases? Anyway, it seems that after many hours of looking into this, the work I needed to do to open this issue may get me to where I need to be, idk.

why does the host OS matter here

This is probably a valid point. I recently had some extremely difficult experiences running ML work on AL hosts, and it has made me very sensitive to using AL at all, but in an ideal world what you're saying is true: if I never need to do much on the host, why should the host matter? Maybe. Right now where I'm at is that, in my opinion, AL is not as well maintained as the Canonical and RHEL variants, so I'd prefer not to bring it into production, though I guess there are a million customers using AL on EKS without issue. But documentation is more scarce for AL in general, and especially in the ML space, so if anything needs to happen on the host - and I get it, it's Kubernetes, so the host is "probably" less important - it can become a real pain.

I would think this module should definitely call out that you can't expect to just set ami_type = "CUSTOM", supply an ami_id, and go, but maybe that's so trivial as to be expected. Whatever the case, you can close this issue; I think, from the YAML examples I see from Canonical on launching on EKS, I'll be able to get it working via Terraform by providing the user data. Thanks again for the prompt responses.

@bryantbiggs
Member

but Canonical is probably what, 87% of custom use-cases?

Actually it is significantly less than 5% (disclosure: I work at AWS on EKS)

I recently had some extremely difficult experiences running ML work on AL hosts

I also focus on the ML workloads on EKS - we have:

  • AL2023_x86_64_NVIDIA
  • BOTTLEROCKET_x86_64_NVIDIA

If you want to utilize NVIDIA GPUs, these come with the correct NVIDIA driver, the NVIDIA container toolkit, etc., as well as things like EFA pre-installed, so they work out of the box. For the AL2023 variant, all you need to provide are the device plugins (the NVIDIA device plugin, and the EFA device plugin if using EFA). The Bottlerocket variants have the NVIDIA device plugin baked into their AMIs.

We also have the equivalent for Neuron devices if you want to use Inferentia or Trainium instances:

  • AL2023_x86_64_NEURON
  • BOTTLEROCKET_x86_64 (contains Neuron components)

For more info on the AL2023 accelerated variants, you can refer to the launch blog I authored when we first released them late last year: https://aws.amazon.com/blogs/containers/amazon-eks-optimized-amazon-linux-2023-accelerated-amis-now-available/
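
For example, selecting one of these through the module looks roughly like the following (hypothetical sketch - the node group name and instance type are placeholders, and the accelerated ami_type values require module/provider versions that know about them):

eks_managed_node_groups = {
  gpu = {
    # EKS-optimized accelerated AMI selected by type - no custom ami_id and
    # no extra user data wiring needed.
    ami_type       = "AL2023_x86_64_NVIDIA"
    instance_types = ["g5.xlarge"]

    min_size     = 1
    max_size     = 1
    desired_size = 1
  }
}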

I would think this module should definitely call out that you can't expect to just set ami_type = "CUSTOM", supply an ami_id, and go, but maybe that's so trivial as to be expected

I would refer you to our user data documentation, where this is covered: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/user_data.md
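
And if you truly need full control over the rendered user data (for example, for an image that does not ship an AL2-compatible bootstrap script), the node group definitions also accept a custom template. A rough, hypothetical sketch - the template file name here is made up for illustration, and the exact variables available to the template are listed in docs/user_data.md:

eks_managed_node_groups = {
  one = {
    ami_type       = "CUSTOM"
    ami_id         = "ami-065b49d435df033f6"
    instance_types = ["t3.small"]

    # Hand the module your own user data template; it is rendered with the
    # cluster connection details (name, endpoint, certificate authority data).
    user_data_template_path = "${path.module}/templates/ubuntu_user_data.tpl"

    min_size     = 1
    max_size     = 2
    desired_size = 1
  }
}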

Right now where I'm at is that, in my opinion, AL is not as well maintained as the Canonical and RHEL variants, so I'd prefer not to bring it into production, though I guess there are a million customers using AL on EKS without issue. But documentation is more scarce for AL in general, and especially in the ML space, so if anything needs to happen on the host - and I get it, it's Kubernetes, so the host is "probably" less important - it can become a real pain.

What started as a "how to use CUSTOM_AMI" question has now moved closer to the root issue (this is why details and motivational context are important) - what issues are you encountering with your ML workload, or with Amazon Linux in general? In AWS, Amazon Linux dominates OS usage, well above any Ubuntu or RHEL usage, and we have an entire team dedicated not only to the OS but also contributing to the upstream kernel and various projects. In short, it is very well maintained and will always be the recommendation within AWS (or Bottlerocket, which is built off what the AL team provides in terms of components). There is also the important detail that when you use what is provided by a managed service offering such as AWS and/or EKS, you get support when issues arise. When you go your own way and opt out of what's provided, the teams cannot support that (as is the case with an Ubuntu AMI - neither support nor service teams will be able to assist with issues there).

@jonassteinberg1
Author

jonassteinberg1 commented Jan 17, 2025

Actually it is significantly less than 5% (disclosure: I work at AWS on EKS)

I mean custom use cases outside of AL. I'm willing to bet Ubuntu is in the lead. But whatever the case, it's a core OS, so it's extremely reasonable to expect someone to want to launch a custom AMI with it.

I also focus on the ML workloads on EKS

The issue I had was not with an AL AMI; it was with an AL container runtime image for AWS Lambda Docker. I found that a common utility I needed to run is not offered through the amazon-linux-extras package management, which to some extent has to do with the utility's provider, but it also made me realize that the ML community uses AL less. Because of this I started to get bogged down in other Linux-related issues as I attempted to make things work with the AL Lambda Docker image. Given that experience I wasn't keen on running EKS with AL, tbh.

I would refer you to our user data documentation

Suggestion: there should be a callout on this repo's main module page explicitly associating running a custom AMI with handling user data. I'm glad the documentation exists, but it would be helpful if there were a sentence on the main module page that says "if you want to run a custom AMI, you need to read this page." There is a comment about seeing the "AWS documentation" if you want to run a custom AMI, but I don't think that page would make someone think "oh, I need to go read the user data page back in the module documentation" - though I could be wrong about that. If it does, ignore me.
