Bottlerocket GPU deployment issue after updated EKS module from 19.21.0 to 20.31.6 #3257
Comments
This is the error received when upgrading from v19 to v20:
Here is the user data that is passed to the Bottlerocket CPU nodes in version 19:
Here is the user data that is passed to the GPU instance in version 19:
[settings.kubernetes]
Now, after the upgrade to version 20, this is what it looks like for the CPU nodes:
settings.kubernetes.cluster-name = 'bigbang-development-28i'
For the GPU instances (I had to delete the managed node group from the EKS console and rerun Terraform to get them to build):
settings.kubernetes.cluster-name = 'bigbang-development-28i'
The AMI ID is the same across the EKS module versions; only the GPU nodes on version 20 will not join the cluster.
Description
I am trying to upgrade from 19.21.0 to 20.31.6. In version 19.21.0 I was able to deploy the managed node groups below with Bottlerocket AMIs, and both the generic CPU and GPU nodes joined the cluster. After the move to version 20, the generic CPU nodes join the cluster just fine, but the GPU nodes never join, even though I'm using the same block of code for the user data as in version 19.21.0. I am also unable to connect to the GPU nodes via SSM to troubleshoot further, even though they have the same IAM role attached as the CPU nodes.
My EKS version is 1.31 and the Bottlerocket AMI release version is 1.29.0-c55d099c.
Here is the first part of the module call:
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.31.6"

  cluster_name                   = local.cluster_name
  cluster_version                = local.env.cluster.cluster_version
  cluster_endpoint_public_access = true

  cluster_timeouts = {
    create = "2h" # Timeout for creating the EKS cluster
    update = "2h" # Timeout for updating the EKS cluster
    delete = "2h" # Timeout for deleting the EKS cluster
  }

  authentication_mode                      = "API_AND_CONFIG_MAP"
  enable_cluster_creator_admin_permissions = true
Jumping down to the managed groups:
  eks_managed_node_groups = {
  }

  tags = local.env.tags

  depends_on = [module.vpc, module.elb, module.elb_passthrough]
}
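For context, here is a minimal sketch of what a Bottlerocket GPU managed node group can look like in v20 of the module. It is not the configuration from this issue: the group name `gpu`, the instance type, the sizes, and the `[settings.kernel]` block are placeholders chosen for illustration. As far as I understand it, v20 selects the user data format from `ami_type`, and `bootstrap_extra_args` carries additional Bottlerocket TOML settings.

```hcl
# Hypothetical sketch: every value below is a placeholder, not taken from this issue.
eks_managed_node_groups = {
  gpu = {
    # Bottlerocket variant with NVIDIA drivers (an EKS managed node group AMI type)
    ami_type       = "BOTTLEROCKET_x86_64_NVIDIA"
    instance_types = ["g5.xlarge"]

    min_size     = 1
    max_size     = 2
    desired_size = 1

    # Extra Bottlerocket settings, written as TOML, passed through to the user data
    bootstrap_extra_args = <<-EOT
      [settings.kernel]
      lockdown = "integrity"
    EOT
  }
}
```

Comparing the user data rendered for a group like this against the v19 output may help show whether a settings table such as `[settings.kubernetes]` ends up defined differently between the two module versions.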