cryo: Allow GPU nodes to spawn across AZs #3381

Merged (1 commit) on Nov 6, 2023
7 changes: 7 additions & 0 deletions docs/howto/features/gpu.md
@@ -114,6 +114,9 @@ AWS, and we can configure a node group there to provide us GPUs.
tags+: {
"k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1"
},
// Allow provisioning GPUs across all AZs, to prevent a situation where all
// GPUs in a single AZ are in use and no new nodes can be spawned
availabilityZones: masterAzs,
}
```

@@ -122,6 +125,10 @@ AWS, and we can configure a node group there to provide us GPUs.
1 GPU per node. If you're using a different machine type with
more GPUs, adjust this definition accordingly.

We use a previously defined variable, `masterAzs`, to allow GPU nodes to spawn in
all AZs in the region, rather than just a specific one. This is helpful because a
single zone may run out of GPUs quite quickly.

2. Render the `.jsonnet` file into a `.yaml` file that `eksctl` can use

```bash
3 changes: 3 additions & 0 deletions eksctl/nasa-cryo.jsonnet
@@ -33,6 +33,9 @@ local notebookNodes = [
tags+: {
"k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1"
},
// Allow provisioning GPUs across all AZs, to prevent a situation where all
// GPUs in a single AZ are in use and no new nodes can be spawned
availabilityZones: masterAzs,
},
];

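The documentation and `nasa-cryo.jsonnet` changes above both rely on `masterAzs`, a variable defined earlier in the `.jsonnet` template that holds the availability-zone names for the cluster's region. Below is a rough, hypothetical sketch of how such a variable might be constructed; the `clusterRegion` name and the zone suffixes are illustrative, not taken from the repository:

```jsonnet
// Hypothetical sketch only - the real template defines masterAzs elsewhere.
local clusterRegion = "us-west-2";

// Build the list of AZ names by appending zone suffixes to the region name.
local masterAzs = [clusterRegion + suffix for suffix in ["a", "b", "c"]];

// A GPU node group can then be allowed to spawn in any of these AZs:
{
  availabilityZones: masterAzs,
}
```

Because `availabilityZones` lists every zone rather than a single one, the node group is no longer pinned to one AZ, and new GPU nodes can come up wherever capacity is still available.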
16 changes: 8 additions & 8 deletions terraform/aws/efs.tf
@@ -1,17 +1,15 @@
-// Find out which subnet and security group our EFS mount target should be in
+// Find out which subnets and security group our EFS mount targets should be in
 // It needs to be in the public subnet where our nodes are, as the nodes will be
 // doing the mounting operation. It should be in a security group shared by all
-// the nodes.
-data "aws_subnet" "cluster_node_subnet" {
+// the nodes. We create a mount target in each subnet, even if we primarily put
+// all our nodes in one - this allows GPU nodes to be spread out across
+// AZs when needed
+data "aws_subnets" "cluster_node_subnets" {

   filter {
     name   = "vpc-id"
     values = [data.aws_eks_cluster.cluster.vpc_config[0]["vpc_id"]]
   }
-  filter {
-    name   = "availability-zone"
-    values = [var.cluster_nodes_location]
-  }

   filter {
     name   = "tag:aws:cloudformation:logical-id"
@@ -70,8 +68,10 @@ resource "aws_efs_file_system" "homedirs" {
 }

 resource "aws_efs_mount_target" "homedirs" {
+  for_each = toset(data.aws_subnets.cluster_node_subnets.ids)
+
   file_system_id  = aws_efs_file_system.homedirs.id
-  subnet_id       = data.aws_subnet.cluster_node_subnet.id
+  subnet_id       = each.key
   security_groups = [data.aws_security_group.cluster_nodes_shared_security_group.id]
 }