Merge pull request #3381 from yuvipanda/gpu-mz
cryo: Allow GPU nodes to spawn across AZs
yuvipanda authored Nov 6, 2023
Commit ef1a18e (2 parents: a236bc6 + a767754)
Showing 3 changed files with 18 additions and 8 deletions.
docs/howto/features/gpu.md (7 additions, 0 deletions)

@@ -114,6 +114,9 @@ AWS, and we can configure a node group there to provide us GPUs.
tags+: {
"k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1"
},
// Allow provisioning GPUs across all AZs, to prevent a situation where all
// GPUs in a single AZ are in use and no new nodes can be spawned
availabilityZones: masterAzs,
}
```

@@ -122,6 +125,10 @@ AWS, and we can configure a node group there to provide us GPUs.
1 GPU per node. If you're using a different machine type with
more GPUs, adjust this definition accordingly.

We use a previously defined variable, `masterAzs`, to allow GPU nodes to spawn in all
AZs in the region, rather than just a specific one. This is helpful, as a single
zone may run out of GPUs rather quickly.

2. Render the `.jsonnet` file into a `.yaml` file that `eksctl` can use

```bash
...
```
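To make the documentation snippet above concrete, here is a minimal sketch of what a complete GPU node group entry in a cluster's eksctl `.jsonnet` file might look like. Only the cluster-autoscaler tag and the `availabilityZones: masterAzs` line come from this diff; the AZ names, instance type, and size limits are placeholder assumptions.

```jsonnet
// Sketch only: the AZ list, instance type, and sizes are illustrative values.
local masterAzs = ["us-west-2a", "us-west-2b", "us-west-2c"];

local notebookNodes = [
  {
    instanceType: "g4dn.xlarge",  // assumed GPU machine type with 1 GPU per node
    minSize: 0,
    maxSize: 10,
    tags+: {
      // Tell cluster-autoscaler that this node type provides 1 GPU, so it can
      // scale the group up from zero when a pod requests nvidia.com/gpu
      "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1",
    },
    // Allow provisioning GPUs across all AZs, to prevent a situation where all
    // GPUs in a single AZ are in use and no new nodes can be spawned
    availabilityZones: masterAzs,
  },
];

// Evaluate to the node list so the sketch can be rendered standalone with `jsonnet`.
notebookNodes
```

If a machine type with more GPUs per node is used, the `"1"` in the autoscaler tag should be adjusted to match, as the docs note.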
eksctl/nasa-cryo.jsonnet (3 additions, 0 deletions)

@@ -33,6 +33,9 @@ local notebookNodes = [
tags+: {
"k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1"
},
// Allow provisioning GPUs across all AZs, to prevent a situation where all
// GPUs in a single AZ are in use and no new nodes can be spawned
availabilityZones: masterAzs,
},
];

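As a reference for the render step mentioned in the docs, the `.jsonnet` change above ends up as an explicit `availabilityZones` list on the node group in the generated eksctl `ClusterConfig` YAML. This is a heavily trimmed, hypothetical excerpt: the node group name, region, and AZ names are assumptions, not values from the repository.

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: nasa-cryo    # assumed from the filename; the real config sets this explicitly
  region: us-west-2  # assumed region
nodeGroups:
  - name: gpu-notebook   # hypothetical GPU notebook node group
    instanceType: g4dn.xlarge
    minSize: 0
    maxSize: 10
    tags:
      k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1"
    # Rendered from `availabilityZones: masterAzs` in the jsonnet above
    availabilityZones:
      - us-west-2a
      - us-west-2b
      - us-west-2c
```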
terraform/aws/efs.tf (8 additions, 8 deletions)

@@ -1,17 +1,15 @@
-// Find out which subnet and security group our EFS mount target should be in
+// Find out which subnets and security group our EFS mount target should be in
 // It needs to be in the public subnet where our nodes are, as the nodes will be
 // doing the mounting operation. It should be in a security group shared by all
-// the nodes.
-data "aws_subnet" "cluster_node_subnet" {
+// the nodes. We create a mount target in each subnet, even if we primarily put
+// all our nodes in one - this allows for GPU nodes to be spread out across
+// AZs when needed
+data "aws_subnets" "cluster_node_subnets" {
 
   filter {
     name   = "vpc-id"
     values = [data.aws_eks_cluster.cluster.vpc_config[0]["vpc_id"]]
   }
-  filter {
-    name   = "availability-zone"
-    values = [var.cluster_nodes_location]
-  }
 
   filter {
     name = "tag:aws:cloudformation:logical-id"
@@ -70,8 +68,10 @@ resource "aws_efs_file_system" "homedirs" {
 }
 
 resource "aws_efs_mount_target" "homedirs" {
+  for_each = toset(data.aws_subnets.cluster_node_subnets.ids)
+
   file_system_id  = aws_efs_file_system.homedirs.id
-  subnet_id       = data.aws_subnet.cluster_node_subnet.id
+  subnet_id       = each.key
   security_groups = [data.aws_security_group.cluster_nodes_shared_security_group.id]
 }

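Putting the two `efs.tf` hunks together, the resulting configuration looks roughly like the sketch below: the `aws_subnets` data source now returns every matching subnet in the cluster's VPC instead of only the one in `var.cluster_nodes_location`, and `for_each` creates one EFS mount target per subnet. The second filter's values are truncated in the diff, so that value is an assumption; the EFS file system, EKS cluster data source, and security group are defined elsewhere in the same Terraform module.

```hcl
// Find every subnet our nodes could land in, so EFS can be mounted from any AZ
data "aws_subnets" "cluster_node_subnets" {
  filter {
    name   = "vpc-id"
    values = [data.aws_eks_cluster.cluster.vpc_config[0]["vpc_id"]]
  }

  filter {
    name   = "tag:aws:cloudformation:logical-id"
    values = ["SubnetPublic*"] // assumed value; the real filter values are not shown in this diff
  }
}

// One mount target per subnet, so GPU nodes spread across AZs can all mount homedirs
resource "aws_efs_mount_target" "homedirs" {
  for_each = toset(data.aws_subnets.cluster_node_subnets.ids)

  file_system_id  = aws_efs_file_system.homedirs.id
  subnet_id       = each.key
  security_groups = [data.aws_security_group.cluster_nodes_shared_security_group.id]
}
```

Creating a mount target in every subnet up front means a GPU node scheduled into a previously unused AZ can still mount home directories without further infrastructure changes.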
