
Add v6e TPU Head Resource Autoscaling Support #48201

Merged
merged 9 commits into ray-project:master on Dec 3, 2024

Conversation

ryanaoleary (Contributor)

Why are these changes needed?

This PR adds tpu-v6e-slice to the list of known TPU accelerators, enabling the KubeRay autoscaler to add a TPU-v6e...-Head resource to the autoscaling resource config. This PR also adds a unit test case to test_tpu_node_selectors_to_type to cover this change.
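For reference, below is a minimal sketch of the kind of mapping this enables, written as a hypothetical standalone helper. It is not the actual Ray implementation; the function name and the generation/cores-per-chip values are illustrative assumptions.

from typing import Optional

# Assumed mapping from the GKE TPU accelerator node selector value to the
# generation label used in the resource name and the TensorCores exposed per
# chip for that generation.
TPU_GENERATIONS = {
    "tpu-v4-podslice": ("v4", 2),
    "tpu-v5p-slice": ("v5p", 2),
    "tpu-v5-lite-podslice": ("v5e", 1),
    "tpu-v6e-slice": ("v6e", 1),  # the accelerator recognized by this PR
}

def tpu_head_resource_name(accelerator: str, topology: str) -> Optional[str]:
    """Derive a head resource name such as "TPU-v6e-16-head" from the
    cloud.google.com/gke-tpu-accelerator and gke-tpu-topology selectors."""
    if accelerator not in TPU_GENERATIONS:
        return None
    generation, cores_per_chip = TPU_GENERATIONS[accelerator]
    # A topology like "4x4" (or "2x2x2" for 3D slices) describes the chip
    # grid; the product of its dimensions is the number of chips in the slice.
    num_chips = 1
    for dim in topology.split("x"):
        num_chips *= int(dim)
    return f"TPU-{generation}-{num_chips * cores_per_chip}-head"

print(tpu_head_resource_name("tpu-v6e-slice", "4x4"))  # -> TPU-v6e-16-head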

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ryanaoleary requested review from hongchaodeng and a team as code owners on October 22, 2024, 21:13
ryanaoleary (Contributor, Author)

cc: @kevin85421 @andrewsykim

ryanaoleary (Contributor, Author)

This PR was manually tested as follows:

  1. Build Ray from source with these changes.
  2. Create an autoscaling RayCluster CR with the following TPU worker group:
  - replicas: 0
    minReplicas: 0
    maxReplicas: 10
    numOfHosts: 4
    groupName: tpu-group
    rayStartParams: {}
    template:
      spec:
        securityContext:
          runAsUser: 0
        containers:
        - name: ray-worker
          image: $DEV-IMAGE
          imagePullPolicy: Always
          resources:
            limits:
              cpu: "24"
              ephemeral-storage: 10Gi
              google.com/tpu: "4"
              memory: 40G
            requests:
              cpu: "24"
              ephemeral-storage: 10Gi
              google.com/tpu: "4"
              memory: 40G
          env:
          - name: NODE_IP
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: VBAR_CONTROL_SERVICE_URL
            value: $(NODE_IP):8353
          - name: JAX_PLATFORMS
            value: tpu,cpu
          - name: ENABLE_PJRT_COMPATIBILITY
            value: "true"
          ports:
          - containerPort: 8081
            name: mxla
        nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
            cloud.google.com/gke-tpu-topology: 4x4
  3. Verify the TPU head resource is added to the autoscaler resources after a replica is scaled up:
Usage:
 0.0/96.0 CPU
 0.0/16.0 TPU
 0.0/1.0 TPU-v6e-16-head
 0B/150.87GiB memory
 0B/45.02GiB object_store_memory
 0.0/4.0 tpu-group-0

Demands:
 (no resource demands)
2024-10-22 21:23:50,547 - INFO - The autoscaler took 0.046 seconds to complete the update iteration.
2024-10-22 21:23:50,547 INFO autoscaler.py:470 -- The autoscaler took 0.046 seconds to complete the update iteration.
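
As an additional sanity check (not part of this PR), the head resource can also be confirmed from a Ray driver once the replica is up, using the standard ray.cluster_resources() API; the connection setup below is an assumption and depends on the deployment:

import ray

# Connect to the running RayCluster (e.g. from a pod inside the cluster).
ray.init(address="auto")

# After the TPU worker replica scales up, the head resource should appear
# alongside the per-chip TPU resource, matching the Usage output above.
resources = ray.cluster_resources()
print(resources.get("TPU-v6e-16-head"))  # expected: 1.0
print(resources.get("TPU"))              # expected: 16.0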

andrewsykim (Contributor) left a comment


LGTM, just a minor nit

python/ray/autoscaler/_private/kuberay/utils.py (review thread, outdated and resolved)
@kevin85421 self-assigned this Nov 6, 2024
ryanaoleary (Contributor, Author)

@kevin85421 @hongchaodeng Pinging this again to see if it can get reviewed/approved by a code owner. With v6e TPUs in private preview in GKE, it would be good to ensure Ray autoscaler support is in place.

default_num_cores_per_chip = 2
if generation == "v5e":
    default_num_cores_per_chip = 1
Member

How can I determine the exact value of default_num_cores_per_chip to verify that this logic is correct? I briefly used Ctrl + F to search for some keywords in https://cloud.google.com/kubernetes-engine/docs/how-to/tpus#run, but I couldn't find the information.

ryanaoleary (Contributor, Author)

The relationship between cores and the number of chips for each TPU generation can be found in one place in the Cloud TPU documentation for each version: under System Architecture, it states the number of TensorCores per TPU chip.
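
To make the arithmetic concrete for the configuration tested above (assuming v6e, like v5e, exposes one TensorCore per chip): a 4x4 topology is 4 × 4 = 16 chips, and 16 chips × 1 core per chip gives 16, which matches the TPU-v6e-16-head resource reported in the autoscaler output earlier in this thread.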

@kevin85421 added the go (add ONLY when ready to merge, run all tests) label on Dec 2, 2024
kevin85421 (Member)

I have already pinged my colleague to merge this PR.

@jjyao merged commit eab7c3f into ray-project:master on Dec 3, 2024
5 checks passed