
Add v6e TPU Head Resource Autoscaling Support #48201

Merged
merged 9 commits into ray-project:master on Dec 3, 2024

Conversation

ryanaoleary (Contributor)

Why are these changes needed?

This PR adds tpu-v6e-slice to the list of known TPU accelerators, enabling the KubeRay autoscaler to add a TPU-v6e...-Head resource to the autoscaling resource config. This PR also adds a unit test case to test_tpu_node_selectors_to_type to cover this change.
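For reference, below is a minimal sketch of the kind of mapping this enables, written as a hypothetical standalone helper. It is not the actual Ray implementation; the function name and the generation/cores-per-chip values are illustrative assumptions.

from typing import Optional

# Assumed mapping from the GKE TPU accelerator node selector value to the
# generation label used in the resource name and the TensorCores exposed per
# chip for that generation.
TPU_GENERATIONS = {
    "tpu-v4-podslice": ("v4", 2),
    "tpu-v5p-slice": ("v5p", 2),
    "tpu-v5-lite-podslice": ("v5e", 1),
    "tpu-v6e-slice": ("v6e", 1),  # the accelerator recognized by this PR
}

def tpu_head_resource_name(accelerator: str, topology: str) -> Optional[str]:
    """Derive a head resource name such as "TPU-v6e-16-head" from the
    cloud.google.com/gke-tpu-accelerator and gke-tpu-topology selectors."""
    if accelerator not in TPU_GENERATIONS:
        return None
    generation, cores_per_chip = TPU_GENERATIONS[accelerator]
    # A topology like "4x4" (or "2x2x2" for 3D slices) describes the chip
    # grid; the product of its dimensions is the number of chips in the slice.
    num_chips = 1
    for dim in topology.split("x"):
        num_chips *= int(dim)
    return f"TPU-{generation}-{num_chips * cores_per_chip}-head"

print(tpu_head_resource_name("tpu-v6e-slice", "4x4"))  # -> TPU-v6e-16-head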

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ryanaoleary requested review from hongchaodeng and a team as code owners on October 22, 2024, 21:13
ryanaoleary (Contributor, Author)

cc: @kevin85421 @andrewsykim

ryanaoleary (Contributor, Author)

This PR was manually tested as follows:

  1. Build Ray from source with these changes.
  2. Create an autoscaling RayCluster CR with the following TPU worker group:
  - replicas: 0
    minReplicas: 0
    maxReplicas: 10
    numOfHosts: 4
    groupName: tpu-group
    rayStartParams: {}
    template:
      spec:
        securityContext:
          runAsUser: 0
        containers:
        - name: ray-worker
          image: $DEV-IMAGE
          imagePullPolicy: Always
          resources:
            limits:
              cpu: "24"
              ephemeral-storage: 10Gi
              google.com/tpu: "4"
              memory: 40G
            requests:
              cpu: "24"
              ephemeral-storage: 10Gi
              google.com/tpu: "4"
              memory: 40G
          env:
          - name: NODE_IP
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: VBAR_CONTROL_SERVICE_URL
            value: $(NODE_IP):8353
          - name: JAX_PLATFORMS
            value: tpu,cpu
          - name: ENABLE_PJRT_COMPATIBILITY
            value: "true"
          ports:
          - containerPort: 8081
            name: mxla
        nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
            cloud.google.com/gke-tpu-topology: 4x4
  3. Verify the TPU head resource is added to the autoscaler resources after a replica is scaled up:
Usage:
 0.0/96.0 CPU
 0.0/16.0 TPU
 0.0/1.0 TPU-v6e-16-head
 0B/150.87GiB memory
 0B/45.02GiB object_store_memory
 0.0/4.0 tpu-group-0

Demands:
 (no resource demands)
2024-10-22 21:23:50,547 - INFO - The autoscaler took 0.046 seconds to complete the update iteration.
2024-10-22 21:23:50,547 INFO autoscaler.py:470 -- The autoscaler took 0.046 seconds to complete the update iteration.
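
As an additional sanity check (not part of this PR), the head resource can also be confirmed from a Ray driver once the replica is up, using the standard ray.cluster_resources() API; the connection setup below is an assumption and depends on the deployment:

import ray

# Connect to the running RayCluster (e.g. from a pod inside the cluster).
ray.init(address="auto")

# After the TPU worker replica scales up, the head resource should appear
# alongside the per-chip TPU resource, matching the Usage output above.
resources = ray.cluster_resources()
print(resources.get("TPU-v6e-16-head"))  # expected: 1.0
print(resources.get("TPU"))              # expected: 16.0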

andrewsykim (Contributor) left a comment


LGTM, just a minor nit

python/ray/autoscaler/_private/kuberay/utils.py (review thread, outdated and resolved)
@kevin85421 self-assigned this Nov 6, 2024
ryanaoleary (Contributor, Author)

@kevin85421 @hongchaodeng Pinging this again to see if it can get reviewed/approved by a code owner. With v6e TPUs in private preview in GKE, it would be good to ensure Ray autoscaler support is in place.

default_num_cores_per_chip = 2
if generation == "v5e":
    default_num_cores_per_chip = 1
Member

How can I determine the exact value of default_num_cores_per_chip to verify that this logic is correct? I briefly used Ctrl + F to search for some keywords in https://cloud.google.com/kubernetes-engine/docs/how-to/tpus#run, but I couldn't find the information.

ryanaoleary (Contributor, Author)

The relationship between cores and the number of chips for each TPU generation can be found in one place in the Cloud TPU documentation for each version: under System Architecture, it states the number of TensorCores per TPU chip.
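
To make the arithmetic concrete for the configuration tested above (assuming v6e, like v5e, exposes one TensorCore per chip): a 4x4 topology is 4 × 4 = 16 chips, and 16 chips × 1 core per chip gives 16, which matches the TPU-v6e-16-head resource reported in the autoscaler output earlier in this thread.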

@kevin85421 added the go (add ONLY when ready to merge, run all tests) label on Dec 2, 2024
kevin85421 (Member)

I have already pinged my colleague to merge this PR.

@jjyao merged commit eab7c3f into ray-project:master on Dec 3, 2024
5 checks passed