ClusterConfiguration should support tolerations #413
Comments
Since I submitted this, I've come across the need to add other fields, such as nodeSelectors, schedulerName, annotations, etc., to my Ray clusters. As I don't think we can anticipate all the needs of users, I think we should build more flexibility into cluster creation. I'd like to be able to read the Ray cluster spec and submit a Python dictionary to update the Ray cluster. For example:

```python
update_dict = {
    "spec": {
        "workerGroupSpecs": [
            {
                "groupName": "small-group-raycluster",
                "template": {
                    "spec": {
                        "tolerations": [
                            {
                                "key": "nvidia.com/gpu",
                                "operator": "Exists",
                                "effect": "NoSchedule",
                            }
                        ]
                    }
                }
            }
        ]
    }
}
```

That way I could update not just tolerations, but any other field I need to change, such as nodeSelector or schedulerName.
I totally agree that the current API design is not flexible enough. I really like your suggestion to offer users the option to provide their own patch. We could try to stay in line with the patching mechanisms that are already well established in Kubernetes: https://kubernetes.io/docs/tasks/manage-kubernetes-objects/update-api-object-kubectl-patch/.
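For reference, here is a rough sketch of how such a patch can already be applied outside the SDK, using the Kubernetes Python client against the RayCluster custom resource. The cluster name, namespace, and API version below are illustrative assumptions (KubeRay uses the `ray.io` group, with `v1alpha1` or `v1` depending on the release):

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config() inside a Pod).
config.load_kube_config()

api = client.CustomObjectsApi()

# Apply update_dict (from the comment above) as a patch on the RayCluster custom resource.
# group/version/plural/namespace/name are assumptions for illustration.
api.patch_namespaced_custom_object(
    group="ray.io",
    version="v1alpha1",
    plural="rayclusters",
    namespace="my-namespace",
    name="small-raycluster",
    body=update_dict,
)
```

With a dictionary body the client sends a merge patch (a JSON patch would take a list of operations instead), which lines up with the patching mechanisms described in the documentation linked above.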
We could add two keyword arguments, one to provide the Pod template for the worker nodes and another for the head node Pod template, using the official Kubernetes Python client types so auto-complete works, e.g.:

```python
from kubernetes.client import V1PodTemplateSpec, V1PodSpec, V1Toleration

cluster = Cluster(ClusterConfiguration(
    num_workers=N,
    worker_template=V1PodTemplateSpec(
        spec=V1PodSpec(
            tolerations=[V1Toleration(
                key="nvidia.com/gpu",
                operator="Exists",
                effect="NoSchedule",
            )],
            node_selector={
                "nvidia.com/gpu.present": "true",
            },
        )
    ),
    head_template=V1PodTemplateSpec(...),
))
```

These types come from https://github.com/kubernetes-client/python. @cfchase let us know if that's close enough to what you had in mind.
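As a sketch of how the SDK could consume such typed templates, assuming it keeps building the RayCluster resource as plain dictionaries internally (the helper below is hypothetical, not part of the SDK): the client's `ApiClient` can serialize the typed objects back into the camelCase dictionaries expected in the custom resource.

```python
from kubernetes.client import ApiClient, V1PodTemplateSpec

def pod_template_to_dict(template: V1PodTemplateSpec) -> dict:
    """Hypothetical helper: convert a typed Pod template into the camelCase dict
    used in the RayCluster spec (e.g. node_selector -> nodeSelector), dropping
    unset fields along the way."""
    return ApiClient().sanitize_for_serialization(template)
```

The resulting dictionary could then be merged into the generated `headGroupSpec.template` and `workerGroupSpecs[].template` fields.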
ClusterConfiguration should support tolerations
Run Ray clusters (especially the worker pods) on tainted nodes.
Description of Problem the Feature Should Solve
You cannot create a Ray cluster with tolerations using the CodeFlare SDK:

```python
cluster = Cluster(ClusterConfiguration(...))
```
Often, machine nodes are tainted to prevent unwanted workloads; this is especially the case for GPU nodes. In addition, different nodes may have different sized GPUs, and taints are also used to make sure the correct workers land on the correct nodes.
You might also want to add a toleration to the `headGroupSpec`.
Describe the Solution You Would Like to See
Add `worker_tolerations` and `head_tolerations` as optional parameters for `ClusterConfiguration`.
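A minimal sketch of what the proposed parameters could look like in use; the parameter names follow the suggestion above and do not exist in the SDK today, and the import path and toleration values are illustrative:

```python
from codeflare_sdk import Cluster, ClusterConfiguration

# worker_tolerations / head_tolerations are the proposed (not yet existing) parameters.
cluster = Cluster(ClusterConfiguration(
    name="raycluster",
    namespace="my-namespace",
    num_workers=2,
    worker_tolerations=[
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
    ],
    head_tolerations=[
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
    ],
))
```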
Describe Alternatives You Have Considered
Editing the YAML file and just using KubeRay directly. You can currently manually edit an AppWrapper YAML to include a toleration for these taints.