[Feature][RayCluster]: Implement the HeadReady condition #2261
Conversation
	return false, "HeadPodNotRunning", "Head pod is not running"
}
for _, cond := range pod.Status.Conditions {
	if cond.Type == corev1.PodReady && cond.Status != corev1.ConditionTrue {
Could we use the utils.IsRunningAndReady here?
My thought was that utils.IsRunningAndReady only gives us whether the Pod phase is Running and the Pod is ready; it doesn't tell us the finer-grained cause, i.e., whether the Pod is not running yet or not ready yet (the raylet may not have passed the readiness check yet), so I wrote my own check.
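For reference, a minimal sketch of the kind of hand-rolled check being described, reconstructed from the diff excerpt above (the function name is illustrative, not necessarily what the PR uses):

// corev1 is "k8s.io/api/core/v1".
func checkHeadPodRunningAndReady(pod *corev1.Pod) (bool, string, string) {
	// First distinguish "not running yet" from "running but not ready yet".
	if pod.Status.Phase != corev1.PodRunning {
		return false, "HeadPodNotRunning", "Head pod is not running"
	}
	// Then check the PodReady condition, which the raylet readiness probe feeds into.
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady && cond.Status != corev1.ConditionTrue {
			return false, "HeadPodNotReady", "Head pod is not ready"
		}
	}
	return true, "HeadPodRunningAndReady", "Head pod is running and ready"
}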
@@ -1180,6 +1188,21 @@ func (r *RayClusterReconciler) calculateStatus(ctx context.Context, instance *ra
	newInstance.Status.State = rayv1.Ready
}

// Check if the head node is running and ready with proper reason and message
if features.Enabled(features.RayClusterStatusConditions) {
	isRayHeadRunning, reason, message := utils.CheckRayHeadRunningAndReady(ctx, runtimePods)
I think we can use PodReady directly. We don't need to check whether the Pod is running, and we don't need to generate the reason and the message ourselves.
Thanks for the suggestion. I tried using PodReady as much as possible and noticed that the Pod status condition's reason and message are empty all the time. Since metav1.Condition needs to have a non-empty reason:
// reason contains a programmatic identifier indicating the reason for the condition's last transition.
// Producers of specific condition types may define expected values and meanings for this field,
// and whether the values are considered a guaranteed API.
// The value should be a CamelCase string.
// This field may not be empty.
// +required
// +kubebuilder:validation:Required
// +kubebuilder:validation:MaxLength=1024
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:Pattern=`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`
Reason string `json:"reason" protobuf:"bytes,5,opt,name=reason"`
otherwise we will see this error during reconcile:
{"level":"info","ts":"2024-07-23T00:24:52.098Z","logger":"controllers.RayCluster","msg":"Error updating status","RayCluster":{"name":"rayservice-sample-raycluster-45jnx","namespace":"default"},"reconcileID":"2fb18e10-3cd9-41a0-b9e8-4660293fb634","name":"rayservice-sample-raycluster-45jnx","error":"RayCluster.ray.io \"rayservice-sample-raycluster-45jnx\" is invalid: status.conditions[0].reason: Invalid value: \"\": status.conditions[0].reason in body should be at least 1 chars long","RayCluster":{"apiVersion":"ray.io/v1","kind":"RayCluster","namespace":"default","name":"rayservice-sample-raycluster-45jnx"}}
Thus I added the reasons HeadPodRunningAndReady and HeadPodNotReady respectively.
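For illustration, a minimal sketch of setting such a condition so that apiserver validation passes (meta is "k8s.io/apimachinery/pkg/api/meta"; the reason string is the one named above):

headReady := metav1.Condition{
	Type:    string(rayv1.HeadReady),
	Status:  metav1.ConditionTrue,
	Reason:  "HeadPodRunningAndReady", // must be non-empty, CamelCase, at most 1024 chars
	Message: "Head pod is running and ready",
}
// SetStatusCondition also maintains LastTransitionTime on transitions.
meta.SetStatusCondition(&newInstance.Status.Conditions, headReady)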
Could you also rebase with the master branch? Thanks!
replicaHeadReadyCondition := metav1.Condition{
	Type:   string(rayv1.HeadReady),
	Status: metav1.ConditionFalse,
	Reason: "HeadPodNotReady",
This reason is not particularly useful because it's already implied by the HeadReady condition being False. Can the reason be more specific, like CrashLoopBack, ContainerCreating, Pending, etc.?
I hope to make the condition consistent with the Pod's PodReady condition, where both Reason and Message are empty. However, @cchen777 found that metav1.Condition has a limitation where Reason must not be empty, which is different from PodCondition. See #2261 (comment) for more details.
We can't keep Reason and Message empty due to metav1.Condition's limitation, but I still want to make our HeadReady condition closer to PodReady. Hence, I may prefer to keep the Reason and Message simple ("HeadPodNotReady" is OK for me). We can add more details, such as CrashLoopBack, ContainerCreating, and Pending, when we observe enough use cases in the future.
I am curious whether there are obvious use cases for details about whether the Pod is ready or not. I assume that if there were obvious use cases, the upstream PodCondition would not leave Reason and Message empty.
Regarding my earlier statement that "pod status condition's reason and message are empty all the time": apologies, that was actually not true. Reason is empty when the Pod is running and ready, but when the Pod is not ready yet there are actually non-empty reasons. I think we can propagate the reason from the Pod itself, but keep a customized reason HeadPodRunningAndReady when it's up and running, to fit metav1.Condition's limitation.
The reason being empty, or something like HeadPodReady when the pod is ready, is fine I think. But when it's not ready, the reason should never be empty, because we have information from the Pod about why it's not ready (failing readiness probe, crashing, OOMKill, etc.).
Updated the logic to use:
kuberay/ray-operator/controllers/ray/utils/util.go
Lines 82 to 96 in 41bd563
if cond.Type == corev1.PodReady {
	if cond.Status == corev1.ConditionTrue {
		replicaPodReadyCondition = metav1.Condition{
			Type:    string(condType),
			Status:  metav1.ConditionTrue,
			Reason:  v1.PodRunningAndReady, // metav1.Condition.Reason requires a non-empty value
			Message: cond.Message,
		}
	} else {
		replicaPodReadyCondition = metav1.Condition{
			Type:    string(condType),
			Status:  metav1.ConditionFalse,
			Reason:  cond.Reason, // PodReady condition comes with a reason when it's not ready, e.g. ContainersNotReady
			Message: cond.Message,
		}
kubectl get raycluster rayservice-sample-raycluster-tnm5m -o json | jq .status
{
"conditions": [
{
"lastTransitionTime": "2024-07-25T04:25:32Z",
"message": "containers with unready status: [ray-head]",
"reason": "ContainersNotReady",
"status": "False",
"type": "Ready"
}
],
"desiredCPU": "2500m",
"desiredGPU": "0",
"desiredMemory": "4Gi",
"desiredTPU": "0",
"desiredWorkerReplicas": 1,
"endpoints": {
"client": "10001",
"dashboard": "8265",
"gcs-server": "6379",
"metrics": "8080",
"serve": "8000"
},
"head": {
"podIP": "10.244.0.9",
"podName": "rayservice-sample-raycluster-tnm5m-head-m8h95",
"serviceIP": "10.244.0.9",
"serviceName": "rayservice-sample-raycluster-tnm5m-head-svc"
},
"lastUpdateTime": "2024-07-25T04:25:54Z",
"maxWorkerReplicas": 5,
"minWorkerReplicas": 1,
"observedGeneration": 1
}
}

replicaHeadReadyCondition := metav1.Condition{
	Type: string(rayv1.HeadReady),
Maybe we can rename it from HeadReady to HeadPodReady, because the upstream PodCondition uses PodReady instead of Ready for the condition type.
Updated to HeadPodReady with Ready as the value:

HeadPodReady RayClusterConditionType = "Ready"
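For context, a sketch of how this declaration presumably sits in the rayv1 API package (the surrounding type and const block are assumed, not quoted from the PR):

type RayClusterConditionType string

const (
	// HeadPodReady mirrors the upstream PodReady semantics but is surfaced
	// under the condition type "Ready".
	HeadPodReady RayClusterConditionType = "Ready"
)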
#2266 has been merged
@@ -1203,6 +1203,16 @@ func (r *RayClusterReconciler) calculateStatus(ctx context.Context, instance *ra
	newInstance.Status.State = rayv1.Ready
}

// Check if the head node is running and ready by checking the head pod's status.
if features.Enabled(features.RayClusterStatusConditions) {
	headPod, err := common.GetRayClusterHeadPod(ctx, r, newInstance)
It's possible for GetRayClusterHeadPod to return nil, nil when no head Pod exists. We need to handle that case.
cc @rueian, maybe we can return a non-nil error when no head Pod exists?
I found that we always return a non-nil error.
headPod, err := common.GetRayClusterHeadPod(ctx, r, newInstance)
if headPod == nil {
// return an error
}
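A hedged sketch of the guard being suggested, assuming the surrounding helper returns a result and an error; the error wording is illustrative:

headPod, err := common.GetRayClusterHeadPod(ctx, r, newInstance)
if err != nil {
	return nil, err
}
if headPod == nil {
	// GetRayClusterHeadPod can return nil, nil when no head Pod exists yet,
	// so guard before dereferencing.
	return nil, fmt.Errorf("head Pod not found for RayCluster %s/%s", newInstance.Namespace, newInstance.Name)
}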
}

for _, cond := range pod.Status.Conditions {
	if cond.Type == corev1.PodReady {
if cond.Type == corev1.PodReady {
	reason := cond.Reason
	if reason == "" {
		// metav1.Condition.Reason requires a non-empty value, so fall
		// back to a non-empty default such as rayv1.UnknownReason.
		reason = rayv1.UnknownReason
	}
	replicaPodReadyCondition = metav1.Condition{
		Type:    string(condType),
		Status:  metav1.ConditionStatus(cond.Status),
		Reason:  reason,
		Message: cond.Message,
	}
	break
}
@@ -71,6 +72,35 @@ func IsCreated(pod *corev1.Pod) bool {
	return pod.Status.Phase != ""
}

func FindPodReadyCondition(pod *corev1.Pod, condType rayv1.RayClusterConditionType) metav1.Condition {
	replicaPodReadyCondition := metav1.Condition{
Is it possible for pod.Status.Conditions not to have corev1.PodReady? If so, we will create a condition with an empty Reason.
Yes, you are right. I updated the logic here to have an initial Condition:
kuberay/ray-operator/controllers/ray/utils/util.go
Lines 75 to 79 in 01e90f6
replicaPodReadyCondition := metav1.Condition{
	Type:   string(condType),
	Status: metav1.ConditionFalse,
	Reason: rayv1.UnknownReason,
}
plus this logic:
kuberay/ray-operator/controllers/ray/utils/util.go
Lines 95 to 98 in 01e90f6
// Update the reason if it's not empty
if reason != "" {
	replicaPodReadyCondition.Reason = reason
}
to prevent an empty reason from sneaking into the later SetStatusCondition call.
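Putting the two quoted snippets together, the helper roughly reads as follows. This is a sketch assembled from the excerpts above, not a verbatim copy of util.go at 01e90f6; PodRunningAndReady is the constant quoted earlier as v1.PodRunningAndReady:

func FindPodReadyCondition(pod *corev1.Pod, condType rayv1.RayClusterConditionType) metav1.Condition {
	// Start from a safe default so Reason is never empty.
	replicaPodReadyCondition := metav1.Condition{
		Type:   string(condType),
		Status: metav1.ConditionFalse,
		Reason: rayv1.UnknownReason,
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type != corev1.PodReady {
			continue
		}
		replicaPodReadyCondition.Status = metav1.ConditionStatus(cond.Status)
		replicaPodReadyCondition.Message = cond.Message

		reason := cond.Reason
		if cond.Status == corev1.ConditionTrue {
			// metav1.Condition.Reason requires a non-empty value.
			reason = rayv1.PodRunningAndReady
		}
		// Keep the UnknownReason default if the Pod's reason is empty.
		if reason != "" {
			replicaPodReadyCondition.Reason = reason
		}
		break
	}
	return replicaPodReadyCondition
}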
}

// Update the reason if it's not empty
if reason != "" {
this seems to be unnecessary?
I think it could happen when cond.Status != corev1.ConditionTrue and cond.Reason for some reason gives us an empty string, which I think is pretty rare when we are checking cond.Type == PodReady. But just to be safe, I added a check here so it falls back to the default non-empty reason rayv1.UnknownReason.
Why are these changes needed?
In this PR, we do the following:
- Implement the HeadReady condition for RayCluster
- Support --feature-gate in the helm chart

Related issue number

#2237, which is part of ray-project/enhancements#54; this PR implements step 5 in the design doc.
Checks
I've tested the feature with the feature gate enabled and verified that the RayService controller works properly using the new condition.
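With the feature gate enabled, the new condition can be inspected the same way as the status dump above (the cluster name is illustrative):

kubectl get raycluster rayservice-sample-raycluster-tnm5m -o json | jq '.status.conditions'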