[Feature][RayCluster]: Implement the HeadReady condition #2261


Merged: 21 commits merged into ray-project:master on Jul 28, 2024

Conversation

@cchen777 (Contributor) commented Jul 21, 2024

Why are these changes needed?

In this PR, we do the following

  1. Add the implementation of the HeadReady condition.
  2. Update the logic in the RayService controller to use the HeadReady condition to determine whether a Ray head is running, instead of checking the Pod status directly.
  3. Ensure this change is behind a feature gate.
  4. Support --feature-gates in the Helm chart.

Related issue number

#2237, which is part of ray-project/enhancements#54 (step 5 in the design doc).

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests

I've tested the feature with the feature gate enabled and verified that the RayService controller works properly using the new condition:

$ kind delete cluster --name kind && kind create cluster --image=kindest/node:v1.24.0 
$ IMG=kuberay/operator:nightly make docker-build && kind load docker-image kuberay/operator:nightly
$ helm install kuberay-operator --set image.repository=kuberay/operator --set image.tag=nightly --set "featureGates[0].name=RayClusterStatusConditions,featureGates[0].enabled=true" ../helm-chart/kuberay-operator
$ kubectl apply -f config/samples/ray-service.sample.yaml
# verify feature gate is enabled
$ kubectl get deployment kuberay-operator -o json | jq '.spec.template.spec.containers[0].args'  
[
  "--feature-gates=RayClusterStatusConditions=true"
]
$ kubectl logs -f kuberay-operator-599d9fbfd-2k9dc | grep "Loaded feature gates"
{"level":"info","ts":"2024-07-21T07:31:06.024Z","logger":"setup","msg":"Loaded feature gates","featureGates":{"RayClusterStatusConditions":true}}
# verify ray service is up and running
$ kubectl get rayservice
NAME                SERVICE STATUS   NUM SERVE ENDPOINTS
rayservice-sample   Running          2
# verify conditions are set correctly in raycluster
$ kubectl get raycluster rayservice-sample-raycluster-hjlxx -o json | jq .status.conditions
[
  {
    "lastTransitionTime": "2024-07-21T07:33:16Z",
    "message": "Head pod is running and ready",
    "reason": "HeadPodRunningAndReady",
    "status": "True",
    "type": "HeadReady"
  }
]

@cchen777 cchen777 changed the title [WIP][Feature][RayCluster]: Implement the HeadReady condition [Feature][RayCluster]: Implement the HeadReady condition Jul 21, 2024
@cchen777 cchen777 marked this pull request as ready for review July 21, 2024 07:42
	return false, "HeadPodNotRunning", "Head pod is not running"
}
for _, cond := range pod.Status.Conditions {
	if cond.Type == corev1.PodReady && cond.Status != corev1.ConditionTrue {
Contributor:

Could we use utils.IsRunningAndReady here?

Contributor Author:

My thought was that utils.IsRunningAndReady only gives us the information that the Pod phase is Running and the Pod is ready. It doesn't tell us anything more fine-grained, such as whether that's because the Pod is not running yet or not ready yet (the raylet may not have passed the readiness check yet), so I wrote my own check.
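
For reference, a minimal sketch of the kind of fine-grained check being described. The reason and message strings mirror the diff context above; the function name and exact structure are illustrative, not the PR's final code (which lives in utils.CheckRayHeadRunningAndReady):

package utils

import corev1 "k8s.io/api/core/v1"

// checkHeadPodRunningAndReady distinguishes "not running" from
// "running but not ready" and returns a reason and message for each
// case, unlike a plain boolean IsRunningAndReady check.
func checkHeadPodRunningAndReady(pod *corev1.Pod) (bool, string, string) {
	if pod.Status.Phase != corev1.PodRunning {
		return false, "HeadPodNotRunning", "Head pod is not running"
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady && cond.Status != corev1.ConditionTrue {
			// e.g. the raylet has not passed its readiness probe yet
			return false, "HeadPodNotReady", "Head pod is running but not ready"
		}
	}
	return true, "HeadPodRunningAndReady", "Head pod is running and ready"
}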

@kevin85421 kevin85421 self-assigned this Jul 22, 2024
@@ -1180,6 +1188,21 @@ func (r *RayClusterReconciler) calculateStatus(ctx context.Context, instance *ra
	newInstance.Status.State = rayv1.Ready
}

// Check if the head node is running and ready with proper reason and message
if features.Enabled(features.RayClusterStatusConditions) {
	isRayHeadRunning, reason, message := utils.CheckRayHeadRunningAndReady(ctx, runtimePods)
Member:

I think we can use PodReady directly. We don't need to check whether the Pod is running, and we don't need to generate the reason and the message ourselves.

Contributor Author:

Thanks for the suggestion. I tried using PodReady as much as possible, and noticed that the Pod status condition's Reason and Message are empty all the time. Since metav1.Condition needs to have a non-empty Reason:

// reason contains a programmatic identifier indicating the reason for the condition's last transition.
	// Producers of specific condition types may define expected values and meanings for this field,
	// and whether the values are considered a guaranteed API.
	// The value should be a CamelCase string.
	// This field may not be empty.
	// +required
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MaxLength=1024
	// +kubebuilder:validation:MinLength=1
	// +kubebuilder:validation:Pattern=`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`
	Reason string `json:"reason" protobuf:"bytes,5,opt,name=reason"`

otherwise we will see this error during reconcile:

{"level":"info","ts":"2024-07-23T00:24:52.098Z","logger":"controllers.RayCluster","msg":"Error updating status","RayCluster":{"name":"rayservice-sample-raycluster-45jnx","namespace":"default"},"reconcileID":"2fb18e10-3cd9-41a0-b9e8-4660293fb634","name":"rayservice-sample-raycluster-45jnx","error":"RayCluster.ray.io \"rayservice-sample-raycluster-45jnx\" is invalid: status.conditions[0].reason: Invalid value: \"\": status.conditions[0].reason in body should be at least 1 chars long","RayCluster":{"apiVersion":"ray.io/v1","kind":"RayCluster","namespace":"default","name":"rayservice-sample-raycluster-45jnx"}}

Thus I added the reasons HeadPodRunningAndReady and HeadPodNotReady respectively.
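
As a hedged illustration of what setting such a condition looks like, a fragment assuming the surrounding calculateStatus context (the values mirror this discussion, not necessarily the PR's exact code):

// Assumes the usual imports: meta "k8s.io/apimachinery/pkg/api/meta",
// metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", and the rayv1 API package.
// metav1.Condition validation rejects an empty Reason, so a CamelCase
// reason is supplied explicitly, even though the Pod's own PodReady
// condition leaves Reason blank when the Pod is ready.
headReady := metav1.Condition{
	Type:    string(rayv1.HeadReady),
	Status:  metav1.ConditionTrue,
	Reason:  "HeadPodRunningAndReady",
	Message: "Head pod is running and ready",
}
meta.SetStatusCondition(&newInstance.Status.Conditions, headReady)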

@kevin85421 (Member):

Could you also rebase with the master branch? Thanks!

replicaHeadReadyCondition := metav1.Condition{
	Type:   string(rayv1.HeadReady),
	Status: metav1.ConditionFalse,
	Reason: "HeadPodNotReady",
@andrewsykim (Collaborator) commented Jul 23, 2024:

This reason is not particularly useful because it's already implied by the HeadReady condition being False. Can the reason be more specific, like CrashLoopBackOff, ContainerCreating, Pending, etc.?

Member:

I hope to make the condition consistent with the Pod's PodReady condition, where both Reason and Message are empty. However, @cchen777 found that metav1.Condition has a limitation where Reason must not be empty, which is different from PodCondition. See #2261 (comment) for more details.

We can't keep Reason and Message empty due to metav1.Condition's limitation, but I still want to make our HeadReady condition closer to PodReady. Hence, I'd prefer to keep the Reason and the condition simple ("HeadPodNotReady" is OK for me). We can add more details, such as CrashLoopBackOff, ContainerCreating, and Pending, when we observe enough use cases in the future.

I am curious whether there are obvious use cases for details about whether the Pod is ready or not. I assume that if there were obvious use cases, the upstream PodCondition would not leave Reason and Message empty.

@cchen777 (Contributor Author) commented Jul 24, 2024:

Apologies: my earlier statement that "pod status condition's reason and message are empty all the time" was actually not true. Reason is empty when the Pod is running and ready, but when the Pod is not ready yet, there are actually non-empty reasons. I think we can propagate the reason from the Pod itself, but keep a customized reason HeadPodRunningAndReady when it's up and running, to fit metav1.Condition's limitation.

Collaborator:

The reason being empty, or something like HeadPodReady when the Pod is ready, is fine I think. But when it's not ready, the reason should never be empty, because we have information from the Pod about why it's not ready (failing readiness probe, crashing, OOMKilled, etc.).

Contributor Author:

Updated the logic to use:

if cond.Type == corev1.PodReady {
	if cond.Status == corev1.ConditionTrue {
		replicaPodReadyCondition = metav1.Condition{
			Type:    string(condType),
			Status:  metav1.ConditionTrue,
			Reason:  v1.PodRunningAndReady, // metav1.Condition.Reason requires a non-empty value
			Message: cond.Message,
		}
	} else {
		replicaPodReadyCondition = metav1.Condition{
			Type:    string(condType),
			Status:  metav1.ConditionFalse,
			Reason:  cond.Reason, // PodReady condition comes with a reason when it's not ready, e.g. ContainersNotReady
			Message: cond.Message,
		}
	}
}
so that when the Pod is not ready yet, we get a more accurate reason:

$ kubectl get raycluster rayservice-sample-raycluster-tnm5m -o json | jq .status
{
  "conditions": [
    {
      "lastTransitionTime": "2024-07-25T04:25:32Z",
      "message": "containers with unready status: [ray-head]",
      "reason": "ContainersNotReady",
      "status": "False",
      "type": "Ready"
    }
  ],
  "desiredCPU": "2500m",
  "desiredGPU": "0",
  "desiredMemory": "4Gi",
  "desiredTPU": "0",
  "desiredWorkerReplicas": 1,
  "endpoints": {
    "client": "10001",
    "dashboard": "8265",
    "gcs-server": "6379",
    "metrics": "8080",
    "serve": "8000"
  },
  "head": {
    "podIP": "10.244.0.9",
    "podName": "rayservice-sample-raycluster-tnm5m-head-m8h95",
    "serviceIP": "10.244.0.9",
    "serviceName": "rayservice-sample-raycluster-tnm5m-head-svc"
  },
  "lastUpdateTime": "2024-07-25T04:25:54Z",
  "maxWorkerReplicas": 5,
  "minWorkerReplicas": 1,
  "observedGeneration": 1
}


replicaHeadReadyCondition := metav1.Condition{
	Type: string(rayv1.HeadReady),
Member:

Maybe we can rename it from HeadReady to HeadPodReady, because the upstream PodCondition uses PodReady instead of Ready for the condition type.

Contributor Author:

Updated to HeadPodReady with Ready as the value:

HeadPodReady RayClusterConditionType = "Ready"

@kevin85421 (Member):

#2266 has been merged

@@ -1203,6 +1203,16 @@ func (r *RayClusterReconciler) calculateStatus(ctx context.Context, instance *ra
	newInstance.Status.State = rayv1.Ready
}

// Check if the head node is running and ready by checking the head pod's status.
if features.Enabled(features.RayClusterStatusConditions) {
	headPod, err := common.GetRayClusterHeadPod(ctx, r, newInstance)
Member:

It's possible for GetRayClusterHeadPod to return nil, nil when no head Pod exists. We need to handle that case.

Member:

cc @rueian, maybe we can return a non-nil error when no head Pod exists? I found that we always return a non-nil error.

headPod, err := common.GetRayClusterHeadPod(ctx, r, newInstance)
if headPod == nil {
	// return an error
}
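
A minimal sketch of one way to handle the nil-Pod case, assuming the condition should simply report not-ready until a head Pod exists; the "HeadPodNotFound" reason name and the error-returning signature are assumptions for illustration, not the PR's final code:

// Assumes the calculateStatus context plus the usual imports (meta,
// metav1, corev1, rayv1, common, utils packages referenced in this thread).
headPod, err := common.GetRayClusterHeadPod(ctx, r, newInstance)
if err != nil {
	return nil, err
}
if headPod == nil {
	// No head Pod exists yet; surface that through the condition
	// instead of dereferencing a nil Pod.
	meta.SetStatusCondition(&newInstance.Status.Conditions, metav1.Condition{
		Type:    string(rayv1.HeadPodReady),
		Status:  metav1.ConditionFalse,
		Reason:  "HeadPodNotFound", // hypothetical reason name
		Message: "Head pod not found",
	})
} else {
	meta.SetStatusCondition(&newInstance.Status.Conditions, utils.FindPodReadyCondition(headPod, rayv1.HeadPodReady))
}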

}

for _, cond := range pod.Status.Conditions {
	if cond.Type == corev1.PodReady {
Member:

if cond.Type == corev1.PodReady {
	reason := cond.Reason
	if reason == "" {
		// metav1.Condition.Reason requires a non-empty value,
		// so fall back to a non-empty default.
		reason = rayv1.UnknownReason
	}
	replicaPodReadyCondition = metav1.Condition{
		Type:    string(condType),
		Status:  metav1.ConditionStatus(cond.Status),
		Reason:  reason,
		Message: cond.Message,
	}
	break
}

@@ -71,6 +72,35 @@ func IsCreated(pod *corev1.Pod) bool {
return pod.Status.Phase != ""
}

func FindPodReadyCondition(pod *corev1.Pod, condType rayv1.RayClusterConditionType) metav1.Condition {
	replicaPodReadyCondition := metav1.Condition{
Member:

Is it possible for pod.Status.Conditions not to have corev1.PodReady? If so, we will create a condition with an empty Reason.

Contributor Author:

Yes, you are right. I updated the logic here to have an initial condition:

replicaPodReadyCondition := metav1.Condition{
	Type:   string(condType),
	Status: metav1.ConditionFalse,
	Reason: rayv1.UnknownReason,
}

plus this logic:

// Update the reason if it's not empty
if reason != "" {
	replicaPodReadyCondition.Reason = reason
}

to prevent an empty reason from sneaking into the later SetStatusCondition call.
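
Putting the pieces from this thread together, a sketch of what the resulting helper plausibly looks like, assembled from the snippets above and assuming the usual corev1/metav1/rayv1 imports; the PR's final code may differ in detail:

func FindPodReadyCondition(pod *corev1.Pod, condType rayv1.RayClusterConditionType) metav1.Condition {
	// Start from a non-empty default reason in case the Pod has no
	// PodReady condition at all.
	replicaPodReadyCondition := metav1.Condition{
		Type:   string(condType),
		Status: metav1.ConditionFalse,
		Reason: rayv1.UnknownReason,
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type != corev1.PodReady {
			continue
		}
		replicaPodReadyCondition.Status = metav1.ConditionStatus(cond.Status)
		replicaPodReadyCondition.Message = cond.Message

		reason := cond.Reason
		if cond.Status == corev1.ConditionTrue && reason == "" {
			// PodReady leaves Reason empty when the Pod is ready,
			// but metav1.Condition requires a non-empty value.
			reason = rayv1.PodRunningAndReady
		}
		// Update the reason only if it's not empty, so the default
		// UnknownReason is never overwritten by an empty string.
		if reason != "" {
			replicaPodReadyCondition.Reason = reason
		}
		break
	}
	return replicaPodReadyCondition
}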

}

// Update the reason if it's not empty
if reason != "" {
Member:

This seems to be unnecessary?

@cchen777 (Contributor Author) commented Jul 28, 2024:

I think it could happen when cond.Status != corev1.ConditionTrue and cond.Reason, for whatever reason, gives us an empty string. That is pretty rare when we are checking cond.Type == PodReady, but just to be safe I added a check here so it falls back to the default non-empty rayv1.UnknownReason.

@kevin85421 kevin85421 merged commit 5062a8c into ray-project:master Jul 28, 2024
25 checks passed