REP: RayCluster status improvement #54

`reps/2024-07-01-raycluster-crd-status-improvement.md`

# RayCluster Status improvement

Authors: @kevin85421, @rueian, @andrewsykim

## Context

CRD status is useful for observability. There are several GitHub issues and Slack threads about this topic. For example,
* https://github.com/ray-project/kuberay/issues/2182
* https://github.com/ray-project/kuberay/issues/1631
* https://github.com/ray-project/kuberay/pull/1930
* https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1718207636330739
* https://ray-distributed.slack.com/archives/C01DLHZHRBJ/p1718261392154959
* @smit-kiri: “We've run into issues where `RayJob` is stuck in the `Initializing` state when the head / worker pods aren't able to spin up - like CrashLoopBackOff / ImagePullBackOff / not being able to pull a secret / etc”

Currently, the RayCluster state machine is quite messy, and the [states](https://github.com/ray-project/kuberay/blob/8453f9b6c749abf3b72b09e0d41032bbfcf17367/ray-operator/apis/ray/v1/raycluster_types.go#L112-L116) are not well-defined.

For example, when all Pods in the RayCluster are running for the first time, the RayCluster CR’s `Status.State` will transition to `ready` ([code](https://github.com/ray-project/kuberay/blob/8453f9b6c749abf3b72b09e0d41032bbfcf17367/ray-operator/apis/ray/v1/raycluster_types.go#L113)). However, the state will remain in the `ready` state even if we delete the head Pod. KubeRay will then create a new head Pod and restart the worker Pods.

Take `ClusterState rayv1.Failed` ([code](https://github.com/ray-project/kuberay/blob/8453f9b6c749abf3b72b09e0d41032bbfcf17367/ray-operator/apis/ray/v1/raycluster_types.go#L114)) as another example: the RayCluster CR transitions to `rayv1.Failed` in many different scenarios, including when the YAML spec is invalid, when KubeRay fails to connect to the Kubernetes API server, and when it fails to list information from the local informer. However, it doesn’t make sense to mark a RayCluster as `rayv1.Failed` when the KubeRay operator fails to communicate with the Kubernetes API server or fails to list from the local informer.

## Proposal

By design, the `RayCluster CRD` is similar to a group of Kubernetes `ReplicaSets`: both the head group and the worker groups are analogous to `ReplicaSets`. Therefore, we can refer to [appsv1.ReplicaSetStatus](https://pkg.go.dev/k8s.io/api/apps/v1#ReplicaSetStatus).

This doc proposes several changes:

1. Remove `rayv1.Failed` from `Status.State`.
    * As mentioned in the 'Context' section, KubeRay currently updates a RayCluster’s state to `Failed` under many conditions, including when KubeRay fails to connect to the Kubernetes API server or fails to list information from the local informer, neither of which is related to the RayCluster itself.
    * The [ReplicaSetStatus](https://pkg.go.dev/k8s.io/api/apps/v1#ReplicaSetStatus) doesn’t have the concept of `Failed`. Instead, it uses `[]ReplicaSetCondition`. Reference:
        * [ReplicaSetConditionType](https://github.com/kubernetes/api/blob/857a946a225f212b64d42c68a7da0dc44837636f/apps/v1/types.go#L915)
        * [DeploymentConditionType](https://github.com/kubernetes/api/blob/857a946a225f212b64d42c68a7da0dc44837636f/apps/v1/types.go#L532-L542)
        * [Kueue](https://github.com/kubernetes-sigs/kueue)
    * It’s pretty hard to define `Failed` for the RayCluster CRD.
        * For example, if you are using Ray Core to build your Ray applications, a worker Pod failure may cause your Ray jobs to fail. In this case, we may consider the RayCluster as `Failed` when a worker Pod crashes.
        * If users use Ray Train or Ray Data, these Ray AI libraries provide some application-level fault tolerance. To elaborate, if some Ray tasks/actors on the worker Pods fail, these libraries will launch new Ray tasks/actors to recover the job. In this case, we may consider the RayCluster as `Failed` only when the head Pod crashes.
        * If users use Ray Serve with RayService and enable GCS fault tolerance for high availability, the RayCluster should still be able to serve requests even when the head Pod is down. Should we still consider the RayCluster as `Failed` when the head Pod crashes?

2. Introduce `Status.Conditions` to surface errors with the underlying Ray Pods.
    * TODO: We still need to define which types of information we want to surface in the CR status, operator logs, or Kubernetes events.

3. Make suspending a RayCluster an atomic operation.
    * Like RayJob, RayCluster also supports the [suspend](https://github.com/ray-project/kuberay/pull/1711) operation. However, the suspend operation is [atomic](https://github.com/ray-project/kuberay/pull/1798) in RayJob. In RayCluster, if we set suspend to true and then set it back to false before the RayCluster transitions to the state [rayv1.Suspended](https://github.com/ray-project/kuberay/blob/8453f9b6c749abf3b72b09e0d41032bbfcf17367/ray-operator/apis/ray/v1/raycluster_types.go#L115), there might be some side effects.

4. Add a new condition `Status.Conditions[HeadReady]` to indicate whether the Ray head Pod is ready for requests or not. This is also useful for RayJob and RayService, where KubeRay needs to communicate with the Ray head Pod.

5. Redefine `rayv1.Ready`.
    * As mentioned in the “Context” section, when all Pods in the RayCluster are running for the first time, the RayCluster CR’s `Status.State` will transition to `ready` ([code](https://github.com/ray-project/kuberay/blob/8453f9b6c749abf3b72b09e0d41032bbfcf17367/ray-operator/apis/ray/v1/raycluster_types.go#L113)). It is highly possible that the RayCluster will remain in the `ready` state even if some incidents occur, such as manually deleting the head Pod.
    * Here, we redefine `ready` as follows:
        * When all Pods in the RayCluster are running for the first time, the RayCluster CR’s `Status.State` will transition to `ready`.
        * After the RayCluster CR transitions to `ready`, KubeRay only checks whether the Ray head Pod is ready to determine if the RayCluster is `ready`.
        * This new definition is mainly to maintain backward compatibility.
    * TODO
        * I am considering making `ready` a condition in `Status.Conditions` and removing `rayv1.Ready` from `Status.State`. Kubernetes Pods also use [PodCondition](https://pkg.go.dev/k8s.io/api/core/v1#PodConditionType) for this purpose, but this may be a bit aggressive.
        * In the future, we can expose a way for users to self-define the definition of `ready` ([#1631](https://github.com/ray-project/kuberay/issues/1631)).

## Implementation

> **Contributor:** There are some breaking changes, could you talk about why you think it's OK to have those breaking changes?

> **Contributor:** @kevin85421 I don't think we need to make any breaking changes, right? We can deprecate the existing `.status.state` field and point users to the new conditions without deleting `.status.state`.

### Step 1: Introduce Status.Conditions

```go
type RayClusterStatus struct {
    ...
    Conditions []metav1.Condition `json:"conditions,omitempty"`
}

type RayClusterConditionType string

const (
    HeadReady RayClusterConditionType = "HeadReady"
    // Pod failures. See DeploymentReplicaFailure and
    // ReplicaSetReplicaFailure for more details.
    RayClusterReplicaFailure RayClusterConditionType = "ReplicaFailure"
)
```

> **Contributor:** In a follow-up iteration of the conditions, I think a condition like `AllWorkersReady`, indicating all worker replicas are ready, will be useful too. The example that comes to mind is in the RayJob controller, to know when a job should start (assuming we deprecate the `status.state` field). Relying on head readiness won't be enough.

> **Member Author:** How about using `AllRayPodsReady`, which indicates that both the head and workers are ready? If we use `AllWorkersReady`, we need to check both `AllWorkersReady` and `HeadReady` to determine whether we can submit the job. If we have `AllRayPodsReady`, we only need to check `AllRayPodsReady`.
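
As an illustration of how the controller might maintain such a condition, here is a minimal sketch using the standard `apimachinery` condition helper. The helper function name and the reason strings are assumptions for illustration only, not part of this proposal.

```go
import (
    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setHeadReadyCondition is a hypothetical helper that records head Pod readiness
// in Status.Conditions instead of mutating Status.State.
func setHeadReadyCondition(status *RayClusterStatus, headReady bool) {
    condition := metav1.Condition{
        Type:    string(HeadReady),
        Status:  metav1.ConditionFalse,
        Reason:  "HeadPodNotReady", // hypothetical reason string
        Message: "Waiting for the Ray head Pod to become ready",
    }
    if headReady {
        condition.Status = metav1.ConditionTrue
        condition.Reason = "HeadPodReady" // hypothetical reason string
        condition.Message = "The Ray head Pod is ready to serve requests"
    }
    // SetStatusCondition only updates LastTransitionTime when the condition's Status changes.
    meta.SetStatusCondition(&status.Conditions, condition)
}
```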

* Future works
    * Add `RayClusterAvailable` or `RayClusterReady` to `RayClusterConditionType` in the future for the RayCluster readiness.
        * Reference: [DeploymentAvailable](https://github.com/kubernetes/api/blob/857a946a225f212b64d42c68a7da0dc44837636f/apps/v1/types.go#L532-L542), [LeaderWorkerSetAvailable](https://github.com/kubernetes-sigs/lws/blob/557dfd8b14b8f94633309f6d7633a4929dcc10c3/api/leaderworkerset/v1/leaderworkerset_types.go#L272)
    * Add `RayClusterK8sFailure` to surface the Kubernetes resource failures that are not Pods.

> **Contributor:** It might actually be worthwhile to introduce a generic condition type like `RayClusterInitializing` that encapsulates all the other resource dependencies that aren't Pods.

> Probably not? Additional conditions for other resource dependencies are good, but a `RayClusterInitializing` may be too vague and could be easily misused. It smells like another `Ready`.

> **Contributor:** `RayClusterK8sFailure` is pretty vague too. Maybe `RayClusterResourceInitialization` is more specific?

> `RayClusterK8sFailure` is indeed vague. I think conditions `RayClusterXXXFailure`, where `XXX` is specific, are better.

> **Member Author:** I am still considering whether to add the field or not. Maybe creating Kubernetes events for the non-Pod failures is enough?


### Step 2: Remove `rayv1.Failed` from `Status.State`.

> **Contributor:** Remove or deprecate?

> **Member Author:** We can leave `Failed ClusterState = "failed"` in `raycluster_types.go`. For `Status.State`, I prefer not to assign `Status.State` to `rayv1.Failed` anymore.

> **Member Author:** I have updated the Google doc.

* Add the information about Pod failures to `RayClusterReplicaFailure` instead.
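
For illustration, here is a hedged sketch of recording a Pod creation failure on the `ReplicaFailure` condition. It is a fragment from a hypothetical reconcile path (`r`, `ctx`, `workerPod`, and `instance` are assumed from the surrounding controller code), and the `FailedCreate` reason is borrowed from ReplicaSet semantics as an assumption, not a settled part of the design.

```go
// Sketch: surface a worker Pod creation failure through Status.Conditions
// instead of setting Status.State to rayv1.Failed.
if err := r.Create(ctx, &workerPod); err != nil {
    meta.SetStatusCondition(&instance.Status.Conditions, metav1.Condition{
        Type:    string(RayClusterReplicaFailure),
        Status:  metav1.ConditionTrue,
        Reason:  "FailedCreate", // assumption: mirrors ReplicaSet's replica-failure reasons
        Message: err.Error(),
    })
    // Return the error so the reconcile request is retried.
    return err
}
```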

### Step 3: Make sure every reconciliation that changes the status goes through `inconsistentRayClusterStatus`, and only call `r.Status().Update(...)` when `inconsistentRayClusterStatus` returns true.

> **Contributor:** I don't think this needs to be its own step; it would have to happen when we introduce the new conditions in step 1, right?

> **Member Author:** Would you mind adding more details? In my plan, step 1 just updates the CRD.


```go
if r.inconsistentRayClusterStatus(oldStatus, newStatus) {
    // This should be the one and only place to call Status().Update(...)
    // in RayCluster controller.
    if err := r.Status().Update(...); err != nil {
        ...
    }
}
```

* Currently, status updates in the RayCluster controller are quite ad hoc. For example, if some errors are thrown, the function `rayClusterReconcile` returns immediately and updates parts of the RayCluster CR status inconsistently. Sometimes it doesn’t update anything; sometimes it updates `Status.State` and/or `Status.Reason`.
* Currently, the RayCluster controller calls `r.Status().Update(...)` in 3 different places without a clear rationale. Ideally, we should only call it in a single place for the whole controller. See the RayJob controller as an example.
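
For reference, here is a simplified, non-authoritative sketch of the comparison such a helper might perform; the actual KubeRay implementation may compare a different set of fields.

```go
import (
    "reflect"

    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// inconsistentRayClusterStatus reports whether newStatus differs from oldStatus in a way
// that warrants calling Status().Update(...). Simplified sketch, not the actual KubeRay code.
func inconsistentRayClusterStatus(oldStatus, newStatus rayv1.RayClusterStatus) bool {
    if oldStatus.State != newStatus.State || oldStatus.Reason != newStatus.Reason {
        return true
    }
    // Comparing the conditions wholesale keeps the sketch short; a real helper may
    // ignore fields such as LastTransitionTime to avoid needless updates.
    return !reflect.DeepEqual(oldStatus.Conditions, newStatus.Conditions)
}
```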

### Step 4: Add `PodName` and `ServiceName` to `HeadInfo`.

```go
type HeadInfo struct {
    ...
    PodName     string `json:"podName,omitempty"`
    ServiceName string `json:"serviceName,omitempty"`
}
```

* `ServiceName` has already been added by [#2089](https://github.com/ray-project/kuberay/pull/2089).
* If we want to retrieve head Pod or head service in RayJob and RayService controllers, we should use `RayCluster.Status.Head.{PodName|ServiceName}` instead of something like [this function](https://github.com/ray-project/kuberay/blob/a43217bd2864961ee188a134807df57fd06e0a77/ray-operator/controllers/ray/rayservice_controller.go#L1223-L1235).
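
A hedged sketch of what that lookup could look like with the controller-runtime client; the helper name is hypothetical, and `PodName` is the field proposed above.

```go
import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"

    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// getHeadPod is a hypothetical helper that fetches the head Pod by the name recorded
// in the RayCluster status instead of listing Pods by label selectors.
func getHeadPod(ctx context.Context, c client.Client, rayCluster *rayv1.RayCluster) (*corev1.Pod, error) {
    headPod := &corev1.Pod{}
    key := client.ObjectKey{
        Namespace: rayCluster.Namespace,
        Name:      rayCluster.Status.Head.PodName,
    }
    if err := c.Get(ctx, key, headPod); err != nil {
        return nil, err
    }
    return headPod, nil
}
```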

### Step 5: Use `Condition[HeadReady]` instead of [isHeadPodRunningAndReady](https://github.com/ray-project/kuberay/blob/a43217bd2864961ee188a134807df57fd06e0a77/ray-operator/controllers/ray/rayservice_controller.go#L1211-L1217).
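
For illustration, the check could be reduced to a condition lookup along these lines; this is a sketch under the assumption that `HeadReady` from Step 1 has been added to the `rayv1` package.

```go
import (
    "k8s.io/apimachinery/pkg/api/meta"

    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// isHeadReady is a hypothetical replacement for isHeadPodRunningAndReady that reads
// the HeadReady condition from the RayCluster status instead of inspecting the Pod.
func isHeadReady(rayCluster *rayv1.RayCluster) bool {
    return meta.IsStatusConditionTrue(rayCluster.Status.Conditions, string(rayv1.HeadReady))
}
```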

### Step 6: Redefine `rayv1.Ready`
* As mentioned in the “Context” section, when all Pods in the RayCluster are running for the first time, the RayCluster CR’s `Status.State` will transition to `ready` ([code](https://github.com/ray-project/kuberay/blob/8453f9b6c749abf3b72b09e0d41032bbfcf17367/ray-operator/apis/ray/v1/raycluster_types.go#L113)). It is highly possible that the RayCluster will remain in the `ready` state even if some incidents occur, such as manually deleting the head Pod.
* Here, we redefine `ready` as follows:
    * When all Pods in the RayCluster are running for the first time, the RayCluster CR’s `Status.State` will transition to `ready`.
    * After the RayCluster CR transitions to `ready`, KubeRay only checks whether the Ray head Pod is ready to determine if the RayCluster is `ready`.
    * This new definition is mainly to maintain backward compatibility.
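
A minimal sketch of this two-phase rule follows; the helper name is hypothetical, and the real check may differ (imports as in the Step 5 sketch).

```go
// isRayClusterReady sketches the backward-compatible readiness rule: all Pods must be
// ready for the first transition to `ready`; afterwards only the head Pod is considered.
// allPodsReady would be computed by the reconciler from the Pods it lists.
func isRayClusterReady(instance *rayv1.RayCluster, allPodsReady bool) bool {
    if instance.Status.State == rayv1.Ready {
        // The cluster has already reached `ready` once; only track the head Pod from now on.
        return meta.IsStatusConditionTrue(instance.Status.Conditions, string(rayv1.HeadReady))
    }
    // First transition: require every Ray Pod to be running and ready.
    return allPodsReady
}
```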

### Step 7: Remove `rayv1.Suspended` from `Status.State`. Make suspending a RayCluster an atomic operation.

```go
type RayClusterStatus struct {
    ...
    Conditions []metav1.Condition `json:"conditions,omitempty"`
}

type RayClusterConditionType string

const (
    RayClusterSuspending RayClusterConditionType = "Suspending"
    RayClusterSuspended  RayClusterConditionType = "Suspended"
)
```

> **Contributor:** Can we make the addition of the new suspending / suspended conditions in step 1 and keep step 7 only about deprecating the old state?

> **Member Author:** I may prefer to add the suspending and suspended conditions when we decide to start to work on making the suspend operation atomic. In my expectation, Step 1 will be in v1.2.0, but Step 7 will be in v1.3.0.

> **Member Author:** If we add the conditions in Step 1, these two conditions may be unused in v1.2.0.

* Introduce `RayClusterSuspending` and `RayClusterSuspended` to `Status.Conditions`. Then, we can refer to [RayJob](https://github.com/ray-project/kuberay/pull/1798) to make the suspend operation atomic.
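
To illustrate the intended flow, here is a rough, non-authoritative sketch of two-phase (atomic) suspension modeled after the RayJob approach; the helper name and the reason strings are assumptions for illustration.

```go
import (
    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// reconcileSuspend is a hypothetical helper sketching atomic suspension: the Suspending
// condition is latched first, so flipping spec.Suspend back to false mid-way cannot
// interrupt the teardown. numRunningPods is the number of Ray Pods still alive.
func reconcileSuspend(instance *rayv1.RayCluster, numRunningPods int) {
    suspending := meta.IsStatusConditionTrue(instance.Status.Conditions, string(rayv1.RayClusterSuspending))
    suspended := meta.IsStatusConditionTrue(instance.Status.Conditions, string(rayv1.RayClusterSuspended))

    if instance.Spec.Suspend != nil && *instance.Spec.Suspend && !suspending && !suspended {
        // Latch the intent to suspend; from here on the controller drains the cluster.
        meta.SetStatusCondition(&instance.Status.Conditions, metav1.Condition{
            Type:   string(rayv1.RayClusterSuspending),
            Status: metav1.ConditionTrue,
            Reason: "UserRequestedSuspend", // hypothetical reason string
        })
        return
    }

    if suspending && numRunningPods == 0 {
        // All Pods are gone: flip Suspending off and mark the cluster as Suspended.
        meta.SetStatusCondition(&instance.Status.Conditions, metav1.Condition{
            Type:   string(rayv1.RayClusterSuspended),
            Status: metav1.ConditionTrue,
            Reason: "AllPodsDeleted", // hypothetical reason string
        })
        meta.SetStatusCondition(&instance.Status.Conditions, metav1.Condition{
            Type:   string(rayv1.RayClusterSuspending),
            Status: metav1.ConditionFalse,
            Reason: "AllPodsDeleted",
        })
    }
}
```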