Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACM-12251: New addon status design document #1472

Merged
merged 5 commits into from
Sep 11, 2024

Conversation

thibaultmg
Copy link
Contributor

@thibaultmg thibaultmg commented Jun 7, 2024

The addon status update is currently causing issues such as:

  • Flapping states
  • Inconsistent states

This document aims to specify the expected behavior. It is a target for achieving a consistent management of the spoke addon status being reported to the hub.
It does not describe the current implementation.

Main differences are:

  • Clarification of State Transitions: The document clarifies state transitions with added restrictions. Specifically, transitions between degraded and available states are restricted to avoid flapping.
  • Progressing Status on Metrics-Collector Update: On each metrics-collector update, the status goes back to progressing state.
  • Addition of new condition types for each collector that are aggregated by the status controller.

Copy link

openshift-ci bot commented Jun 7, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@thibaultmg thibaultmg marked this pull request as draft June 12, 2024 11:35
@douglascamata
Copy link
Contributor

douglascamata commented Jun 18, 2024

I'm missing an analogy here to help visualize what's the objective of this new design. Overall, I think we need a solution similar to the Node status conditions. I think this design might be inspired by them, but I'm not sure.

Node's in k8s can have a bunch of conditions with different types:

  • Ready.
  • DiskPressure.
  • MemoryPressure.
  • PIDPressure.
  • NetworkUnavailable.

All these conditions are totally independent and more than one of them can be present at the same time.

Note that these conditions follow the conventions that "abnormal is true". Example: if the "normal" state is to not have disk pressure, we know that the default value has to be DiskPressure=False. We could even decide which polarity we will adopt: positive true is good/healthy, while negative means false is good/healthy.

Considering we must have those conditions and can't have super custom ones (we don't have decision power on the status of the addons), I think the proposal is good.

When aggregating the spokes addon status into the hub's, will we increase a custom "message" that will allow a reader to quickly identify the spoke(s) with issues?

A small note: I don't see a way for the Endpoint Metrics Operator to transition into "Available". That stats only has outgoing transitions (update successful/failure, disabled metrics), but no incoming transitions.

@thibaultmg
Copy link
Contributor Author

thibaultmg commented Jun 19, 2024

Overall, I think we need a solution similar to the Node status conditions. I think this design might be inspired by them, but I'm not sure

No, I did not know about it, but I guess it's good news.

All these conditions are totally independent and more than one of them can be present at the same time.

Yes, this is the spirit of conditions. They can be treated as simple observations. But in our case, IIUC, we are reporting to the hub wether we are in one of the following states: Available, Progressing, Degraded. See addon documentation.
These are mutually exclusive states. Hence the state machine approach. Also, as both the operator and collector are updating the same observations, we need strict transition rules to avoid flapping and inconsistencies.
For the other states, we can treat them as independent observations.

Note that these conditions follow the conventions that "abnormal is true".

I haven't described this part, will do some additions.

When aggregating the spokes addon status into the hub's, will we increase a custom "message" that will allow a reader to quickly identify the spoke(s) with issues?

This is the idea. I didn't want to go too deep into details so that this document remains easy to maintain. But I'll add a word about it.

I don't see a way for the Endpoint Metrics Operator to transition into "Available". That stats only has outgoing transitions (update successful/failure, disabled metrics), but no incoming transitions.

The operator only installs and updates the collectors. It can't state wether it is functioning properly, only the collector can make this "observation".

Given our conversation, I wonder wether we can reduce the condition types to only the 3 standard ones, plus a unique one by collector. Because to me, NotSupported type could just be Degraded, and maybe the same for Disabled? I must check what is reported for the addon when it is in that state.

@douglascamata
Copy link
Contributor

@thibaultmg I'm aware of the specific statuses that the addons should have according to their documentation, that's why I wrote Considering we must have those conditions and can't have super custom ones (we don't have decision power on the status of the addons), I think the proposal is good..

I'm not saying we need extra status conditions like the Node object has. They would definitely be useful though, as they can indicate the problem directly in the addon resource without requiring a deeper investigation. Paired with what I mentioned about the message of the status aggregating the names of the spokes with issues, this could easily tell anyone clusters X, Y, and Z are failing to send metrics, clusters A, B, C are failing to upgrade, clusters Q, W, E are disabled and so on.

@douglascamata
Copy link
Contributor

On a separate note, we have to be careful with the state and their interaction with upgrades. For instance, we could end up with an upgrade blocked because fo the status of the monitoring addons.

@thibaultmg thibaultmg changed the title Add addon status design document ACM-12251: New addon status design document Jun 19, 2024
@thibaultmg
Copy link
Contributor Author

I should have addressed most comments with the latest update.

  • I now focus on reasons for better clarity, avoiding mixing condition types
  • Added a reasons managers section that should clarify responsibilities
  • Simplified state transitions diagram
  • And a few other minor updates

On a separate note, we have to be careful with the state and their interaction with upgrades. For instance, we could end up with an upgrade blocked because fo the status of the monitoring addons.

@douglascamata I agree, let me know if you think of a scenario where this can happen.

@thibaultmg thibaultmg marked this pull request as ready for review June 19, 2024 14:47
Copy link

Signed-off-by: Thibault Mange <[email protected]>
Signed-off-by: Thibault Mange <[email protected]>
Signed-off-by: Thibault Mange <[email protected]>
Signed-off-by: Thibault Mange <[email protected]>
Signed-off-by: Thibault Mange <[email protected]>
Copy link

Copy link
Collaborator

@subbarao-meduri subbarao-meduri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link

openshift-ci bot commented Sep 11, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: subbarao-meduri, thibaultmg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [subbarao-meduri,thibaultmg]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 57cd852 into stolostron:main Sep 11, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants