NO-JIRA: Add collection mode EP for must-gather #1718

ardaguclu · 2024-11-22T08:38:34Z

This PR adds a new proposal for must-gather and inspect commands to work on large clusters functional.

openshift-ci-robot · 2024-11-22T08:38:38Z

@ardaguclu: This pull request explicitly references no jira issue.

In response to this:

This PR adds a new proposal for must-gather and inspect commands to work on large clusters functional.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2024-11-22T08:51:29Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sudhaponnaganti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

palonsoro

Thanks for opening this. Adding some thoughts.

palonsoro · 2024-11-22T09:45:32Z

enhancements/oc/must-gather-collection-mode.md

+Since there are multiple must-gather images and none of them (apart from the default must-gather in here) does not
+adopt this flag, `--collection-mode` will be marked as hidden.


Why marking it hidden? We can instead document that the behavior is dependent on the must-gather image.

This way, each must-gather image author (not only other products but even users making their custom must-gather image) can leverage the collection mode concept and give it whatever meaning they like.

In this scenario, documenting the meaning of each collection mode would be the responsibility of each must-gather image author. The command would just tell "it depends on each must-gather image" and that's it.

Another thought about this: If we hide this flag, we will have to tell users to use it each time needed. If it exists, they may try it even without asking us.

Maybe we have some confused customer asking like "why doesn't the collection mode make a difference in ${WHATEVER_COMPONENT} must-gather image? If we just document that it depends on the concrete image, we can easily explain them "because that image hasn't implemented it yet". It would even help on the flag adoption, because the customer may then request us RFE to have that flag in the image where it is missing.

On the other hand, reading below: I'd agree to have the flag hidden if it is technology preview. But I'd definitely advocate on making it visible in GA, regardless of adoption, document that it depends on the image and use the presence of the tag to motivate others to implement it in their images.

In general, it is practical to align with the following when introducing a new flag for oc adm must-gather that needs their MG image counter parts:

DP: hidden + setting a new pod env that "can" control behavior of MG images + adopting the new env in the default MG

TP: visible but with experimental note + asking MG owners for broader adoption

GA: visible + no experimental note

I fully agree with that approach. What I wanted to avoid is the flag to remain hidden on GA. But as long as that doesn't happen, then perfect for me.

enhancements/oc/must-gather-collection-mode.md

palonsoro · 2024-11-22T09:49:54Z

enhancements/oc/must-gather-collection-mode.md

+Based on the adoption rate of the `COLLECTION_MODE` environment variable by other must-gather images, we can decide again
+marking the flag in must-gather as visible.


Why wait? We won't encourage other must-gather image authors to use the flag if hidden. And if what we want is to avoid confusion, we can just explain that the collection modes depend on the concrete image user as suggested above.

Thinking wider: We need our users to be ready for this flag to not always work anyway. We may face situations like customer with modern oc using oc adm must-gather --image whatever on an older component whose must-gather image was not updated for the collection profiles.

palonsoro · 2024-11-22T10:00:37Z

enhancements/oc/must-gather-collection-mode.md

+* oc that is used to invoke must-gather is new, must-gather script is new, oc in must-gather script is old; since the old oc in must-gather does not know `--node-selector`, it will fail
+* oc that is used to invoke must-gather is new, must-gather script is old, oc in must-gather script is new; since must-gather script does not pass `--node-selector`, no issue


Neither of this situations can happen and if they did, it would be a bug:

In OCP, we deliver full release payloads and the must-gather image is one of the images within that payload. So we can either have:

A must-gather image with an oc that understands node-selector in inspect and the gather script that leverages that for collection profiles

A must-gather image with an older oc that doesn't have node-selector in inspect but also doesn't have a gather script trying to use that.

If we had an incompatible combination of oc and gather script in a must-gather image, that would be a bug in that image, because we should have never built such an image. So I don't think we need to document this in the strategy, maybe just somewhere else warn that we need to be careful with this while doing backports.

The same reasoning holds for --since and --since-time flags. They translate into pod envs. It is not assumed any MG image will interpret it properly. So unless a MG image of a given version already adds support for it all the bug reports will be interpreted as feature requests and forwarded to owning teams.

What you are saying in this comment is not what is referred in the points I signaled but in the previous point:

* oc that is used to invoke must-gather is new, must-gather script is old; no issue, environment variable will be ignored

In such cases, as the point already explains, the environment variable is just ignored and it's just fine.

In these two points that I signaled it is suggested is that the same must-gather image has a script that understands the new flags and an oc client which doesn't or vice-versa. That's an inconsistency in the must-gather image.

Either the must-gather image fully adopts the flag, at both its oc client and its script, or it doesn't. Fully and properly adopting the flags is indeed a new feature. Half-adopting the flags as you suggest in either of these points, is a bug.

Sorry for the confusion. I was not referring to a specific point.

You are right. "must-gather script is old, oc in must-gather script is new;" and "must-gather script is new, oc in must-gather script is old;" are not expected to happen as the must-gather script is bound to a specific ocp version which pulls an oc binary for that ocp version. If that's not the case it's a bug.

ingvagabund · 2024-11-25T13:45:42Z

enhancements/oc/must-gather-collection-mode.md

+
+### `oc adm must-gather`
+
+There is a new flag in must-gather, namely `--collection-mode`. This flag's default value is set to `medium` that


It is important to define what the current behavior is. Looking a year back at the proposal from now will give "the current behavior" a different meaning. The same for the limited mode. Worth making a survey to identify what resources we currently collect and categorizing them based on their importance and impact on solving customer issues/debugging clusters.

Maybe the wording should change "the current behavior" to something that makes it clearer that it is the behavior before this feature was introduced?

In what regards the survey, I agree and believe it is a good idea. But I guess that the impact on solving issues shouldn't be the only driver, but other factors like potential collection size may matter.

+1 for identifying ranges of sizes of collected data. Some scripts are collecting more or less constant or linear sized data wrt. the number of nodes. Other scripts will behave differently. Anything that's constant/linear or under a certain threshold (few MBs?) could fall into the limited category?

Worth checking https://github.com/openshift/must-gather/tree/master/collection-scripts to see what we actually collect these days. We still don't keep any estimation on which scripts are the biggest contributors.

@ingvagabund bear in mind that not all those scripts are gathered by default. A good example is the gather_audit_logs (which used to be collected by default in early 4.y versions but not currently, due to huge size of audit logs).

It is important to define what the current behavior is. Looking a year back at the proposal from now will give "the current behavior" a different meaning.

@ingvagabund I tried to elaborate what actually current behavior means. Let me know what you think.

Worth making a survey to identify what resources we currently collect and categorizing them based on their importance and impact on solving customer issues/debugging clusters.

That is good idea. What about doing this after setting the foundations around collection-mode in the code. I agree that it would make sense to extend limited mode after the results of the survey.

openshift-ci · 2024-11-26T08:56:19Z

@ardaguclu: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 22, 2024

ardaguclu force-pushed the must-gather-collection-mode branch from bbed88e to 012fd00 Compare November 22, 2024 08:51

palonsoro reviewed Nov 22, 2024

View reviewed changes

ardaguclu force-pushed the must-gather-collection-mode branch from 012fd00 to c016f8a Compare November 22, 2024 10:09

ingvagabund reviewed Nov 25, 2024

View reviewed changes

Add collection mode EP for must-gather

03cc95d

ardaguclu force-pushed the must-gather-collection-mode branch from 909cdf0 to 03cc95d Compare November 26, 2024 08:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NO-JIRA: Add collection mode EP for must-gather #1718

NO-JIRA: Add collection mode EP for must-gather #1718

ardaguclu commented Nov 22, 2024

openshift-ci-robot commented Nov 22, 2024

openshift-ci bot commented Nov 22, 2024

palonsoro left a comment

palonsoro Nov 22, 2024

palonsoro Nov 22, 2024

palonsoro Nov 22, 2024

ingvagabund Nov 25, 2024 •

edited

Loading

palonsoro Nov 25, 2024

palonsoro Nov 22, 2024

palonsoro Nov 22, 2024

ingvagabund Nov 25, 2024 •

edited

Loading

palonsoro Nov 25, 2024 •

edited

Loading

ingvagabund Nov 25, 2024 •

edited

Loading

ingvagabund Nov 25, 2024

palonsoro Nov 25, 2024

ingvagabund Nov 25, 2024 •

edited

Loading

palonsoro Nov 25, 2024

ardaguclu Nov 26, 2024

openshift-ci bot commented Nov 26, 2024

		Since there are multiple must-gather images and none of them (apart from the default must-gather in here) does not
		adopt this flag, `--collection-mode` will be marked as hidden.

		Based on the adoption rate of the `COLLECTION_MODE` environment variable by other must-gather images, we can decide again
		marking the flag in must-gather as visible.

		* oc that is used to invoke must-gather is new, must-gather script is new, oc in must-gather script is old; since the old oc in must-gather does not know `--node-selector`, it will fail
		* oc that is used to invoke must-gather is new, must-gather script is old, oc in must-gather script is new; since must-gather script does not pass `--node-selector`, no issue


		### `oc adm must-gather`

		There is a new flag in must-gather, namely `--collection-mode`. This flag's default value is set to `medium` that

NO-JIRA: Add collection mode EP for must-gather #1718

Are you sure you want to change the base?

NO-JIRA: Add collection mode EP for must-gather #1718

Conversation

ardaguclu commented Nov 22, 2024

openshift-ci-robot commented Nov 22, 2024

openshift-ci bot commented Nov 22, 2024

palonsoro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ingvagabund Nov 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ingvagabund Nov 25, 2024 • edited Loading

Choose a reason for hiding this comment

palonsoro Nov 25, 2024 • edited Loading

Choose a reason for hiding this comment

ingvagabund Nov 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ingvagabund Nov 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

openshift-ci bot commented Nov 26, 2024

ingvagabund Nov 25, 2024 •

edited

Loading

ingvagabund Nov 25, 2024 •

edited

Loading

palonsoro Nov 25, 2024 •

edited

Loading

ingvagabund Nov 25, 2024 •

edited

Loading

ingvagabund Nov 25, 2024 •

edited

Loading