Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NO-JIRA: Add collection mode EP for must-gather #1718

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ardaguclu
Copy link
Member

This PR adds a new proposal for must-gather and inspect commands to work on large clusters functional.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 22, 2024
@openshift-ci-robot
Copy link

@ardaguclu: This pull request explicitly references no jira issue.

In response to this:

This PR adds a new proposal for must-gather and inspect commands to work on large clusters functional.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

openshift-ci bot commented Nov 22, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sudhaponnaganti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

@palonsoro palonsoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this. Adding some thoughts.

Comment on lines +71 to +72
Since there are multiple must-gather images and none of them (apart from the default must-gather in here) does not
adopt this flag, `--collection-mode` will be marked as hidden.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why marking it hidden? We can instead document that the behavior is dependent on the must-gather image.

This way, each must-gather image author (not only other products but even users making their custom must-gather image) can leverage the collection mode concept and give it whatever meaning they like.

In this scenario, documenting the meaning of each collection mode would be the responsibility of each must-gather image author. The command would just tell "it depends on each must-gather image" and that's it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought about this: If we hide this flag, we will have to tell users to use it each time needed. If it exists, they may try it even without asking us.

Maybe we have some confused customer asking like "why doesn't the collection mode make a difference in ${WHATEVER_COMPONENT} must-gather image? If we just document that it depends on the concrete image, we can easily explain them "because that image hasn't implemented it yet". It would even help on the flag adoption, because the customer may then request us RFE to have that flag in the image where it is missing.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, reading below: I'd agree to have the flag hidden if it is technology preview. But I'd definitely advocate on making it visible in GA, regardless of adoption, document that it depends on the image and use the presence of the tag to motivate others to implement it in their images.

Copy link
Member

@ingvagabund ingvagabund Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, it is practical to align with the following when introducing a new flag for oc adm must-gather that needs their MG image counter parts:

  • DP: hidden + setting a new pod env that "can" control behavior of MG images + adopting the new env in the default MG
  • TP: visible but with experimental note + asking MG owners for broader adoption
  • GA: visible + no experimental note

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully agree with that approach. What I wanted to avoid is the flag to remain hidden on GA. But as long as that doesn't happen, then perfect for me.

enhancements/oc/must-gather-collection-mode.md Outdated Show resolved Hide resolved
Comment on lines +92 to +93
Based on the adoption rate of the `COLLECTION_MODE` environment variable by other must-gather images, we can decide again
marking the flag in must-gather as visible.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wait? We won't encourage other must-gather image authors to use the flag if hidden. And if what we want is to avoid confusion, we can just explain that the collection modes depend on the concrete image user as suggested above.

Thinking wider: We need our users to be ready for this flag to not always work anyway. We may face situations like customer with modern oc using oc adm must-gather --image whatever on an older component whose must-gather image was not updated for the collection profiles.

Comment on lines +179 to +184
* oc that is used to invoke must-gather is new, must-gather script is new, oc in must-gather script is old; since the old oc in must-gather does not know `--node-selector`, it will fail
* oc that is used to invoke must-gather is new, must-gather script is old, oc in must-gather script is new; since must-gather script does not pass `--node-selector`, no issue

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither of this situations can happen and if they did, it would be a bug:

In OCP, we deliver full release payloads and the must-gather image is one of the images within that payload. So we can either have:

  • A must-gather image with an oc that understands node-selector in inspect and the gather script that leverages that for collection profiles
  • A must-gather image with an older oc that doesn't have node-selector in inspect but also doesn't have a gather script trying to use that.

If we had an incompatible combination of oc and gather script in a must-gather image, that would be a bug in that image, because we should have never built such an image. So I don't think we need to document this in the strategy, maybe just somewhere else warn that we need to be careful with this while doing backports.

Copy link
Member

@ingvagabund ingvagabund Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same reasoning holds for --since and --since-time flags. They translate into pod envs. It is not assumed any MG image will interpret it properly. So unless a MG image of a given version already adds support for it all the bug reports will be interpreted as feature requests and forwarded to owning teams.

Copy link

@palonsoro palonsoro Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you are saying in this comment is not what is referred in the points I signaled but in the previous point:

* oc that is used to invoke must-gather is new, must-gather script is old; no issue, environment variable will be ignored

In such cases, as the point already explains, the environment variable is just ignored and it's just fine.

In these two points that I signaled it is suggested is that the same must-gather image has a script that understands the new flags and an oc client which doesn't or vice-versa. That's an inconsistency in the must-gather image.

Either the must-gather image fully adopts the flag, at both its oc client and its script, or it doesn't. Fully and properly adopting the flags is indeed a new feature. Half-adopting the flags as you suggest in either of these points, is a bug.

Copy link
Member

@ingvagabund ingvagabund Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion. I was not referring to a specific point.

You are right. "must-gather script is old, oc in must-gather script is new;" and "must-gather script is new, oc in must-gather script is old;" are not expected to happen as the must-gather script is bound to a specific ocp version which pulls an oc binary for that ocp version. If that's not the case it's a bug.


### `oc adm must-gather`

There is a new flag in must-gather, namely `--collection-mode`. This flag's default value is set to `medium` that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is important to define what the current behavior is. Looking a year back at the proposal from now will give "the current behavior" a different meaning. The same for the limited mode. Worth making a survey to identify what resources we currently collect and categorizing them based on their importance and impact on solving customer issues/debugging clusters.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the wording should change "the current behavior" to something that makes it clearer that it is the behavior before this feature was introduced?

In what regards the survey, I agree and believe it is a good idea. But I guess that the impact on solving issues shouldn't be the only driver, but other factors like potential collection size may matter.

Copy link
Member

@ingvagabund ingvagabund Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for identifying ranges of sizes of collected data. Some scripts are collecting more or less constant or linear sized data wrt. the number of nodes. Other scripts will behave differently. Anything that's constant/linear or under a certain threshold (few MBs?) could fall into the limited category?

Worth checking https://github.com/openshift/must-gather/tree/master/collection-scripts to see what we actually collect these days. We still don't keep any estimation on which scripts are the biggest contributors.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ingvagabund bear in mind that not all those scripts are gathered by default. A good example is the gather_audit_logs (which used to be collected by default in early 4.y versions but not currently, due to huge size of audit logs).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is important to define what the current behavior is. Looking a year back at the proposal from now will give "the current behavior" a different meaning.

@ingvagabund I tried to elaborate what actually current behavior means. Let me know what you think.

Worth making a survey to identify what resources we currently collect and categorizing them based on their importance and impact on solving customer issues/debugging clusters.

That is good idea. What about doing this after setting the foundations around collection-mode in the code. I agree that it would make sense to extend limited mode after the results of the survey.

Copy link
Contributor

openshift-ci bot commented Nov 26, 2024

@ardaguclu: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants