-
Notifications
You must be signed in to change notification settings - Fork 475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NO-JIRA: Add collection mode EP for must-gather #1718
base: master
Are you sure you want to change the base?
Conversation
@ardaguclu: This pull request explicitly references no jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
bbed88e
to
012fd00
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening this. Adding some thoughts.
Since there are multiple must-gather images and none of them (apart from the default must-gather in here) does not | ||
adopt this flag, `--collection-mode` will be marked as hidden. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why marking it hidden? We can instead document that the behavior is dependent on the must-gather image.
This way, each must-gather image author (not only other products but even users making their custom must-gather image) can leverage the collection mode concept and give it whatever meaning they like.
In this scenario, documenting the meaning of each collection mode would be the responsibility of each must-gather image author. The command would just tell "it depends on each must-gather image" and that's it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought about this: If we hide this flag, we will have to tell users to use it each time needed. If it exists, they may try it even without asking us.
Maybe we have some confused customer asking like "why doesn't the collection mode make a difference in ${WHATEVER_COMPONENT}
must-gather image? If we just document that it depends on the concrete image, we can easily explain them "because that image hasn't implemented it yet". It would even help on the flag adoption, because the customer may then request us RFE to have that flag in the image where it is missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, reading below: I'd agree to have the flag hidden if it is technology preview. But I'd definitely advocate on making it visible in GA, regardless of adoption, document that it depends on the image and use the presence of the tag to motivate others to implement it in their images.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, it is practical to align with the following when introducing a new flag for oc adm must-gather
that needs their MG image counter parts:
- DP: hidden + setting a new pod env that "can" control behavior of MG images + adopting the new env in the default MG
- TP: visible but with experimental note + asking MG owners for broader adoption
- GA: visible + no experimental note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fully agree with that approach. What I wanted to avoid is the flag to remain hidden on GA. But as long as that doesn't happen, then perfect for me.
Based on the adoption rate of the `COLLECTION_MODE` environment variable by other must-gather images, we can decide again | ||
marking the flag in must-gather as visible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wait? We won't encourage other must-gather image authors to use the flag if hidden. And if what we want is to avoid confusion, we can just explain that the collection modes depend on the concrete image user as suggested above.
Thinking wider: We need our users to be ready for this flag to not always work anyway. We may face situations like customer with modern oc using oc adm must-gather --image whatever
on an older component whose must-gather image was not updated for the collection profiles.
* oc that is used to invoke must-gather is new, must-gather script is new, oc in must-gather script is old; since the old oc in must-gather does not know `--node-selector`, it will fail | ||
* oc that is used to invoke must-gather is new, must-gather script is old, oc in must-gather script is new; since must-gather script does not pass `--node-selector`, no issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neither of this situations can happen and if they did, it would be a bug:
In OCP, we deliver full release payloads and the must-gather image is one of the images within that payload. So we can either have:
- A must-gather image with an
oc
that understands node-selector in inspect and thegather
script that leverages that for collection profiles - A must-gather image with an older
oc
that doesn't have node-selector in inspect but also doesn't have agather
script trying to use that.
If we had an incompatible combination of oc and gather script in a must-gather image, that would be a bug in that image, because we should have never built such an image. So I don't think we need to document this in the strategy, maybe just somewhere else warn that we need to be careful with this while doing backports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same reasoning holds for --since
and --since-time
flags. They translate into pod envs. It is not assumed any MG image will interpret it properly. So unless a MG image of a given version already adds support for it all the bug reports will be interpreted as feature requests and forwarded to owning teams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you are saying in this comment is not what is referred in the points I signaled but in the previous point:
* oc that is used to invoke must-gather is new, must-gather script is old; no issue, environment variable will be ignored
In such cases, as the point already explains, the environment variable is just ignored and it's just fine.
In these two points that I signaled it is suggested is that the same must-gather image has a script that understands the new flags and an oc client which doesn't or vice-versa. That's an inconsistency in the must-gather image.
Either the must-gather image fully adopts the flag, at both its oc client and its script, or it doesn't. Fully and properly adopting the flags is indeed a new feature. Half-adopting the flags as you suggest in either of these points, is a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion. I was not referring to a specific point.
You are right. "must-gather script is old, oc in must-gather script is new;" and "must-gather script is new, oc in must-gather script is old;" are not expected to happen as the must-gather script is bound to a specific ocp version which pulls an oc binary for that ocp version. If that's not the case it's a bug.
012fd00
to
c016f8a
Compare
|
||
### `oc adm must-gather` | ||
|
||
There is a new flag in must-gather, namely `--collection-mode`. This flag's default value is set to `medium` that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is important to define what the current behavior is. Looking a year back at the proposal from now will give "the current behavior" a different meaning. The same for the limited mode. Worth making a survey to identify what resources we currently collect and categorizing them based on their importance and impact on solving customer issues/debugging clusters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the wording should change "the current behavior" to something that makes it clearer that it is the behavior before this feature was introduced?
In what regards the survey, I agree and believe it is a good idea. But I guess that the impact on solving issues shouldn't be the only driver, but other factors like potential collection size may matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for identifying ranges of sizes of collected data. Some scripts are collecting more or less constant or linear sized data wrt. the number of nodes. Other scripts will behave differently. Anything that's constant/linear or under a certain threshold (few MBs?) could fall into the limited category?
Worth checking https://github.com/openshift/must-gather/tree/master/collection-scripts to see what we actually collect these days. We still don't keep any estimation on which scripts are the biggest contributors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ingvagabund bear in mind that not all those scripts are gathered by default. A good example is the gather_audit_logs
(which used to be collected by default in early 4.y versions but not currently, due to huge size of audit logs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is important to define what the current behavior is. Looking a year back at the proposal from now will give "the current behavior" a different meaning.
@ingvagabund I tried to elaborate what actually current behavior means. Let me know what you think.
Worth making a survey to identify what resources we currently collect and categorizing them based on their importance and impact on solving customer issues/debugging clusters.
That is good idea. What about doing this after setting the foundations around collection-mode in the code. I agree that it would make sense to extend limited mode after the results of the survey.
909cdf0
to
03cc95d
Compare
@ardaguclu: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This PR adds a new proposal for must-gather and inspect commands to work on large clusters functional.