-
Notifications
You must be signed in to change notification settings - Fork 475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NO-JIRA: Add collection mode EP for must-gather #1718
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,211 @@ | ||
--- | ||
title: must-gather-collection-mode | ||
authors: | ||
- "@ardaguclu" | ||
reviewers: | ||
- "@ingvagabund" | ||
approvers: | ||
- "@deads2k" | ||
api-approvers: | ||
- "@deads2k" | ||
creation-date: 2024-11-22 | ||
last-updated: 2024-11-22 | ||
tracking-link: | ||
- https://issues.redhat.com/browse/OCPBUGS-37344 | ||
see-also: | ||
- "/enhancements/oc/must-gather.md" | ||
- "/enhancements/oc/inspect.md" | ||
--- | ||
|
||
# must-gather: Collection Mode | ||
|
||
## Summary | ||
|
||
This proposal introduces a new flag (i.e. `--collection-mode`) in must-gather targeting to large clusters by skipping some | ||
logs to take less time and storage size. In addition to that, this proposal introduces a new `--node-selector` flag | ||
in oc adm inspect command to only collect the daemonset pods running on the given node selector. | ||
|
||
## Motivation | ||
|
||
must-gather, due to its nature, aims to collect every log in the cluster in best effort to provide extensive insights. | ||
However, this comes with a drawback that on large clusters (e.g. clusters whose node count is greater than 20), completion | ||
of the must-gather takes excessive time and storage which eventually hurts the usability (or worse, collection failure). As maintainers, we are usually | ||
under the pressure of two opposite sides; adding more and more logs for better troubleshooting experience, cutting some logs | ||
for short completion duration and less storage size. It is hard to find the optimum balance in regard to the default behavior. | ||
|
||
As a result, we need to find a mitigation plan for large clusters by skipping some logs that are marked as less critical | ||
meanwhile preserving the default behavior and expectations. | ||
|
||
### User Stories | ||
|
||
#### Story 1 | ||
|
||
As a cluster administrator maintaining 250 nodes of cluster, I want to have a mechanism to collect the logs quickly and efficiently. | ||
Besides, I'm fine skipping some logs and providing them separately. Because currently must-gather takes 6 hours that even may end up with a failure. | ||
|
||
#### Story 2 | ||
|
||
As a cluster administrator maintaining 250 nodes of clusters. I usually need to troubleshoot networking | ||
that are directly associated to the daemonset pod logs running on workers nodes. So that, I would like to collect everything without skipping | ||
any log (accepting the long time and a risk of collection failure). | ||
|
||
### Goals | ||
|
||
1. Introducing a new `--collection-mode` in must-gather that will be used on large clusters. | ||
2. Introducing a new `--node-selector` in inspect command that will be used to collect daemonsets logs only on the given node selector. | ||
|
||
### Non-Goals | ||
|
||
1. Changing the default behavior of must-gather and inspect. | ||
|
||
## Proposal | ||
|
||
### `oc adm must-gather` | ||
|
||
There is a new flag in must-gather, namely `--collection-mode`. This flag's default value is set to `medium` that | ||
represents the current behavior [1][2]. Its type is string rather than boolean because in the future, we may want to have different modes | ||
such as "extensive" (collect everything that was skipped previously due to the time and size constraints, etc.). | ||
Once user invokes the oc adm must-gather command with `--collection-mode=limited`, | ||
must-gather command will export an environment variable `COLLECTION_MODE=limited` into its collection pod. | ||
|
||
Since there are multiple must-gather images and none of them (apart from the default must-gather in here) does not | ||
adopt this flag, `--collection-mode` will be marked as hidden. | ||
Comment on lines
+71
to
+72
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why marking it hidden? We can instead document that the behavior is dependent on the must-gather image. This way, each must-gather image author (not only other products but even users making their custom must-gather image) can leverage the collection mode concept and give it whatever meaning they like. In this scenario, documenting the meaning of each collection mode would be the responsibility of each must-gather image author. The command would just tell "it depends on each must-gather image" and that's it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another thought about this: If we hide this flag, we will have to tell users to use it each time needed. If it exists, they may try it even without asking us. Maybe we have some confused customer asking like "why doesn't the collection mode make a difference in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the other hand, reading below: I'd agree to have the flag hidden if it is technology preview. But I'd definitely advocate on making it visible in GA, regardless of adoption, document that it depends on the image and use the presence of the tag to motivate others to implement it in their images. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In general, it is practical to align with the following when introducing a new flag for
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I fully agree with that approach. What I wanted to avoid is the flag to remain hidden on GA. But as long as that doesn't happen, then perfect for me. |
||
|
||
### `oc adm inspect` | ||
|
||
There is a new flag in inspect command, namely `--node-selector`. If this flag is empty, every log is collected to preserve | ||
the default behavior. Once this flag is set, only the daemonset pod logs whose running on the given node selector will be collected and | ||
the rest will be ignored. If ignored ones are necessary at some point during the troubleshooting, customer can run this command | ||
with different node selectors separately (e.g. `--node-selector=!node-role.kubernetes.io/control-plane`). | ||
|
||
### `must-gather` script | ||
|
||
Default must-gather's gathering script checks the existence of `COLLECTION_MODE` environment variable and if it exists | ||
and is set to `limited`, script passes `--node-selector=node-role.kubernetes.io/control-plane` in every "oc adm inspect" invocation. | ||
This ensures that only the daemonset pod logs running on control plane are collected. | ||
|
||
In the future, we can skip more resources based on this `COLLECTION_MODE=limited` (or add more resources based on different collection modes). | ||
Skipping daemonset logs running on workers can be considered as a first attempt for the limited mode as it seems like the overt one. | ||
|
||
### must-gather Images | ||
|
||
Based on the adoption rate of the `COLLECTION_MODE` environment variable by other must-gather images, we can decide again | ||
marking the flag in must-gather as visible. | ||
Comment on lines
+92
to
+93
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why wait? We won't encourage other must-gather image authors to use the flag if hidden. And if what we want is to avoid confusion, we can just explain that the collection modes depend on the concrete image user as suggested above. Thinking wider: We need our users to be ready for this flag to not always work anyway. We may face situations like customer with modern oc using |
||
|
||
[1] from must-gather point of view, current behavior represents [this gathering script](https://github.com/openshift/must-gather/blob/release-4.18/collection-scripts/gather) | ||
|
||
[2] from oc adm inspect point of view, current behavior represents collecting [built-in resources](https://github.com/openshift/oc/blob/9e568296a9b2f774e9321825e2b0c314ee9df566/pkg/cli/admin/inspect/namespace.go#L22-L33), cluster operators, CRDs, webhooks, namespaces that are associated via related objects chain. | ||
|
||
### Workflow Description | ||
|
||
There is no change in default behavior. However, on large clusters that is the typical usage of must-gather; | ||
|
||
```shell | ||
oc adm must-gather --collection-mode=limited | ||
``` | ||
|
||
If it is decided that the logs of daemonsets running on worker nodes are essential, additionally this command is an example; | ||
|
||
```shell | ||
oc adm inspect namespace openshift-multus --node-selector='node-role.kubernetes.io/worker' | ||
``` | ||
|
||
### API Extensions | ||
|
||
There is no API related change. | ||
|
||
### Topology Considerations | ||
|
||
#### Hypershift / Hosted Control Planes | ||
|
||
No impact | ||
|
||
#### Standalone Clusters | ||
|
||
No impact | ||
|
||
#### Single-node Deployments or MicroShift | ||
|
||
No impact | ||
|
||
### Implementation Details/Notes/Constraints | ||
|
||
`--collection-mode` flag will be invisible and that would be difficult to find and use it without any prior knowledge. | ||
People may complain about the excessive duration but in reality, there is a way we just don't to expose them. | ||
|
||
### Risks and Mitigations | ||
|
||
Some logs are skipped in _limited_ mode, even though they are essential. That causes the less usability of limited mode. | ||
Because it is an additional requirement and back and forth to collect more data with a separate command. | ||
|
||
### Drawbacks | ||
|
||
This brings about an inevitable but slightly maintenance burden as it introduces new flags and environment variables. | ||
We have to skip the collection of some logs due to the constraints we have and the drawback is some logs are not essential | ||
for some clusters but contrarily these logs could be very essential for some clusters. This proposal eliminates | ||
some logs which may end up that limited mode can't be usable for some clusters. | ||
|
||
## Open Questions [optional] | ||
|
||
None | ||
|
||
## Test Plan | ||
|
||
We have several pinpoints in our CI testing the must-gather's default behavior. That can assure that this change does not | ||
break it. | ||
|
||
## Graduation Criteria | ||
|
||
### Dev Preview -> Tech Preview | ||
|
||
In this case, we can merge Dev Preview and Tech Preview and we can have this; | ||
|
||
* `--collection-mode` flag will be hidden in must-gather, but it is triggerable if you know the flag | ||
* `--node-selector` flag will be visible but marked as experimental | ||
|
||
### Tech Preview -> GA | ||
|
||
* After the adoption of `--collection-mode` flag by other must-gather images, we can mark this flag as visible for general use | ||
* `--node-selector` will be marked as GA by removing the experimental | ||
|
||
### Removing a deprecated feature | ||
|
||
None | ||
|
||
## Upgrade / Downgrade Strategy | ||
|
||
None | ||
|
||
## Version Skew Strategy | ||
|
||
* oc that is used to invoke must-gather is old, regardless of must-gather is new or old; no issue as the environment variable will not be set. | ||
* oc that is used to invoke must-gather is new, must-gather script is old; no issue, environment variable will be ignored | ||
* oc that is used to invoke must-gather is new, must-gather script is new, oc in must-gather script is old; since the old oc in must-gather does not know `--node-selector`, it will fail | ||
* oc that is used to invoke must-gather is new, must-gather script is old, oc in must-gather script is new; since must-gather script does not pass `--node-selector`, no issue | ||
Comment on lines
+183
to
+184
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Neither of this situations can happen and if they did, it would be a bug: In OCP, we deliver full release payloads and the must-gather image is one of the images within that payload. So we can either have:
If we had an incompatible combination of oc and gather script in a must-gather image, that would be a bug in that image, because we should have never built such an image. So I don't think we need to document this in the strategy, maybe just somewhere else warn that we need to be careful with this while doing backports. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The same reasoning holds for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What you are saying in this comment is not what is referred in the points I signaled but in the previous point:
In such cases, as the point already explains, the environment variable is just ignored and it's just fine. In these two points that I signaled it is suggested is that the same must-gather image has a script that understands the new flags and an oc client which doesn't or vice-versa. That's an inconsistency in the must-gather image. Either the must-gather image fully adopts the flag, at both its oc client and its script, or it doesn't. Fully and properly adopting the flags is indeed a new feature. Half-adopting the flags as you suggest in either of these points, is a bug. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry for the confusion. I was not referring to a specific point. You are right. "must-gather script is old, oc in must-gather script is new;" and "must-gather script is new, oc in must-gather script is old;" are not expected to happen as the must-gather script is bound to a specific ocp version which pulls an oc binary for that ocp version. If that's not the case it's a bug. |
||
|
||
Only the 3rd case gets an error. | ||
|
||
## Operational Aspects of API Extensions | ||
|
||
No API changes | ||
|
||
## Support Procedures | ||
|
||
Collection fails with an error and default collection behavior can be used. | ||
|
||
## Alternatives | ||
|
||
Previously without exposing any flags, we tested this rule as default behavior; | ||
|
||
```markdown | ||
Do not collect daemonset pods logs running on worker nodes, if the cluster's worker node count is greater than 18 | ||
``` | ||
|
||
However, people whose are responsible from the large clusters do not strongly embrace this rule and push towards decreasing worker node threshold to fewer values | ||
to include the mid/small clusters (because cluster size is subjective topic based on the number of nodes). | ||
Additionally, people whose are responsible from the network or MCO components do not embrace this rule as they strictly rely on the daemonset logs we skip as default and | ||
now they have to ask for an additional command execution. | ||
|
||
## Infrastructure Needed [optional] | ||
|
||
Not applicable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is important to define what the current behavior is. Looking a year back at the proposal from now will give "the current behavior" a different meaning. The same for the limited mode. Worth making a survey to identify what resources we currently collect and categorizing them based on their importance and impact on solving customer issues/debugging clusters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the wording should change "the current behavior" to something that makes it clearer that it is the behavior before this feature was introduced?
In what regards the survey, I agree and believe it is a good idea. But I guess that the impact on solving issues shouldn't be the only driver, but other factors like potential collection size may matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for identifying ranges of sizes of collected data. Some scripts are collecting more or less constant or linear sized data wrt. the number of nodes. Other scripts will behave differently. Anything that's constant/linear or under a certain threshold (few MBs?) could fall into the limited category?
Worth checking https://github.com/openshift/must-gather/tree/master/collection-scripts to see what we actually collect these days. We still don't keep any estimation on which scripts are the biggest contributors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ingvagabund bear in mind that not all those scripts are gathered by default. A good example is the
gather_audit_logs
(which used to be collected by default in early 4.y versions but not currently, due to huge size of audit logs).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ingvagabund I tried to elaborate what actually current behavior means. Let me know what you think.
That is good idea. What about doing this after setting the foundations around collection-mode in the code. I agree that it would make sense to extend limited mode after the results of the survey.