
NO-JIRA: Add collection mode EP for must-gather #1718

**File:** `enhancements/oc/must-gather-collection-mode.md` (211 additions)
---
title: must-gather-collection-mode
authors:
- "@ardaguclu"
reviewers:
- "@ingvagabund"
approvers:
- "@deads2k"
api-approvers:
- "@deads2k"
creation-date: 2024-11-22
last-updated: 2024-11-22
tracking-link:
- https://issues.redhat.com/browse/OCPBUGS-37344
see-also:
- "/enhancements/oc/must-gather.md"
- "/enhancements/oc/inspect.md"
---

# must-gather: Collection Mode

## Summary

This proposal introduces a new flag (`--collection-mode`) in must-gather, targeted at large clusters: it skips some
logs so that collection takes less time and produces smaller archives. In addition, this proposal introduces a new
`--node-selector` flag in the `oc adm inspect` command to collect only the daemonset pods running on nodes matching the given selector.

## Motivation

must-gather, by its nature, aims to collect every log in the cluster on a best-effort basis to provide extensive insights.
However, this comes with a drawback: on large clusters (e.g. clusters with more than 20 nodes), completing a must-gather
run takes excessive time and storage, which eventually hurts usability (or worse, causes collection failure). As maintainers, we are usually
under pressure from two opposite sides: adding more and more logs for a better troubleshooting experience, and cutting logs
for shorter completion time and smaller archives. It is hard to find the optimal balance for the default behavior.

As a result, we need a mitigation plan for large clusters that skips logs marked as less critical
while preserving the default behavior and expectations.

### User Stories

#### Story 1

As a cluster administrator maintaining a cluster of 250 nodes, I want a mechanism to collect logs quickly and efficiently.
I am fine with skipping some logs and collecting them separately, because currently must-gather takes 6 hours and may even end up failing.

#### Story 2

As a cluster administrator maintaining clusters of 250 nodes, I usually need to troubleshoot networking issues
that are directly tied to the daemonset pod logs on worker nodes. Therefore, I would like to collect everything without skipping
any log, accepting the long duration and the risk of collection failure.

### Goals

1. Introduce a new `--collection-mode` flag in must-gather for use on large clusters.
2. Introduce a new `--node-selector` flag in the inspect command to collect daemonset logs only from nodes matching the given selector.

### Non-Goals

1. Changing the default behavior of must-gather and inspect.

## Proposal

### `oc adm must-gather`

There is a new flag in must-gather, namely `--collection-mode`. Its default value is `medium`, which represents
the current behavior [1][2]. Its type is string rather than boolean because, in the future, we may want additional modes
such as `extensive` (collect everything that was previously skipped due to time and size constraints).
Once a user invokes `oc adm must-gather` with `--collection-mode=limited`,
the must-gather command exports the environment variable `COLLECTION_MODE=limited` into its collection pod.

> **Review thread (on the default value):**
>
> **Member:** It is important to define what the current behavior is. Looking at the proposal a year from now will give "the current behavior" a different meaning. The same goes for the limited mode. It is worth making a survey to identify what resources we currently collect and categorizing them based on their importance and impact on solving customer issues/debugging clusters.
>
> **Reply:** Maybe the wording should change "the current behavior" to something that makes it clearer that it is the behavior before this feature was introduced? As for the survey, I agree and believe it is a good idea. But the impact on solving issues shouldn't be the only driver; other factors like potential collection size may matter.
>
> **@ingvagabund (Nov 25, 2024):** +1 for identifying the size ranges of collected data. Some scripts collect roughly constant or linear-sized data with respect to the number of nodes; other scripts behave differently. Anything that is constant/linear or under a certain threshold (a few MBs?) could fall into the limited category? Worth checking https://github.com/openshift/must-gather/tree/master/collection-scripts to see what we actually collect these days. We still don't keep any estimate of which scripts are the biggest contributors.
>
> **Reply:** @ingvagabund bear in mind that not all those scripts are gathered by default. A good example is gather_audit_logs (which used to be collected by default in early 4.y versions but isn't currently, due to the huge size of audit logs).
>
> **Author:** @ingvagabund I tried to elaborate on what the current behavior actually means; let me know what you think. The survey is a good idea. What about doing it after setting the foundations around collection-mode in the code? I agree it would make sense to extend limited mode based on the survey results.

Since there are multiple must-gather images and none of them (apart from the default must-gather) adopts
this flag yet, `--collection-mode` will be marked as hidden.
> **Review thread (on the paragraph above):**
>
> **Comment:** Why mark it hidden? We can instead document that the behavior depends on the must-gather image. This way, each must-gather image author (not only other products but even users making their own custom must-gather image) can leverage the collection mode concept and give it whatever meaning they like. In this scenario, documenting the meaning of each collection mode would be the responsibility of each must-gather image author. The command would just say "it depends on each must-gather image" and that's it.
>
> **Comment:** Another thought about this: if we hide this flag, we will have to tell users to use it each time it is needed. If it is visible, they may try it even without asking us. Maybe a confused customer asks "why doesn't the collection mode make a difference in ${WHATEVER_COMPONENT} must-gather image?" If we just document that it depends on the concrete image, we can easily explain "because that image hasn't implemented it yet". It would even help flag adoption, because the customer may then file an RFE to add the flag to the image where it is missing.
>
> **Comment:** On the other hand, reading below: I'd agree to keep the flag hidden while it is Technology Preview. But I'd definitely advocate making it visible at GA, regardless of adoption: document that it depends on the image and use the presence of the flag to motivate others to implement it in their images.
>
> **@ingvagabund (Nov 25, 2024):** In general, it is practical to align with the following when introducing a new flag for `oc adm must-gather` that needs a counterpart in the MG images:
>
> * DP: hidden + setting a new pod env that "can" control the behavior of MG images + adopting the new env in the default MG
> * TP: visible but with an experimental note + asking MG owners for broader adoption
> * GA: visible + no experimental note
>
> **Reply:** I fully agree with that approach. What I wanted to avoid is the flag remaining hidden at GA. But as long as that doesn't happen, perfect for me.


### `oc adm inspect`

There is a new flag in the inspect command, namely `--node-selector`. If this flag is empty, every log is collected, preserving
the default behavior. Once this flag is set, only the logs of daemonset pods running on nodes matching the given selector are collected;
the rest are ignored. If the ignored logs turn out to be necessary during troubleshooting, the customer can rerun the command
with a different node selector (e.g. `--node-selector='!node-role.kubernetes.io/control-plane'`).
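The intended semantics can be sketched as a small shell predicate. This is purely illustrative: the function name is made up, and the real inspect implementation evaluates Kubernetes label selectors (including negations like `!node-role.kubernetes.io/control-plane`), which this exact-match toy version does not handle.

```shell
# Toy sketch of the --node-selector semantics for daemonset pod logs.
# Exact label-match only; the real implementation uses Kubernetes label
# selectors (including "!" negation), which this sketch does not cover.
should_collect_daemonset_logs() {
  node_selector="$1"  # value passed to --node-selector
  node_labels="$2"    # comma-separated labels of the node running the pod
  # Empty selector: collect everything, preserving the default behavior.
  [ -z "$node_selector" ] && return 0
  # Non-empty selector: collect only if the node carries a matching label.
  case ",$node_labels," in
    *",$node_selector,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

The key design point this illustrates: an unset selector must behave exactly like today's inspect, so the flag is purely opt-in.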

### `must-gather` script

The default must-gather gathering script checks for the `COLLECTION_MODE` environment variable; if it exists
and is set to `limited`, the script passes `--node-selector=node-role.kubernetes.io/control-plane` to every `oc adm inspect` invocation.
This ensures that only the daemonset pod logs from control plane nodes are collected.
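As an illustrative sketch (not the actual collection-scripts code; the function name and paths are hypothetical), the mode check could look like this:

```shell
# Hypothetical sketch of the gather script's mode handling; the function
# name is illustrative, not taken from the real collection scripts.
collection_mode_args() {
  # Emit the extra argument every "oc adm inspect" invocation should receive.
  if [ "${COLLECTION_MODE:-}" = "limited" ]; then
    # Limited mode: restrict daemonset pod log collection to control plane nodes.
    echo "--node-selector=node-role.kubernetes.io/control-plane"
  fi
}

# Example use inside the script (namespace and dest-dir are illustrative):
# oc adm inspect "ns/${namespace}" $(collection_mode_args) --dest-dir="${BASE_COLLECTION_PATH}"
```

Centralizing the mode check in one helper keeps future modes (e.g. an `extensive` mode) a one-place change rather than a per-invocation edit.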

In the future, we can skip more resources based on `COLLECTION_MODE=limited` (or add more resources for different collection modes).
Skipping the daemonset logs running on workers is a reasonable first step for limited mode, as it is the most obvious contributor.

### must-gather Images

Based on the adoption rate of the `COLLECTION_MODE` environment variable by other must-gather images, we can later
revisit making the flag visible in must-gather.

> **Review comment (on the paragraph above):** Why wait? We won't encourage other must-gather image authors to use the flag if it is hidden. And if what we want is to avoid confusion, we can just explain that the collection modes depend on the concrete image, as suggested above. Thinking wider: we need our users to be ready for this flag to not always work anyway. We may face situations like a customer with a modern oc running `oc adm must-gather --image whatever` against an older component whose must-gather image was not updated for the collection profiles.

[1] From the must-gather point of view, the current behavior is represented by [this gathering script](https://github.com/openshift/must-gather/blob/release-4.18/collection-scripts/gather).

[2] From the `oc adm inspect` point of view, the current behavior is collecting [built-in resources](https://github.com/openshift/oc/blob/9e568296a9b2f774e9321825e2b0c314ee9df566/pkg/cli/admin/inspect/namespace.go#L22-L33), cluster operators, CRDs, webhooks, and the namespaces associated via the related-objects chain.

### Workflow Description

There is no change in the default behavior. However, on large clusters this is the typical usage of must-gather:

```shell
oc adm must-gather --collection-mode=limited
```

If it is decided that the logs of daemonsets running on worker nodes are essential after all, they can additionally be collected with, for example:

```shell
oc adm inspect namespace openshift-multus --node-selector='node-role.kubernetes.io/worker'
```

### API Extensions

There is no API related change.

### Topology Considerations

#### Hypershift / Hosted Control Planes

No impact

#### Standalone Clusters

No impact

#### Single-node Deployments or MicroShift

No impact

### Implementation Details/Notes/Constraints

The `--collection-mode` flag will be hidden, which makes it difficult to find and use without prior knowledge.
People may complain about excessive collection duration when, in reality, a mitigation exists that we simply do not expose to them.

### Risks and Mitigations

In _limited_ mode some logs are skipped even though they may be essential. This reduces the usability of limited mode,
because collecting the missing data requires an additional, separate command and extra back and forth.

### Drawbacks

This proposal brings an inevitable, though slight, maintenance burden, as it introduces new flags and environment variables.
We have to skip the collection of some logs due to our constraints, and the drawback is that logs which are non-essential
for some clusters can be very essential for others. Because this proposal skips some logs, limited mode may
end up unusable for certain clusters.

## Open Questions [optional]

None

## Test Plan

Our CI already has several checks exercising must-gather's default behavior. These ensure that this change does not
break it.

## Graduation Criteria

### Dev Preview -> Tech Preview

In this case, we can merge Dev Preview and Tech Preview and ship the following:

* `--collection-mode` flag will be hidden in must-gather, but it is triggerable if you know the flag
* `--node-selector` flag will be visible but marked as experimental

### Tech Preview -> GA

* After the `--collection-mode` flag is adopted by other must-gather images, we can mark it as visible for general use
* `--node-selector` will be marked as GA by removing the experimental note

### Removing a deprecated feature

None

## Upgrade / Downgrade Strategy

None

## Version Skew Strategy

* The oc used to invoke must-gather is old, regardless of whether must-gather is new or old: no issue, as the environment variable will not be set.
* The oc used to invoke must-gather is new, the must-gather script is old: no issue; the environment variable is ignored.
* The oc used to invoke must-gather is new, the must-gather script is new, the oc inside the must-gather image is old: the old oc does not know `--node-selector`, so it fails.
* The oc used to invoke must-gather is new, the must-gather script is old, the oc inside the must-gather image is new: no issue, since the script does not pass `--node-selector`.
> **Review thread (on the third and fourth cases above):**
>
> **Comment:** Neither of these situations can happen, and if they did, it would be a bug. In OCP, we deliver full release payloads, and the must-gather image is one of the images within that payload. So we can either have:
>
> * a must-gather image with an oc that understands `--node-selector` in inspect and a gather script that leverages it for collection profiles, or
> * a must-gather image with an older oc that doesn't have `--node-selector` in inspect, but also doesn't have a gather script trying to use it.
>
> If we had an incompatible combination of oc and gather script in a must-gather image, that would be a bug in that image, because we should never have built such an image. So I don't think we need to document this in the strategy; maybe just warn somewhere else that we need to be careful with this while doing backports.
>
> **@ingvagabund (Nov 25, 2024):** The same reasoning holds for the `--since` and `--since-time` flags. They translate into pod envs. It is not assumed any MG image will interpret them properly. So unless a MG image of a given version already adds support for them, all bug reports will be interpreted as feature requests and forwarded to the owning teams.
>
> **@palonsoro (Nov 25, 2024):** What you are saying in this comment refers not to the points I flagged but to the previous point:
>
> > oc that is used to invoke must-gather is new, must-gather script is old; no issue, environment variable will be ignored
>
> In such cases, as that point already explains, the environment variable is just ignored and it's fine. The two points I flagged suggest that the same must-gather image has a script that understands the new flags and an oc client that doesn't, or vice versa. That's an inconsistency in the must-gather image. Either the must-gather image fully adopts the flag, in both its oc client and its script, or it doesn't. Fully and properly adopting the flags is indeed a new feature; half-adopting them, as in either of those points, is a bug.
>
> **@ingvagabund (Nov 25, 2024):** Sorry for the confusion. I was not referring to a specific point. You are right: "must-gather script is old, oc in must-gather script is new" and "must-gather script is new, oc in must-gather script is old" are not expected to happen, as the must-gather script is bound to a specific OCP version, which pulls an oc binary for that version. If that's not the case, it's a bug.

Of the cases above, only the third one results in an error.

## Operational Aspects of API Extensions

No API changes

## Support Procedures

If collection fails with an error, the default collection behavior can be used instead.

## Alternatives

Previously, without exposing any flags, we tested the following rule as the default behavior:

```markdown
Do not collect daemonset pods logs running on worker nodes, if the cluster's worker node count is greater than 18
```
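That earlier rule can be expressed as a simple predicate (an illustrative sketch: the function name is made up; only the threshold of 18 comes from the rule above):

```shell
# Sketch of the previously tested default rule: skip worker daemonset pod
# logs when the cluster has more than 18 worker nodes. The function name
# is illustrative only, not from the actual implementation.
should_skip_worker_daemonset_logs() {
  worker_node_count="$1"
  [ "$worker_node_count" -gt 18 ]
}
```

A fixed threshold like this is exactly what was contested in the feedback below, since any single cutoff is arbitrary across differently sized fleets.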

However, people responsible for large clusters did not strongly embrace this rule and pushed for decreasing the worker node threshold to lower values
so that mid-size and small clusters would also be included (because cluster size, measured by node count, is a subjective matter).
Additionally, people responsible for the network or MCO components did not embrace this rule, as they strictly rely on the daemonset logs we skip by default,
and they would now have to ask for an additional command execution.

## Infrastructure Needed [optional]

Not applicable