Add experimental reconcile/rest API metric sanity test to E2E Tests #84

Merged
merged 5 commits into from
Nov 19, 2024

jgwest
Member

@jgwest jgwest commented Aug 16, 2024

What does this PR do / why we need it:

  • This PR adds logic to sanity test the behaviour of the operator during the E2E tests:
    • We check the (Prometheus) metrics exposed by the operator's 'localhost:8080/metrics' endpoint.
    • For example, if the reported number of Reconcile calls is unusually high, this might mean that the operator is stuck in a Reconcile loop.
    • Or, if the number of REST client POST requests (e.g. k8s object creation) is equal to the number of PUT requests (e.g. k8s object update), this may imply we are updating the .status or .spec of an object too frequently.
  • After this was enabled, I noticed we were reconciling too often, and I discovered that in many cases we were incorrectly treating objects where nothing had changed as having changed.
    • This was due to known issues with comparing maps using reflect.DeepEqual, so I switched to a substitute function for this case.
    • I also noticed we were stripping out rbac.authorization.k8s.io/* as a user-defined label, when it is in fact not user-defined, so I added it to the ignore list.
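
As an illustration of the metric check described above, here is a minimal sketch (not the script from this PR) of summing counters out of Prometheus text-format output. The function name `sum_metric` and the sample data are hypothetical. The metric names `rest_client_requests_total` and `controller_runtime_reconcile_total` are assumed to be the ones the operator exposes through controller-runtime, and the controller name `argocd` is invented. In a real run the text would presumably come from something like `curl -s localhost:8080/metrics` rather than an inline sample.

```shell
#!/usr/bin/env bash

# Sum every sample of one Prometheus metric, optionally keeping only the
# samples whose line contains a given label substring.
# Usage: sum_metric <metric_name> <label_filter>   (metrics text on stdin)
sum_metric() {
  local name="$1" filter="$2"
  awk -v name="$name" -v filter="$filter" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      metric = $0
      sub(/[{ ].*$/, "", metric)       # metric name ends at "{" or the first space
      if (metric != name) next
      if (filter != "" && index($0, filter) == 0) next
      total += $NF                     # the sample value is the last field
    }
    END { printf "%d\n", total }
  '
}

# Hypothetical sample of what the endpoint might return (values are invented).
SAMPLE_METRICS='# HELP rest_client_requests_total Number of HTTP requests
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="201",host="10.0.0.1:443",method="POST"} 10
rest_client_requests_total{code="200",host="10.0.0.1:443",method="PUT"} 3
rest_client_requests_total{code="409",host="10.0.0.1:443",method="PUT"} 1
controller_runtime_reconcile_total{controller="argocd",result="success"} 25
controller_runtime_reconcile_total{controller="argocd",result="error"} 0'

POST_REQUESTS=$(printf '%s\n' "$SAMPLE_METRICS" | sum_metric rest_client_requests_total 'method="POST"')
PUT_REQUESTS=$(printf '%s\n' "$SAMPLE_METRICS" | sum_metric rest_client_requests_total 'method="PUT"')
RECONCILES=$(printf '%s\n' "$SAMPLE_METRICS" | sum_metric controller_runtime_reconcile_total '')

echo "POST=$POST_REQUESTS PUT=$PUT_REQUESTS reconciles=$RECONCILES"
```

Taking such a snapshot before and after the test run and subtracting the two would give per-run deltas, so counts from operator startup do not skew the result.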

Have you updated the necessary documentation?

  • Documentation update is required by this PR, and has been updated.

Which issue(s) this PR fixes:
N/A

@jgwest jgwest force-pushed the add-metric-sanity-test-april-2024 branch 3 times, most recently from 16cd12a to de91d75 Compare October 2, 2024 14:43
@jgwest jgwest force-pushed the add-metric-sanity-test-april-2024 branch from de91d75 to f399ca6 Compare October 31, 2024 07:53
# - If the number is higher, this implies we are updating the .status or .spec fields of resources more than is necessary.
PUT_REQUEST_PERCENT=`expr "$DELTA_PUT_REQUESTS"00 / $DELTA_POST_REQUESTS`

if [[ "`expr $PUT_REQUEST_PERCENT \> 40`" == "1" ]]; then
Collaborator

How are these thresholds calculated? Do we compare the metrics before and after and use them as the baseline?

Member Author

These are just arbitrary thresholds (guesses) about what a 'normal' set of metrics should be:

  • If we find that they are too restrictive, we should increase them.
  • If we find that they are too lax, we should decrease them.

But these numbers aren't scientific by any means; they are just a guess that can serve as a starting point for refinement.
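
To make tuning such a threshold easier, one option is to read it from an environment variable. The following is only a sketch under assumptions, not the code in this PR: the function names and the `MAX_PUT_REQUEST_PERCENT` variable are hypothetical, the default of 40 mirrors the value in the snippet under review, and treating "zero POST requests" as 0% is an assumption made here to avoid dividing by zero.

```shell
#!/usr/bin/env bash

# Print the PUT/POST ratio as an integer percentage (rounded down).
# Assumption: with zero POST requests there is nothing to compare, so report 0.
put_request_percent() {
  local put="$1" post="$2"
  if (( post == 0 )); then
    echo 0
    return
  fi
  echo $(( put * 100 / post ))
}

# Return 0 if the ratio is within the limit and 1 otherwise.
# MAX_PUT_REQUEST_PERCENT is a hypothetical override; 40 is the default.
check_put_ratio() {
  local put="$1" post="$2" max="${MAX_PUT_REQUEST_PERCENT:-40}"
  local percent
  percent=$(put_request_percent "$put" "$post")
  if (( percent > max )); then
    echo "FAIL: PUT requests are ${percent}% of POST requests (limit ${max}%)"
    return 1
  fi
  echo "OK: PUT requests are ${percent}% of POST requests (limit ${max}%)"
}

# Example with invented deltas: 30 PUTs against 100 POSTs passes, 45 does not.
check_put_ratio 30 100
check_put_ratio 45 100 || echo "ratio check failed as expected"
# Loosening the limit for a single call:
MAX_PUT_REQUEST_PERCENT=50 check_put_ratio 45 100
```

If the threshold proves too restrictive or too lax, it can then be adjusted per environment without editing the script.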

Collaborator

@chetan-rns chetan-rns left a comment

LGTM. Thanks @jgwest

@jgwest jgwest merged commit 397793c into argoproj-labs:main Nov 19, 2024
6 checks passed