Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Prometheus] Update prometheus histogram #36537

Merged
merged 14 commits into from
Sep 22, 2023
Merged

[Prometheus] Update prometheus histogram #36537

merged 14 commits into from
Sep 22, 2023

Conversation

constanca-m
Copy link
Contributor

@constanca-m constanca-m commented Sep 8, 2023

Important: All documents from the tests commited were the result of running go test -data, since the tests were causing this to fail.

What

The issue is explained in #36317.

To fix it, we just need to change the function histogramMetricName to have support for the correct suffixes:

  • _count
  • _sum
  • _bucket
  • _created - falls in default and is later ignored.
  • _gcount
  • _gsum
  • everything else is ignored

The problem was occurring because we had no distinction between *_bucket metrics and all the others, so they were being treated as the same.

How to test this PR locally

  1. Clone this branch.
  2. Build metricbeat binary and run it.

Related issues

Closes #36317.

Results

I am running a spring application with the following metrics:
image

If we go to Discover, we can check that the documents are all there now, since the histogram is correct:
image

@constanca-m constanca-m added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Sep 8, 2023
@constanca-m constanca-m requested a review from a team September 8, 2023 09:21
@constanca-m constanca-m requested a review from a team as a code owner September 8, 2023 09:21
@constanca-m constanca-m self-assigned this Sep 8, 2023
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Sep 8, 2023
@elasticmachine
Copy link
Collaborator

elasticmachine commented Sep 8, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-09-22T09:03:34.776+0000

  • Duration: 84 min 11 sec

Test stats 🧪

Test Results
Failed 0
Passed 4956
Skipped 1051
Total 6007

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Copy link
Contributor

mergify bot commented Sep 8, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @constanca-m? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@constanca-m constanca-m added the backport-skip Skip notification from the automated backport with mergify label Sep 8, 2023
@constanca-m constanca-m closed this Sep 8, 2023
@constanca-m constanca-m reopened this Sep 8, 2023
@StephanErb
Copy link

Given the discussion and linked tickets in #36317, I am wondering if we should drop all the quantile metrics and just ingest the histogram.

While you managed to ingest both now, you get lot of additional and expensive time series without much added value.

Copy link
Member

@ChrsMark ChrsMark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank's @constanca-m for dealing with this.

I have some comments about the implementation details but the main one is that we will need to ensure the fix with unit tests.

Also since unit-tests might not be enough because of the importance of the this specific code-base we would need to execute some manual testing here once this is merged.
@gizas @tommyers-elastic @bturquet could we mark this one for manual testing somehow? We used to do it in the past and helped a lot. The idea is that some random member of the team (preferably not involved with the PR) would execute some manual tests to ensure that the wide range of modules affected are not broken.

metricbeat/helper/prometheus/textparse.go Show resolved Hide resolved
name = name[:len(name)-7]
}
name = strings.TrimSuffix(name, suffixGSum)
case isBucket(name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any unit-tests that cover these changes? It's hard to understand if something might break here in any corner case or whatever?

If not would you mind adding some tests? Otherwise we can exclude these changes from the patch if not needed for the fix. Improving code should always go along with adding tests ensuring that functionality is not broken :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added one more test for this exact function. Do you think that is enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for the new functionality the added test is fine but what about the existing functionality that is touched here? How we can be sure that the code which is changed (improved) is not broken?
I'm talking mostly about the changes like the one at https://github.com/elastic/beats/pull/36537/files#diff-d66f15900f5676672ddf353458a9373cf8b1ab8e237c473191e4a64ae93cbba2L407-R416.

So the point here is that if we want to improve the code we should ensure that we don't break it. So to my mind we should either:

  1. ensure that we have tests for these lines that still work
  2. if we don't have tests then add them along with the corresponding code changes.
  3. if we cannot add tests maybe it's safer to avoid any code changes that might be prone to errors

Could you verify the above cases please?

Copy link
Contributor Author

@constanca-m constanca-m Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never had tests for those cases. Do you mean adding a test case for every type (histogram, histogram gauge etc)? I only changed it to strings.TrimSuffix because then we know what is happening instead of defining the lenght of the suffix and then subtracting like we were doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean adding a test case for every type (histogram, histogram gauge etc)?

Ideally yes since we touch these specific lines of code. Otherwise the improvement seems to be incomplete to me and potentially prone to errors :).
Maybe that's out of the scope of this PR but the generic idea is what I mention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So should I try to add new tests for this PR or should we leave it to a different one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be nice since we are on this. Otherwise I doubt it will happen at any time soon.
Another option would be to exclude the specific changes/improvements that are not related to this specific fix and add them later in a follow-up along with tests. That would unblock this one and ensures the quality as well. It's up to you to decide.

metricbeat/helper/prometheus/textparse.go Show resolved Hide resolved
metricbeat/helper/prometheus/textparse.go Outdated Show resolved Hide resolved
@ChrsMark ChrsMark added needs tests test-plan Add this PR to be manual test plan labels Sep 21, 2023
Copy link
Contributor

@tetianakravchenko tetianakravchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a changelog record

metricbeat/helper/prometheus/textparse.go Outdated Show resolved Hide resolved
metricbeat/helper/prometheus/textparse.go Show resolved Hide resolved
metricbeat/helper/prometheus/textparse.go Show resolved Hide resolved
metricbeat/helper/prometheus/textparse.go Show resolved Hide resolved
name = name[:len(name)-7]
}
name = strings.TrimSuffix(name, suffixGSum)
case isBucket(name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for the new functionality the added test is fine but what about the existing functionality that is touched here? How we can be sure that the code which is changed (improved) is not broken?
I'm talking mostly about the changes like the one at https://github.com/elastic/beats/pull/36537/files#diff-d66f15900f5676672ddf353458a9373cf8b1ab8e237c473191e4a64ae93cbba2L407-R416.

So the point here is that if we want to improve the code we should ensure that we don't break it. So to my mind we should either:

  1. ensure that we have tests for these lines that still work
  2. if we don't have tests then add them along with the corresponding code changes.
  3. if we cannot add tests maybe it's safer to avoid any code changes that might be prone to errors

Could you verify the above cases please?

metricbeat/helper/prometheus/textparse.go Show resolved Hide resolved
Copy link
Member

@ChrsMark ChrsMark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes lgtm. Let's make sure that we take care of the tests either here or in a follow-up based on #36537 (comment).

Also we should consider of introducing manual testing phase for this and similar PRs that touch "core" functionality (specially when unit testing is not sufficient). @constanca-m could you please pair with @gizas and/or @tommyers-elastic on this?

@constanca-m
Copy link
Contributor Author

Yes @ChrsMark , I am also working on adding some tests to the other types in this PR.

@constanca-m
Copy link
Contributor Author

Actually, I will open a new PR because I already found a small error, and I don't want new changes to be added here as they are unrelated to the histogram.

@constanca-m constanca-m merged commit 23246fe into elastic:main Sep 22, 2023
6 of 8 checks passed
@constanca-m constanca-m deleted the fix-prometheus-histogram branch September 22, 2023 14:13
@gizas
Copy link
Contributor

gizas commented Sep 25, 2023

@constanca-m please link here the additional PRs for the tests and the additional fixes

@constanca-m
Copy link
Contributor Author

The PR with the tests is this one: #36669.

Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
* Update prometheus histogram.

* Update tests.

* Fix error.

* Revert change on expected files.

* Add missing comments.

* Add unit test.

* Add unit test.

* Run mage check.

* Update changelog.

* Add comment.

* Update CHANGELOG-developer.next.asciidoc

Co-authored-by: Chris Mark <[email protected]>

---------

Co-authored-by: Chris Mark <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify needs tests Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team Team:Elastic-Agent Label for the Agent team test-plan Add this PR to be manual test plan
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Prometheus histogram] Histograms are being computed using wrong metrics
8 participants