-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Prometheus] Update prometheus histogram #36537
[Prometheus] Update prometheus histogram #36537
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
Given the discussion and linked tickets in #36317, I am wondering if we should drop all the quantile metrics and just ingest the histogram. While you managed to ingest both now, you get lot of additional and expensive time series without much added value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank's @constanca-m for dealing with this.
I have some comments about the implementation details but the main one is that we will need to ensure the fix with unit tests.
Also since unit-tests might not be enough because of the importance of the this specific code-base we would need to execute some manual testing here once this is merged.
@gizas @tommyers-elastic @bturquet could we mark this one for manual testing somehow? We used to do it in the past and helped a lot. The idea is that some random member of the team (preferably not involved with the PR) would execute some manual tests to ensure that the wide range of modules affected are not broken.
name = name[:len(name)-7] | ||
} | ||
name = strings.TrimSuffix(name, suffixGSum) | ||
case isBucket(name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any unit-tests that cover these changes? It's hard to understand if something might break here in any corner case or whatever?
If not would you mind adding some tests? Otherwise we can exclude these changes from the patch if not needed for the fix. Improving code should always go along with adding tests ensuring that functionality is not broken :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added one more test for this exact function. Do you think that is enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that for the new functionality the added test is fine but what about the existing functionality that is touched here? How we can be sure that the code which is changed (improved) is not broken?
I'm talking mostly about the changes like the one at https://github.com/elastic/beats/pull/36537/files#diff-d66f15900f5676672ddf353458a9373cf8b1ab8e237c473191e4a64ae93cbba2L407-R416.
So the point here is that if we want to improve the code we should ensure that we don't break it. So to my mind we should either:
- ensure that we have tests for these lines that still work
- if we don't have tests then add them along with the corresponding code changes.
- if we cannot add tests maybe it's safer to avoid any code changes that might be prone to errors
Could you verify the above cases please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We never had tests for those cases. Do you mean adding a test case for every type (histogram, histogram gauge etc)? I only changed it to strings.TrimSuffix
because then we know what is happening instead of defining the lenght of the suffix and then subtracting like we were doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean adding a test case for every type (histogram, histogram gauge etc)?
Ideally yes since we touch these specific lines of code. Otherwise the improvement seems to be incomplete to me and potentially prone to errors :).
Maybe that's out of the scope of this PR but the generic idea is what I mention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. So should I try to add new tests for this PR or should we leave it to a different one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be nice since we are on this. Otherwise I doubt it will happen at any time soon.
Another option would be to exclude the specific changes/improvements that are not related to this specific fix and add them later in a follow-up along with tests. That would unblock this one and ensures the quality as well. It's up to you to decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a changelog record
name = name[:len(name)-7] | ||
} | ||
name = strings.TrimSuffix(name, suffixGSum) | ||
case isBucket(name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that for the new functionality the added test is fine but what about the existing functionality that is touched here? How we can be sure that the code which is changed (improved) is not broken?
I'm talking mostly about the changes like the one at https://github.com/elastic/beats/pull/36537/files#diff-d66f15900f5676672ddf353458a9373cf8b1ab8e237c473191e4a64ae93cbba2L407-R416.
So the point here is that if we want to improve the code we should ensure that we don't break it. So to my mind we should either:
- ensure that we have tests for these lines that still work
- if we don't have tests then add them along with the corresponding code changes.
- if we cannot add tests maybe it's safer to avoid any code changes that might be prone to errors
Could you verify the above cases please?
Co-authored-by: Chris Mark <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes lgtm. Let's make sure that we take care of the tests either here or in a follow-up based on #36537 (comment).
Also we should consider of introducing manual testing phase for this and similar PRs that touch "core" functionality (specially when unit testing is not sufficient). @constanca-m could you please pair with @gizas and/or @tommyers-elastic on this?
Yes @ChrsMark , I am also working on adding some tests to the other types in this PR. |
Actually, I will open a new PR because I already found a small error, and I don't want new changes to be added here as they are unrelated to the histogram. |
@constanca-m please link here the additional PRs for the tests and the additional fixes |
The PR with the tests is this one: #36669. |
* Update prometheus histogram. * Update tests. * Fix error. * Revert change on expected files. * Add missing comments. * Add unit test. * Add unit test. * Run mage check. * Update changelog. * Add comment. * Update CHANGELOG-developer.next.asciidoc Co-authored-by: Chris Mark <[email protected]> --------- Co-authored-by: Chris Mark <[email protected]>
Important: All documents from the tests commited were the result of running
go test -data
, since the tests were causing this to fail.What
The issue is explained in #36317.
To fix it, we just need to change the function
histogramMetricName
to have support for the correct suffixes:_count
_sum
_bucket
_created
- falls in default and is later ignored._gcount
_gsum
The problem was occurring because we had no distinction between
*_bucket
metrics and all the others, so they were being treated as the same.How to test this PR locally
Related issues
Closes #36317.
Results
I am running a spring application with the following metrics:
If we go to Discover, we can check that the documents are all there now, since the histogram is correct: