-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[receiver/prometheusreceiver] implement append native histogram #28663
[receiver/prometheusreceiver] implement append native histogram #28663
Conversation
b106d14
to
125f22c
Compare
Seems like I need to create a new metricFamily that has type MetricTypeExponentialHistogram . The only issue seems like that Prometheus and Grafana Agent supports scraping classic and native (exponential) histogram version of the same metric, making the key (which is the normalized metric name) the same for the classic and native histogram. So I'm not totally sure what to do here. Maybe reuse the metric family, but make it possible to have a metricGroup inside that has the exponential histogram. This would be possible to route for prometheus remote write exporter, but I doubt it would work for otel. Or MVP: do not enable scraping classic histograms when there is native histogram. Will not cause an issue for OTEL. Only people that would get bit by this is someone trying to switch from prometheus to otel but using classic + native histograms at the same time (i.e. migrating). |
Could we add config to control whether we keep the native vs the fixed bucket histogram? Is there any existing configuration in the prometheus server that controls whether native histograms are enabled or not? |
prometheus will start scraping native histograms only if you enable the feature. And you can ask it to also scrape classic version of the native histograms (since the metric names don't conflict) see |
Ideally we would only scrape native histograms if the feature flag is enabled, and only scrape classic histograms if scrape_classic_histograms is set. |
125f22c
to
4ea6051
Compare
point.SetScale(fh.Schema) | ||
point.SetCount(uint64(fh.Count)) | ||
point.SetSum(fh.Sum) | ||
point.SetZeroCount(uint64(fh.ZeroCount)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no point.SetZeroThreshold function :(
07df0c3
to
4b7420f
Compare
Rebased onto #29153 and added first e2e test. |
@dashpole @bogdandrutu I mean it's not possible to 100% tell from a histogram if it was restarted between two scrapes. A possible bug in the current code is checking the |
IIUC, the sum should not be reported if it includes negative observations (at least for "standard" histograms, since sum is a counter). https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#histogram. Is sum not required to be monotonic for native histograms?
Can you clarify? Are you talking about buckets being merged? Or about the histogram resetting, but receiving more observations than it previously had since the reset? |
I made a simple test with observing some negative values and checked /metrics:
To quote the documentation : "The sum of observations (showing up as a time series with a _sum suffix) behaves like a counter, too, as long as there are no negative observations. " So there is no guarantee that the sum actually behaves like a counter.
The second case. Let's suppose you have 4 observations in the 0th offset positive bucket and that's the only bucket in use, so your count is 4. For the next scrape , schema doesn't change , and you get 2 observations in the 0th, 1st, 2nd positive buckets, so your overall count is 6, but obviously there was a restart since 0th bucket count went down. See code starting from https://github.com/prometheus/prometheus/blob/1bfb3ed062e99bd3c74e05d9ff9a7fa4e30bbe21/tsdb/chunkenc/histogram.go#L272 . But this is all academic if it's not actually that important to detect these cases :) |
Those cases seem rare in practice, and we've accepted those thus far as unsolvable. I'm OK with not properly handling those. It sounds like using count to detect resets is probably our best bet. IIRC, prom protobuf is going to begin include start timestamps soon, so hopefully these problems are obsoleted over time. |
Understood, the one thing we have to make sure is that the exported/prometheusremotewrite MUST NOT set the the ResetHint to "NO", unless it's absolutely sure that there was no reset (as defined by Prometheus). Currently the hint is left to "UNKOWN" which is fine. So basically it would be possible to set "NO" only with the create timestamp, since the current algorithm above is not compatible with the Prometheus definition of counter reset. cc @Aneurysm9 (as maintainer of the exporter). |
@dashpole this is mostly finished, I want to add a reference to a Prometheus issue we need to create and also test in real life with a couple of thousands of native histogram metrics. |
Would you mind adding a comment in the exporter(s), referencing prometheus' definition of reset to ensure that isn't changed? We could explicitly set it to unknown to make it clear that it is the correct value for us. It can be done separately from this PR. |
…negotiation (#29153) The code needs some basic tests that can be later expanded with tests for native histograms use cases. Changes: Refactored `testComponent` function to be easier to customize the configuration of the scrape. Expanded `compareHistogram` to assert on the explicit boundaries as well. Added function `prometheusMetricFamilyToProtoBuf` to helpers to be able to turn serialize a Prometheus metric family into Protobuf. Added simple test of Protobuf based scrape of counters, gauges, summaries and histograms. #26555 Followup to #27030 Related to #28663 **Testing:** Adding simple e2e test for scraping over Protobuf. **Documentation:** Not applicable. --------- Signed-off-by: György Krajcsovits <[email protected]> Co-authored-by: David Ashpole <[email protected]>
ebbd838
to
7d04cd8
Compare
Signed-off-by: György Krajcsovits <[email protected]>
Signed-off-by: György Krajcsovits <[email protected]>
I see it was merged. I updated the branch locally, but running into an issue with the handling of |
Signed-off-by: György Krajcsovits <[email protected]>
It's a bug in Promtheus. Have to make the fix and get it into otel as well |
Adopt open-telemetry#31908 Signed-off-by: György Krajcsovits <[email protected]>
Test with negative schema as well Signed-off-by: György Krajcsovits <[email protected]>
Depends on prometheus/prometheus#13846 and backport to 0.50.2 at least |
Will be included in 0.51.1 for sure. |
} | ||
|
||
for _, tt := range tests { | ||
tt := tt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this line redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the loop var fix in Go 1.22 ? We require 1.21 right now. Maybe I'm mixing this up, but there's 3 other tests that have this line (I did copy :) ).
I need to look into #31572 (comment) |
After the Prometheus update tests pass again so I think this is ready for review again. I suggest to focus on whether the feature flag is correctly applied. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm after enable_protobuf_negotiation is removed from the README
Signed-off-by: György Krajcsovits <[email protected]>
Thank you, I don't have the right to merge. I'll keep updating a branch regularly until merged. |
Description:
Implement native histogram append MVP.
Very similar to appending a float sample.
Limitations:
Additionally:
Link to tracking Issue:
Fixes: #26555
Testing:
Added unit tests and e2e tests.
Documentation:
TBD: will have to call out protobuf negotiation while no text format. #27030