-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Prometheus histogram] Histograms are being computed using wrong metrics #36317
Comments
suffixSumHey @constanca-m ! Thanks for spotting this. From what I can see the whole sample you provide defines a "histogram" with the However I'm not sure if this is accepted by Openmetrics/Prometheus standard. So I see two open questions here that we need to cover before moving on to any change:
If we verify that we need to cover this case as well I guess we need to adopt our parser accordingly, which should not be a big deal. @gizas @tetianakravchenko any thoughts on this since you spent time recently on the Prometheus module? |
Nice one @constanca-m , detailed description here. So the main question is how prometheus server accepts them and visualises them. Seeing here, I think we hit the case where we have summaries and histograms. Readind the page I see the comment Second question is why |
If we look at the document:
We can see that the metric name is But notice the document and all the metrics under
I did not include this |
I think the best place to handle it is where the actual parsing of the Histogram metric takes place: beats/metricbeat/helper/prometheus/textparse.go Lines 404 to 418 in ff96091
But still I would like to have the information of why we hit this by answering the 3 questions from #36317 (comment). Otherwise it would be a blind hot-fix without solid reasoning about this. Definitely we need to fix this, but we need to be specific on why we hit this so as to have the answer in the future as well. |
Is this issue related to what looks like mapping issues in our monitoring logs (from metricbeat)? Snippet from the log message:
|
I am a little wary of setting |
I don't think you should be concerned about disabling. Since you will never obtain correct results, there is no need for it to be enabled @nkvoll |
Trying to answer these 3 questions @ChrsMark :
I did some research, and I believe they are coming from the However, to produce the
It appears correctly, exactly as they come: And:
I guess we should accept them then! :) |
I am thinking on how to solve this. To put some context: If we have these 4 metrics:
For this specific metric, we can:
Would it work for us if:
What do you think @elastic/obs-cloudnative-monitoring ? Will this work ok? |
Hey @constanca-m , from my perspective I think we should just add it in addition to what we have today, if needed. However in case of Also do we know what the first metric |
Since we don't include For
We get We get Although I am not sure what this all means... But these metrics have the exact same labels, I checked. @ChrsMark |
I think that's not true: beats/metricbeat/module/prometheus/collector/_meta/testdata/metrics-with-naninf.plain-expected.json Lines 87 to 96 in 3255e53
Then I guess that's not super crucial to include it, so based that it's better to only consume storage for really important data I suggest we just skip it for now. |
Sorry late here, but is interesting all the discussion.
http_server_requests_seconds{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",quantile="0.5",} 0.0 So this is the quantile value. See I post here again the https://prometheus.io/docs/practices/histograms/#quantiles. The page says that we should calculate based on buckets and not on quantiles. So I think we should keep only buckets for now. Does it make sense? Also to summarise the status:
Am I missing something? |
From what @ChrsMark wrote, this is not true.
The
Are the same for our code, we end up using data from |
+1 for discarding them! Just to be clear, when not using types we keep the buckets, the sum and the count. See:
While when using types to store as ES histograms, see: |
this format is wrong by the prometheus definition Also in spring-boot repo there are few (possibly) similar issue
+1 on this |
What is the issue?
Currently, we are parsing the Prometheus metrics through this function. However, we do not distinguish the metrics that match
MetricTypeHistogram
. Example, a metric namedhttp_server_requests_seconds
and another one namedhttp_server_requests_seconds_bucket
will both match this part of the code.This ends up causing incorrect values if we want to enable histogram types for Prometheus - which we do when enabling
use_types
.The way we calculate histogram types for Prometheus can be found here.
Let's consider this realistic example file
We have this line in the file:
As we saw before, it does not matter that this metric is not a bucket, since it will fall under this condition. This means that we will treat it as a bucket and try to compute the bucket upper value as this:
The problem is that our value (variable
qv
) does not exist, so theerr
here will never be nil, and we will start our histogram with a bucket with a upper bound that equals1.8446744073709552e+19
( =math.MaxUint64
). This will later be a problem, because histogram values need to be incrementing, otherwise they will fail with:Which is what is happening with this realistic example file.
PR for the fix
#36537
The text was updated successfully, but these errors were encountered: