Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Prometheus histogram] Histograms are being computed using wrong metrics #36317

Closed
constanca-m opened this issue Aug 14, 2023 · 17 comments · Fixed by #36537
Closed

[Prometheus histogram] Histograms are being computed using wrong metrics #36317

constanca-m opened this issue Aug 14, 2023 · 17 comments · Fixed by #36537
Assignees
Labels
bug Metricbeat Metricbeat Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team

Comments

@constanca-m
Copy link
Contributor

constanca-m commented Aug 14, 2023

What is the issue?

Currently, we are parsing the Prometheus metrics through this function. However, we do not distinguish the metrics that match MetricTypeHistogram. Example, a metric named http_server_requests_seconds and another one named http_server_requests_seconds_bucket will both match this part of the code.

This ends up causing incorrect values if we want to enable histogram types for Prometheus - which we do when enabling use_types.

The way we calculate histogram types for Prometheus can be found here.

Let's consider this realistic example file
# HELP http_server_requests_seconds
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",quantile="0.5",} 0.0
http_server_requests_seconds{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",quantile="0.9",} 0.0
http_server_requests_seconds{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",quantile="0.95",} 0.0
http_server_requests_seconds{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",quantile="0.99",} 0.0
http_server_requests_seconds{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",quantile="0.999",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.001",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.001048576",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.001398101",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.001747626",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.002097151",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.002446676",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.002796201",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.003145726",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.003495251",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.003844776",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.004194304",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.005592405",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.006990506",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.008388607",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.009786708",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.011184809",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.01258291",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.013981011",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.015379112",} 0.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.016777216",} 1.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.022369621",} 1.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.027962026",} 162.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.033554431",} 5005.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.039146836",} 7425.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.044739241",} 8302.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.050331646",} 9316.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.055924051",} 9761.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.061516456",} 10134.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.067108864",} 10349.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.089478485",} 10755.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.111848106",} 10798.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.134217727",} 10798.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.156587348",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.178956969",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.20132659",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.223696211",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.246065832",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.268435456",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.357913941",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.447392426",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.536870911",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.626349396",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.715827881",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.805306366",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.894784851",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="0.984263336",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="1.073741824",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="1.431655765",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="1.789569706",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="2.147483647",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="2.505397588",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="2.863311529",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="3.22122547",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="3.579139411",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="3.937053352",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="4.294967296",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="5.726623061",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="7.158278826",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="8.589934591",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="10.021590356",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="11.453246121",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="12.884901886",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="14.316557651",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="15.748213416",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="17.179869184",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="22.906492245",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="28.633115306",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="30.0",} 10800.0
http_server_requests_seconds_bucket{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",le="+Inf",} 10800.0

We have this line in the file:

http_server_requests_seconds{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",quantile="0.5",} 0.0

As we saw before, it does not matter that this metric is not a bucket, since it will fall under this condition. This means that we will treat it as a bucket and try to compute the bucket upper value as this:

f, err := strconv.ParseFloat(qv, 64)
if err != nil {
	f = math.MaxUint64
}

The problem is that our value (variable qv) does not exist, so the err here will never be nil, and we will start our histogram with a bucket with a upper bound that equals 1.8446744073709552e+19 ( = math.MaxUint64). This will later be a problem, because histogram values need to be incrementing, otherwise they will fail with:

{"type":"document_parsing_exception","reason":"[1:549] failed to parse field [prometheus.http_server_requests_seconds.histogram] of type [histogram]","caused_by":{"type":"document_parsing_exception","reason":"[1:549] error parsing field [prometheus.http_server_requests_seconds.histogram], [values] values must be in increasing order, got [9.223372036854776E18] but previous value was [1.8446744073709552E19]"}}, dropping event!

Which is what is happening with this realistic example file.

PR for the fix

#36537

@constanca-m constanca-m added the Metricbeat Metricbeat label Aug 14, 2023
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Aug 14, 2023
@constanca-m constanca-m added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Aug 14, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Aug 14, 2023
@constanca-m constanca-m added bug needs_team Indicates that the issue/PR needs a Team:* label labels Aug 14, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Aug 14, 2023
@elastic elastic deleted a comment from botelastic bot Aug 14, 2023
@ChrsMark
Copy link
Member

ChrsMark commented Aug 16, 2023

suffixSumHey @constanca-m ! Thanks for spotting this.

From what I can see the whole sample you provide defines a "histogram" with the http_server_requests_seconds metrics providing the quantiles of the histogram.

However I'm not sure if this is accepted by Openmetrics/Prometheus standard. So I see two open questions here that we need to cover before moving on to any change:

  1. Do you know what type of application provides these metrics?
  2. Next we need to check if this type of "histogram" is "accepted" as Prometheus/Openmetrics metric.
  3. Also we can check if a Prometheus server can scrape these. If Prometheus accepts those we should accept those as well.

If we verify that we need to cover this case as well I guess we need to adopt our parser accordingly, which should not be a big deal.

@gizas @tetianakravchenko any thoughts on this since you spent time recently on the Prometheus module?

@gizas
Copy link
Contributor

gizas commented Aug 16, 2023

Nice one @constanca-m , detailed description here.

So the main question is how prometheus server accepts them and visualises them. Seeing here, I think we hit the case where we have summaries and histograms. Readind the page I see the comment aggregating the precomputed quantiles from a summary rarely makes sense.. This makes me think that we can skip the re-calcultation of http_server_requests_seconds but we need to collect them as it is. What do you think?

Second question is why textparse.MetricTypeHistogram (code) recognises this as histogram? Should not only recognise the _bucket as ones?

@constanca-m
Copy link
Contributor Author

Second question is why textparse.MetricTypeHistogram recognises this as histogram?

If we look at the document:

# HELP http_server_requests_seconds
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds{...
http_server_requests_seconds_bucket{...

We can see that the metric name is http_server_requests_seconds and the type is histogram. We registered the type for the metric here. So when reading the line http_server_requests_seconds{ we end up falling in the histogram condition as well.

But notice the document and all the metrics under http_server_requests_seconds:

# HELP http_server_requests_seconds
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds{...
http_server_requests_seconds_bucket{...
...
http_server_requests_seconds_count{...
http_server_requests_seconds_sum{...

I did not include this _sum and _count metric in the description because they are not affecting this issue. But they are always in the metrics we fetch. They also fall under our histogram, but we exclude _sum in the condition, which is what we should be doing for everything that is not _bucket @gizas.

@ChrsMark
Copy link
Member

I think the best place to handle it is where the actual parsing of the Histogram metric takes place:

case isCount(name):
u := uint64(s)
histogram.SampleCount = &u
name = name[:len(name)-6]
case isSum(name):
histogram.SampleSum = &s
name = name[:len(name)-4]
case isGaugeHistogram && isGCount(name):
u := uint64(s)
histogram.SampleCount = &u
name = name[:len(name)-7]
case isGaugeHistogram && isGSum(name):
histogram.SampleSum = &s
name = name[:len(name)-5]
default:

But still I would like to have the information of why we hit this by answering the 3 questions from #36317 (comment). Otherwise it would be a blind hot-fix without solid reasoning about this. Definitely we need to fix this, but we need to be specific on why we hit this so as to have the answer in the future as well.

@nkvoll
Copy link
Member

nkvoll commented Aug 30, 2023

Is this issue related to what looks like mapping issues in our monitoring logs (from metricbeat)?

Snippet from the log message:

{"type":"illegal_argument_exception","reason":"mapper [prometheus.rest_client_request_duration_seconds.histogram.values] cannot be changed from type [float] to [long]"}, dropping event!

@constanca-m
Copy link
Contributor Author

I had a look at the logs and it does not seem like the same error @nkvoll . However, the histogram type for prometheus is always wrong, so it does not make sense to have it enabled. If you disable it, can you see the metrics then?

To disable, just set use_types: false (see here).

@nkvoll
Copy link
Member

nkvoll commented Aug 30, 2023

I am a little wary of setting use_types: false here because it's configured on a metricbeat instance that is used by multiple teams and services, and I'm worried that I'll break someones expectations on the indexed data if I do. Is that concern warranted?

@constanca-m
Copy link
Contributor Author

Is that concern warranted?

I don't think you should be concerned about disabling. Since you will never obtain correct results, there is no need for it to be enabled @nkvoll

@constanca-m
Copy link
Contributor Author

Trying to answer these 3 questions @ChrsMark :

Do you know what type of application provides these metrics?

I did some research, and I believe they are coming from the actuator/prometheus endpoint of a Spring application.

However, to produce the http_server_requests_seconds metrics we need to set a property management.metrics.web.server.request.autotime.percentiles that was already removed in the most recent spring versions.

Next we need to check if this type of "histogram" is "accepted" as Prometheus/Openmetrics metric.

It appears correctly, exactly as they come:

image

And:

image

Also we can check if a Prometheus server can scrape these. If Prometheus accepts those we should accept those as well.

I guess we should accept them then! :)

@constanca-m
Copy link
Contributor Author

constanca-m commented Sep 6, 2023

I am thinking on how to solve this. To put some context:

If we have these 4 metrics:

# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds{...
http_server_requests_seconds_bucket{...
http_server_requests_seconds_count{...
http_server_requests_seconds_sum{...
  • When we are using use_types: false we only use the *_sum and *_count and discard the others.
  • When we are using use_types: true we discard those two and use the buckets.

For this specific metric, we can:

  1. Have http_server_requests_seconds and http_server_requests_seconds_bucket present simultaneously;
  2. Have only http_server_requests_seconds present
  3. Have only http_server_requests_seconds_bucket present.

Would it work for us if:

  1. In case the previous 1 is true, we only use *_bucket metrics (they have more data) and discard the others. Additionally, *_bucket is a known suffix for histogram, so it makes sense to favor it.
  2. In case the previous 2 is true, we use http_server_requests_seconds metrics to compute the histogram data.
  3. In case the previous 3 is true, we use http_server_requests_seconds_bucket to compute the histogram data.

What do you think @elastic/obs-cloudnative-monitoring ? Will this work ok?

@ChrsMark
Copy link
Member

ChrsMark commented Sep 6, 2023

Hey @constanca-m ,

from my perspective I think we should just add it in addition to what we have today, if needed. However in case of use_types: true I see we might can naming conflicts since we don't suffix with .bucket the histograms' bucket values.

Also do we know what the first metric http_server_requests_seconds represents? We need to first of all understand this before taking any decisions. If this is not important we can just skip this metric it for now.

@constanca-m
Copy link
Contributor Author

from my perspective I think we should just add it in addition to what we have today, if needed.

Since we don't include *_bucket metrics when use_types: false, then there is nothing to be done on that side.

For use_types: true, if we want to have both, we would have to:

  1. Add some name to this new metric, since *_bucket metrics have their name changed and it causes the conflict, like you said.
  2. Or add the suffix bucket to all *_bucket metrics. But this solution is not good, since it could cause breaking changes as it would affect all _bucket metrics.

Also do we know what the first metric http_server_requests_seconds represents?

We get http_server_requests_seconds from management.metrics.web.server.request.autotime.percentiles. From this, "Computed non-aggregable percentiles to publish.".

We get http_server_requests_seconds_bucket from management.metrics.distribution.percentiles-histogram.*, "that support aggregable percentile calculation".

Although I am not sure what this all means... But these metrics have the exact same labels, I checked. @ChrsMark

@ChrsMark
Copy link
Member

ChrsMark commented Sep 6, 2023

Since we don't include *_bucket metrics when use_types: false, then there is nothing to be done on that side.

I think that's not true:

"prometheus": {
"labels": {
"instance": "127.0.0.1:41287",
"job": "prometheus",
"le": "2"
},
"metrics": {
"http_request_duration_seconds_bucket": 2
}
},

From this, "Computed non-aggregable percentiles to publish.".

Then I guess that's not super crucial to include it, so based that it's better to only consume storage for really important data I suggest we just skip it for now.

@gizas
Copy link
Contributor

gizas commented Sep 7, 2023

Sorry late here, but is interesting all the discussion.

Also do we know what the first metric http_server_requests_seconds represents?

http_server_requests_seconds{application="storage-service",exception="None",host="storage-service-6ffd645fbf-5fz69",method="GET",outcome="SUCCESS",site="COLOGNE",status="200",tenant="unknown",uri="/actuator/prometheus",quantile="0.5",} 0.0

So this is the quantile value. See quantile="0.5"

I post here again the https://prometheus.io/docs/practices/histograms/#quantiles. The page says that we should calculate based on buckets and not on quantiles. So I think we should keep only buckets for now. Does it make sense?

Also to summarise the status:

  • use_types: false
    Keeps buckets, *_sum and *_count
    -use_types: true
    We keep *_bucket and rename them to histograms right? Also here we need to discard the http_server_requests_seconds for correct calclations

Am I missing something?

@constanca-m
Copy link
Contributor Author

use_types: false
Keeps buckets, *_sum and *_count

From what @ChrsMark wrote, this is not true.

We keep *_bucket and rename them to histograms right? Also here we need to discard the http_server_requests_seconds for correct calclations

The *_bucket is already correctly considered a histogram. But since these two metrics:

http_server_requests_seconds{...
http_server_requests_seconds_bucket{...

Are the same for our code, we end up using data from http_server_requests_seconds. I think we should discard http_server_requests_seconds all together then. Especially, because http_server_requests_seconds doesn't have any acceptable suffix for histogram. @gizas

@ChrsMark
Copy link
Member

ChrsMark commented Sep 7, 2023

+1 for discarding them!

Just to be clear, when not using types we keep the buckets, the sum and the count. See:

While when using types to store as ES histograms, see:

@tetianakravchenko
Copy link
Contributor

tetianakravchenko commented Sep 7, 2023

this format is wrong by the prometheus definition

Also in spring-boot repo there are few (possibly) similar issue

  1. Percentile metrics not emitted with standard prometheus histogram format spring-projects/spring-boot#24662 - metrics are also failing to be scraped by the Datadog due to the format
  2. Prometheus histogram containing summary fails to be scraped by Opentelemetry collector micrometer-metrics/micrometer#3494 - failing to be scraped by Opentelemetry collector, I believe it could be a case here as well that simultaneously is configured publishPercentiles(x,x,x) and .publishPercentileHistogram() that cause mixing of summary and histogram. If it is a case - I don't think that we should handle misconfiguration of a spring-boot

I think we should discard http_server_requests_seconds all together then

+1 on this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Metricbeat Metricbeat Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants