
[prometheus][remote_write] Failing to parse some histogram fields #7893

Closed
tetianakravchenko opened this issue Sep 20, 2023 · 6 comments

@tetianakravchenko
Contributor

tetianakravchenko commented Sep 20, 2023

Some documents are dropped due to:

"prometheus\":{\"apiserver_flowcontrol_priority_level_request_utilization\":{\"histogram\":{\"counts\":[5000945144],\"values\":[0.25]}},\"labels\":{\"instance\":\"10.128.0.10:443\",\"job\":\"kubernetes-apiservers\",\"phase\":\"waiting\",\"priority_level\":\"node-high\"}},\"service\":{\"type\":\"prometheus\"}}, Private:interface {}(nil), TimeSeries:true}, Flags:0x0, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {\"type\":\"document_parsing_exception\",
    
    \"reason\":\"[1:2472] failed to parse field [prometheus.apiserver_flowcontrol_priority_level_request_utilization.histogram] of type [histogram]\",\"caused_by\":{\"type\":\"illegal_argument_exception\",
    
    "reason\":\"[1:2482] Numeric value (5000945144) out of range of int (-2147483648 - 2147483647)\\n at 
"reason":"[1:3039] failed to parse field [prometheus.go_gc_pauses_seconds_total.histogram] of type [histogram]","caused_by":{"type":"document_parsing_exception","reason":"[1:3039] error parsing field [prometheus.go_gc_pauses_seconds_total.histogram], [values] values must be in increasing order, got [-4.9E-324] but previous value was [0.0]"}}, dropping event!

This could be related to the fact that the data stream was actually dropped first to empty the index.
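
For context, the first rejection can be reproduced against a bare Elasticsearch histogram field, independent of the remote_write flow. This is only a sketch - the index name, field name and localhost:9200 endpoint are assumptions, and it applies to versions affected by this issue:

# hypothetical index with a single histogram-mapped field
curl -s -X PUT "localhost:9200/histo-test" -H 'Content-Type: application/json' -d '
{
  "mappings": { "properties": { "request_utilization": { "type": "histogram" } } }
}'

# a bucket count above 2147483647 (max signed 32-bit int) is rejected with the
# same kind of document_parsing_exception / illegal_argument_exception as above
curl -s -X POST "localhost:9200/histo-test/_doc" -H 'Content-Type: application/json' -d '
{
  "request_utilization": { "values": [0.25], "counts": [5000945144] }
}'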

@tetianakravchenko tetianakravchenko changed the title [prometheus][remote_write] [prometheus][remote_write] Failing to parse some histogram fields Sep 20, 2023
@tetianakravchenko tetianakravchenko self-assigned this Sep 20, 2023
@tetianakravchenko
Contributor Author

The second error ([values] values must be in increasing order, got [-4.9E-324] but previous value was [0.0]) is related to elastic/beats#36317 and is going to be fixed soon.

@pjbertels

pjbertels commented Sep 21, 2023

job_name: 'kubernetes-apiservers' and job_name: 'kubernetes-cadvisor' are the two scrape targets that generate the histograms in my setup.

@tetianakravchenko
Contributor Author

I was able to reproduce the issue on my setup as well for multiple apiserver_flowcontrol_* histograms; it is actually just 3 metrics:
apiserver_flowcontrol_priority_level_request_utilization,
apiserver_flowcontrol_demand_seats,
apiserver_flowcontrol_read_vs_write_current_requests
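
The raw cumulative bucket counts for these metrics can also be inspected straight from the apiserver metrics endpoint, to check whether they exceed the signed 32-bit int range - for example (assuming kubectl access to the cluster; the metric name in the grep is just one of the three):

kubectl get --raw /metrics | grep apiserver_flowcontrol_priority_level_request_utilization_bucket | head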

After some time, I see the histogram metric prometheus.apiserver_flowcontrol_priority_level_request_utilization.histogram, but it is empty - {"values":[],"counts":[]} - I am not sure if this is a correct value:
[Screenshot 2023-09-22 at 14 24 02]

@tetianakravchenko
Contributor Author

tetianakravchenko commented Sep 22, 2023

Opened an Elasticsearch issue: elastic/elasticsearch#99820.
One thing I can think of for now: add a check on the Beats side, so that the whole document with all the other metrics is not dropped.
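
To illustrate the kind of client-side check meant here (purely a sketch, not actual Beats code - the sample document and the use of jq are made up for this example), the two constraints these documents were rejected on can be validated before indexing:

# illustrative pre-flight validation of a histogram object with jq:
# counts must fit in a signed 32-bit int and values must be in increasing order
# (approximated here as "already sorted")
echo '{"values":[0.25],"counts":[5000945144]}' | jq '
  (.counts | map(. >= -2147483648 and . <= 2147483647) | all)
  and (.values == (.values | sort))'
# prints "false" for this sample, so the event could be sanitized or skipped
# instead of the whole document being rejected with a 400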

@tetianakravchenko
Contributor Author

tetianakravchenko commented Sep 26, 2023

Regarding the error: "reason":"[1:2805] failed to parse field [prometheus.go_gc_pauses_seconds_total.histogram] of type [histogram]","caused_by":{"type":"document_parsing_exception","reason":"[1:2805] error parsing field [prometheus.go_gc_pauses_seconds_total.histogram], [values] values must be in increasing order, got [-4.9E-324] but previous value was [0.0]"}}, dropping event!

All similar errors seem to be coming from the kubernetes-nodes job.

The actual metric looks like:

curl -s localhost:10249/metrics | grep go_gc_pauses_seconds_total
# HELP go_gc_pauses_seconds_total Distribution individual GC-related stop-the-world pause latencies.
# TYPE go_gc_pauses_seconds_total histogram
go_gc_pauses_seconds_total_bucket{le="-5e-324"} 0
go_gc_pauses_seconds_total_bucket{le="9.999999999999999e-10"} 0
go_gc_pauses_seconds_total_bucket{le="9.999999999999999e-09"} 0
go_gc_pauses_seconds_total_bucket{le="9.999999999999998e-08"} 0
go_gc_pauses_seconds_total_bucket{le="1.0239999999999999e-06"} 0
go_gc_pauses_seconds_total_bucket{le="1.0239999999999999e-05"} 24575
go_gc_pauses_seconds_total_bucket{le="0.00010239999999999998"} 25754
go_gc_pauses_seconds_total_bucket{le="0.0010485759999999998"} 51322
go_gc_pauses_seconds_total_bucket{le="0.010485759999999998"} 51579
go_gc_pauses_seconds_total_bucket{le="0.10485759999999998"} 51628
go_gc_pauses_seconds_total_bucket{le="+Inf"} 51628
go_gc_pauses_seconds_total_sum NaN
go_gc_pauses_seconds_total_count 51628

The first bucket boundary is actually a negative value - le="-5e-324".

The same behavior occurs for some other metrics, e.g. go_sched_latencies_seconds.

This will be fixed by elastic/beats#36647.
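
For reference, the increasing-order rejection can also be reproduced against a bare histogram field, without Metricbeat in the loop. This is only a sketch - the index name, field name and localhost:9200 endpoint are assumptions, and the values just mimic the 0.0 followed by -4.9e-324 pair from the error above:

# hypothetical index with a single histogram-mapped field
curl -s -X PUT "localhost:9200/histo-order-test" -H 'Content-Type: application/json' -d '
{
  "mappings": { "properties": { "gc_pauses": { "type": "histogram" } } }
}'

# values must be in increasing order, so a value below the previous 0.0 is rejected
curl -s -X POST "localhost:9200/histo-order-test/_doc" -H 'Content-Type: application/json' -d '
{
  "gc_pauses": { "values": [0.0, -4.9e-324], "counts": [0, 24575] }
}'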

@tetianakravchenko
Contributor Author

First error - Numeric value (5000945144) out of range of int (-2147483648 - 2147483647) - should be fixed by elastic/elasticsearch#99820.

Second error - [values] values must be in increasing order, got [-4.9E-324] but previous value was [0.0] - should be fixed by elastic/beats#36647.

Both PRs were merged and will be available in 8.11.0.
