
metric: add prometheus-based histogram #81181

Closed
wants to merge 1 commit into from

Conversation

tbg
Member

@tbg tbg commented May 11, 2022

Our current histogram is based on `hdrhistogram`. This tends to create
lots of buckets and is inflexible w.r.t. the bucket layout. In hindsight,
it was a poor choice of default implementation (I can say that since I
introduced it) and its cost is disincentivizing the introduction of
histograms that would be useful.

This commit introduces a histogram that is based on a completely vanilla
`prometheus.Histogram`. The only reason we need to wrap it is that we
want to export quantiles to CockroachDB's internal timeseries (it does
not support histograms) and this requires maintaining an internal windowed
histogram (on top of the cumulative histogram).
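The wrapping described above can be sketched in stdlib-only Go. This is a hypothetical illustration, not the actual implementation (which wraps `prometheus.Histogram` rather than tracking bucket counts by hand): a cumulative count per bucket plus a windowed copy that is periodically reset ("ticked") so quantiles can be computed over a recent window.

```go
package main

import "fmt"

// windowedHistogram keeps a cumulative count per bucket plus a windowed
// copy that Tick discards, mirroring the "windowed histogram on top of
// the cumulative histogram" idea. Illustrative only; observations above
// the last bound are dropped here, whereas Prometheus has a +Inf bucket.
type windowedHistogram struct {
	upperBounds []float64
	cumulative  []uint64
	windowed    []uint64
}

func newWindowedHistogram(bounds []float64) *windowedHistogram {
	return &windowedHistogram{
		upperBounds: bounds,
		cumulative:  make([]uint64, len(bounds)),
		windowed:    make([]uint64, len(bounds)),
	}
}

// Observe records v in the first bucket whose upper bound contains it.
func (h *windowedHistogram) Observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.cumulative[i]++
			h.windowed[i]++
			break
		}
	}
}

// Tick resets the window; in the real code ticking is unified across
// histogram implementations.
func (h *windowedHistogram) Tick() {
	for i := range h.windowed {
		h.windowed[i] = 0
	}
}

func main() {
	h := newWindowedHistogram([]float64{0.1, 1, 10})
	h.Observe(0.05)
	h.Observe(2)
	h.Tick()
	h.Observe(0.5)
	fmt.Println(h.cumulative, h.windowed) // cumulative keeps all, windowed only post-tick
}
```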

With this done, we can now introduce metrics with any kind of buckets we
want. Helpfully, we introduce two common kinds of buckets, suitable for
IO-type and RPC-type latencies. These are defined in a human-readable
format by explicitly listing out the buckets.
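A hedged sketch of what such an explicit bucket layout looks like in Go (the bucket values below are illustrative, not the ones the commit actually defines); finding a sample's bucket is a binary search over the sorted upper bounds:

```go
package main

import (
	"fmt"
	"sort"
)

// Illustrative IO-latency bucket upper bounds, in seconds. The actual
// layouts introduced by the commit may differ.
var ioLatencyBuckets = []float64{
	0.0001, 0.00025, 0.0005, 0.001, 0.0025, 0.005,
	0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10,
}

// bucketFor returns the index of the first bucket whose upper bound is
// >= v; values above every bound would fall into Prometheus's implicit
// +Inf bucket.
func bucketFor(buckets []float64, v float64) int {
	return sort.SearchFloat64s(buckets, v)
}

func main() {
	fmt.Println(bucketFor(ioLatencyBuckets, 0.003)) // → 5 (the 5ms bucket)
}
```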

We can move existing metrics to HistogramV2 easily, assuming we are not
concerned with existing prometheus scrapers getting tripped up by the
changes in bucket boundaries. I assume this is not a big deal in
practice as long as it doesn't happen "all the time". In fact, I would
strongly suggest we move all metrics wholesale and remove the
hdrhistogram-based implementation. If this is not acceptable for some
reason, we ought to at least deprecate it.

We also slightly improve the existing `Histogram` code by unifying how
the windowed histograms are ticked and by making explicit where their
quantiles are recorded (this dependency was previously hidden via a
local interface assertion).

Resolves #10015.
Resolves #64962.
Alternative to dhartunian@eac3d06

TODO

  • export quantiles (how to do this? Not clear, don't see the code
    laying around in prometheus, might have to hand-roll it but should
    be easy enough)

Release note: None


@tbg tbg force-pushed the histogram-updates branch from fbc778a to 495744d Compare May 11, 2022 01:23
@tbg tbg requested a review from dhartunian May 11, 2022 01:24
@a-robinson
Contributor

Nice timing, I was just wondering why it was necessary to be exporting hundreds of buckets from so many latency histograms :)

@andreimatei
Contributor

It's interesting that Prometheus is moving in the direction of logarithmic histograms, with no explicit control over the buckets.

export quantiles (how to do this? Not clear, don't see the code
laying around in prometheus, might have to hand-roll it but should
be easy enough)

If I understand correctly what you mean, I don't think it's surprising that the Prometheus client libraries don't support this. There is the Prometheus Summary metric type, which maintains windowed quantiles, but it's kind of deprecated. Other than that, Prometheus doesn't like windowed measurements; it insists on being able to compute mathematically correct values for any time window.

@tbg
Member Author

tbg commented May 20, 2022

If I understand correctly what you mean, I don't think it's surprising that the Prometheus client libraries don't support this. There is the Prometheus Summary metric type, which maintains windowed quantiles, but it's kind of deprecated. Other than that, Prometheus doesn't like windowed measurements; it insists on being able to compute mathematically correct values for any time window.

Right, I'm not surprised prom doesn't have this bundled up in a library anywhere. But we need it to support our homegrown internal timeseries DB, which doesn't support histograms: we record quantiles instead, so we have to be able to turn a histogram snapshot into quantiles on demand.
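Turning a histogram snapshot into quantiles can be hand-rolled with the same linear interpolation PromQL's `histogram_quantile` uses. The sketch below is a hypothetical stdlib-only version (the actual CockroachDB code and names may differ), assuming sorted cumulative buckets whose last entry holds the total count:

```go
package main

import "fmt"

// bucket pairs an upper bound with the cumulative count of observations
// <= that bound, mirroring a Prometheus histogram snapshot.
type bucket struct {
	upperBound float64
	cumCount   uint64
}

// quantile linearly interpolates the q-quantile (0 <= q <= 1) from
// cumulative bucket counts, like PromQL's histogram_quantile.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 || q < 0 || q > 1 {
		return 0
	}
	total := buckets[len(buckets)-1].cumCount
	if total == 0 {
		return 0
	}
	rank := q * float64(total)
	for i, b := range buckets {
		if float64(b.cumCount) >= rank {
			lowerBound, lowerCount := 0.0, uint64(0)
			if i > 0 {
				lowerBound = buckets[i-1].upperBound
				lowerCount = buckets[i-1].cumCount
			}
			// Interpolate the rank's position within this bucket.
			frac := (rank - float64(lowerCount)) / float64(b.cumCount-lowerCount)
			return lowerBound + (b.upperBound-lowerBound)*frac
		}
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	bs := []bucket{{0.1, 10}, {0.5, 70}, {1, 100}}
	fmt.Println(quantile(0.5, bs)) // the median falls in the (0.1, 0.5] bucket
}
```

Note that, as with `histogram_quantile`, the result is only as precise as the bucket layout: the true median could be anywhere inside the bucket it lands in.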

craig bot pushed a commit that referenced this pull request Aug 26, 2022
85990: metric: add prometheus-based histogram r=aadityasondhi a=aadityasondhi

From #81181:


Resolves #10015.
Resolves #64962.
Alternative to dhartunian@eac3d06

Release justification: low risk, high benefit changes

86007: kvserver: log traces from replicate queue on errors or slow processing r=andreimatei a=AlexTalks

While we previously had some logging from the replicate queue as a
result of the standard queue logging, this change adds logging to the
replicate queue when there are errors in processing a replica, or when
processing a replica exceeds 50% of the timeout duration.
When there are errors or the duration threshold is exceeded,
any error messages are logged along with the collected tracing spans
from the operation.

Release note (ops change): Added logging on replicate queue processing
in the presence of errors or when the duration exceeds 50% of the
timeout.

Release justification: Low risk observability change.

86255: upgrades,upgrade,upgrades_test: Added an upgrade for updating invalid column ID in seq's back references r=Xiang-Gu a=Xiang-Gu

commit 1: a small refactoring of existing function
commit 2: added a field to TenantDeps
commit 3: added a cluster version upgrade that attempts to update
invalid column ID in seq's back references due to bugs in prior versions.
See below for details
commit 4: wrote a test for this newly added upgrade

Previously, in version 21.1 and prior, `ADD COLUMN DEFAULT nextval('s')`
would incorrectly store a 0-valued column ID in the sequence 's' back reference
`dependedOnBy`, because we added this dependency before allocating an
ID to the newly added column. Customers ran into issues when upgrading
to v22.1, so we relaxed the validation logic as a quick short-term fix, as detailed
in #82859.

Now we want to give this issue a more proper treatment by adding a cluster
upgrade (to v22.2) that attempts to detect such invalid column IDs and
update them with the correct column ID. This PR does exactly that.

Fixes: #83124

Release note: None

Release justification: fixes a release blocker by resolving invalid column IDs
in sequences' back references, caused by bugs in older binaries.

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Xiang Gu <[email protected]>
@tbg tbg closed this Sep 19, 2022