metric: add prometheus-based histogram #85990
Conversation
@sravotto would love your feedback on this as you've worked on the metrics translator code before. Our hope is that this work can help folks use our histograms effectively out of the box. @aadityasondhi can you paste some examples of histogram output in
@dhartunian they'll have bucket boundaries like these:

Of course this occurs only for histograms that actually use these bucket bounds. It's not enough to just land this PR - we also need to convert all histograms to HistogramV2 with sane buckets, and then need to also phase out Histogram (at which point we might as well rename HistogramV2 back to Histogram), losing the hdrhistogram dependency.
Reviewed 6 of 10 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @sravotto)
pkg/util/metric/metric.go
line 337 at r1 (raw file):
// resemble those of typical disk latencies, i.e. which are in the micro- and
// millisecond range during normal operation.
var IOLatencyBuckets = []float64{
In the metrics-exporter sidecar we used log10 linear buckets (for instance, for the 1-1000 interval: [1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100 200 300 400 500 600 700 800 900 1000...]). We did have many more buckets (around 40-50) per histogram.
I don't have any specific issues with using exponential buckets, assuming we can have different configurations in addition to IOLatencyBuckets and NetworkLatencyBuckets, depending on what we are measuring and how accurately we want to estimate quantiles.
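For reference, a minimal sketch of the log10-linear scheme described above; the function name and the assumption that min and max are powers of ten are illustrative, not the metrics-exporter's actual code:

```go
package main

import "fmt"

// log10LinearBuckets returns bucket bounds of the form
// [1 2 ... 9 10 20 ... 90 100 200 ...] spanning min to max,
// assuming min and max are powers of ten.
func log10LinearBuckets(min, max float64) []float64 {
	var buckets []float64
	for decade := min; decade < max; decade *= 10 {
		for step := 1.0; step <= 9.0; step++ {
			buckets = append(buckets, decade*step)
		}
	}
	return append(buckets, max)
}

func main() {
	// Reproduces the 1-1000 example from the comment above.
	fmt.Println(log10LinearBuckets(1, 1000))
}
```

This yields 28 bounds for the 1-1000 range; the 40-50 buckets mentioned above correspond to spanning more decades.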
@dhartunian This PR changes one of the histograms ( Prev:
Cur:
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @sravotto)
pkg/util/metric/metric.go
line 360 at r3 (raw file):
// behave like network latencies, i.e. most measurements are in the ms to sub-second
// range during normal operation.
var NetworkLatencyBuckets = []float64{
nit: We should provide some comment or docs around here to roughly outline the process for defining new bucket presets, where we advise that the obs-infra team is involved in the review process to make sure the new preset is sound.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @sravotto)
pkg/util/metric/metric.go
line 337 at r1 (raw file):
Previously, sravotto (silvano) wrote…
In the metrics-exporter sidecar we used log10 linear buckets (for instance, for the 1-1000 interval: [1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100 200 300 400 500 600 700 800 900 1000...]). We did have many more buckets (around 40-50) per histogram.
I don't have any specific issues with using exponential buckets, assuming we can have different configurations in addition to IOLatencyBuckets and NetworkLatencyBuckets, depending on what we are measuring and how accurately we want to estimate quantiles.
Yes, new buckets can be generated using TestHistogramBuckets.
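As a rough illustration of what generating a preset via such a test could look like (the actual TestHistogramBuckets may differ; prometheus.ExponentialBucketsRange is real client_golang API, the range and count here are hypothetical):

```go
package metric

import (
	"fmt"
	"strings"
	"testing"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// TestHistogramBuckets is sketched here as a generator that logs a
// copy-pasteable preset; the real test's ranges and counts may differ.
func TestHistogramBuckets(t *testing.T) {
	verify := func(name string, buckets []float64) {
		var sb strings.Builder
		fmt.Fprintf(&sb, "var %s = []float64{\n", name)
		for _, b := range buckets {
			// Annotate each bound with its human-readable duration,
			// assuming values are recorded in nanoseconds.
			fmt.Fprintf(&sb, "\t%f, // %s\n", b, time.Duration(b))
		}
		sb.WriteString("}")
		t.Log(sb.String())
	}
	// Hypothetical range/count; adjust per preset.
	verify("IOLatencyBuckets", prometheus.ExponentialBucketsRange(
		float64(10*time.Microsecond), float64(10*time.Second), 60))
}
```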
pkg/util/metric/metric.go
line 360 at r3 (raw file):
Previously, abarganier (Alex Barganier) wrote…
nit: We should provide some comment or docs around here to roughly outline the process for defining new bucket presets, where we advise that the obs-infra team is involved in the review process to make sure the new preset is sound.
I added a line in HistogramV2 and added some comments to TestHistogramBuckets indicating that we should generate new buckets using it and also involve the obs-infra team for reviews.
In a previous change, a new prometheus-backed histogram library was introduced to help standardize histogram buckets across the codebase. This change migrates all existing histograms to use the new library.

related: cockroachdb#85990

Release justification: low risk, high benefit

Release note (ops change): This change introduces a new histogram implementation that will reduce the total number of buckets and standardize them across all usage. This should help increase the usability of histograms when exported to a UI (i.e. Grafana).
This PR should be ready for review and merging. I created a separate PR (#86671) for migrating all of the existing histograms to V2.
modulo one tiny nit to remove a stale TODO - nice work on this!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @sravotto)
pkg/util/metric/metric.go
line 477 at r9 (raw file):
m := &prometheusgo.Metric{}
if err := h.cum.Write(m); err != nil {
	panic(err) // TODD
}
nit: remove TODO comment here and in ToPrometheusMetricWindowed.
I think it's acceptable to panic here - we're passing an externally sourced data structure (the prometheus.Histogram) to an externally sourced function. If that's not doing what we expect, something is certainly very wrong. Combine that with the fact that this has been run extensively on roachprod already - I feel safe about this.
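For context, the pattern under discussion boils down to something like this sketch; prometheus.Histogram's Write method is real client_golang API, while the helper name is made up (the PR aliases the protobuf package as prometheusgo; dto is the conventional alias for the same package):

```go
package metric

import (
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// snapshot serializes a histogram's current state into a protobuf
// metric. Write only fails if the state cannot be encoded, which would
// indicate a programmer error rather than a runtime condition, hence
// the panic.
func snapshot(h prometheus.Histogram) *dto.Metric {
	m := &dto.Metric{}
	if err := h.Write(m); err != nil {
		panic(err)
	}
	return m
}
```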
Code quote:

// TODD
This change builds on the previous one and adds a function to export quantiles from the Prometheus-based histogram. This functionality is used to store histogram data in the internal timeseries database. The hdr library came with a function to do this, while Prometheus does not have a public API for exporting quantiles. The function implemented here is very similar to the one found internally in Prometheus, using linear interpolation to calculate values at a given quantile. This commit also includes some additional testing and general refactoring of the metrics code.

Release note: None

Release justification: low risk, high benefit changes
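The interpolation this commit message describes can be sketched as follows; this is illustrative, modeled on Prometheus's internal quantile logic, not the PR's exact code:

```go
package main

import "fmt"

type bucket struct {
	upperBound float64 // inclusive upper edge of the bucket
	cumCount   uint64  // cumulative count of samples <= upperBound
}

// quantile estimates the q-th quantile (0 <= q <= 1) from cumulative
// bucket counts, assuming samples are spread uniformly within a bucket.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].cumCount
	rank := q * float64(total)
	for i, b := range buckets {
		if float64(b.cumCount) >= rank {
			lower, below := 0.0, uint64(0)
			if i > 0 {
				lower = buckets[i-1].upperBound
				below = buckets[i-1].cumCount
			}
			inBucket := float64(b.cumCount - below)
			if inBucket == 0 {
				return b.upperBound
			}
			// Linear interpolation between the bucket's edges.
			return lower + (b.upperBound-lower)*(rank-float64(below))/inBucket
		}
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	bs := []bucket{{10, 4}, {20, 14}, {50, 19}, {100, 20}}
	fmt.Println(quantile(0.5, bs)) // rank 10 falls in (10, 20]: prints 16
}
```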
bors r+
Build failed (retrying...):
Build succeeded:
In a previous change, a new prometheus-backed histogram library was introduced to help standardize histogram buckets across the codebase. This change migrates all existing histograms to use the new library.

related: cockroachdb#85990

Release justification: low risk, high benefit

Release note (ops change): This change introduces a new histogram implementation that will reduce the total number of buckets and standardize them across all usage. This should help increase the usability of histograms when exported to a UI (i.e. Grafana) and reduce the storage overhead. After applying this patch it is expected to see fewer buckets in prometheus/grafana, but still have similar values for histogram percentiles due to the use of interpolated values by Prometheus.
In a previous change, a new prometheus-backed histogram library was introduced to help standardize histogram buckets across the codebase. This change migrates all existing histograms to use the new library.

In this change, `NewLatency()` is removed in favor of explicitly defining which buckets to use between `NetworkLatencyBuckets` and `IOLatencyBuckets` when calling `NewHistogram()`. For all histograms that were previously created using the `NewLatency()` func, I tried to place them in appropriate buckets with the new library. For cases where it was unclear, I chose `IOLatencyBuckets` as it allows for a larger range of values.

related: cockroachdb#85990

Release justification: low risk, high benefit

Release note (ops change): This change introduces a new histogram implementation that will reduce the total number of buckets and standardize them across all usage. This should help increase the usability of histograms when exported to a UI (i.e. Grafana) and reduce the storage overhead. After applying this patch it is expected to see fewer buckets in prometheus/grafana, but still have similar values for histogram percentiles due to the use of interpolated values by Prometheus.
86671: metric: migrate all histograms to use prometheus-backed version r=aadityasondhi a=aadityasondhi

In a previous change, a new prometheus-backed histogram library was introduced to help standardize histogram buckets across the codebase. This change migrates all existing histograms to use the new library.

In this change, `NewLatency()` is removed in favor of explicitly defining which buckets to use between `NetworkLatencyBuckets` and `IOLatencyBuckets` when calling `NewHistogram()`. For all histograms that were previously created using the `NewLatency()` func, I tried to place them in appropriate buckets with the new library. For cases where it was unclear, I chose `IOLatencyBuckets` as it allows for a larger range of values.

related: #85990

Release justification: low risk, high benefit

Release note (ops change): This change introduces a new histogram implementation that will reduce the total number of buckets and standardize them across all usage. This should help increase the usability of histograms when exported to a UI (i.e. Grafana) and reduce the storage overhead. After applying this patch it is expected to see fewer buckets in prometheus/grafana, but still have similar values for histogram percentiles due to the use of interpolated values by Prometheus.

87084: roachtest: introduce multi tenant TPCH test r=yuzefovich a=yuzefovich

**roachtest: sort tests lexicographically**

This commit unexports a couple of registration methods of the roachtests and then orders all such methods lexicographically in the registry.

Release justification: test-only change.

Release note: None

**roachtest: introduce multi tenant TPCH test**

This commit introduces a new roachtest that runs the TPCH benchmark on a single node in a single-tenant deployment first, followed by another run in a multi-tenant deployment with a single SQL instance. It refactors the tpchvec roachtest a bit to extract a couple of helper methods for performing the latency analysis. In particular, it removes the `queriesToRun` parameter since all variants now run all 22 queries.

I ran into some difficulties around managing the certificates in different deployment models, so the test is structured so that first the single-node cluster is used in a single-tenant deployment model, and then - after running all 22 queries - a tenant with a single SQL instance is created and all queries are run within the tenant.

Here is the comparison of averages over 10 runs of single-tenant vs multi-tenant:
```
Q1:   6.34s  13.35s  110.39%
Q2:   0.22s   0.49s  122.07%
Q3:   4.88s   8.04s   64.67%
Q4:   1.51s   4.75s  213.67%
Q5:   2.33s  11.69s  400.90%
Q6:   5.51s  35.49s  543.89%
Q7:   5.75s  24.08s  318.42%
Q8:   0.85s   3.82s  348.47%
Q9:   7.34s  28.37s  286.25%
Q10:  1.99s   5.00s  150.93%
Q11:  0.55s   1.92s  249.00%
Q12:  6.02s  34.76s  477.05%
Q13:  1.88s   3.76s  100.00%
Q14:  0.64s   1.10s   73.11%
Q15:  3.33s  16.80s  404.23%
Q16:  0.88s   1.29s   47.66%
Q17:  0.24s   0.60s  145.49%
Q18:  7.75s  30.13s  288.48%
Q19:  5.20s  13.08s  151.69%
Q20: 12.66s  55.30s  336.92%
Q21:  6.98s  24.50s  250.77%
Q22:  0.62s   1.17s   90.26%
```

Fixes: #71528.

Release justification: test-only change.

Release note: None

Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Drive-by comment: this is very cool work!
From #81181:
Our current histogram is based on hdrhistogram. This tends to create
lots of buckets and is inflexible w.r.t. the bucket layout. In hindsight,
it was a poor choice of default implementation (I can say that since I
introduced it) and its cost is disincentivizing the introduction of
histograms that would be useful.
This commit introduces a histogram that is based on a completely vanilla
prometheus.Histogram. The only reason we need to wrap it is because we
want to export quantiles to CockroachDB's internal timeseries (it does
not support histograms) and this requires maintaining an internal windowed
histogram (on top of the cumulative histogram).
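A minimal sketch of the cumulative-plus-windowed arrangement described here; the type and method names are illustrative, not the PR's actual API:

```go
package metric

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// windowedHistogram pairs a cumulative histogram (what Prometheus
// scrapes) with a windowed one that is discarded on each tick, so that
// quantiles over the most recent window can be fed to the internal
// timeseries.
type windowedHistogram struct {
	mu   sync.Mutex
	opts prometheus.HistogramOpts
	cum  prometheus.Histogram // lifetime observations
	cur  prometheus.Histogram // current window only
}

func newWindowedHistogram(opts prometheus.HistogramOpts) *windowedHistogram {
	return &windowedHistogram{
		opts: opts,
		cum:  prometheus.NewHistogram(opts),
		cur:  prometheus.NewHistogram(opts),
	}
}

// RecordValue observes v in both the cumulative and windowed histograms.
func (h *windowedHistogram) RecordValue(v float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.cum.Observe(v)
	h.cur.Observe(v)
}

// Tick starts a new window, discarding the previous one.
func (h *windowedHistogram) Tick() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.cur = prometheus.NewHistogram(h.opts)
}
```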
With this done, we can now introduce metrics with any kind of buckets we
want. Helpfully, we introduce two common kinds of buckets, suitable for
IO-type and RPC-type latencies. These are defined in a human-readable
format by explicitly listing out the buckets.
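For example, a human-readable preset could be spelled out like this; the bounds below are illustrative, not the actual IOLatencyBuckets or NetworkLatencyBuckets values:

```go
package metric

import "time"

// exampleLatencyBuckets is illustrative only; bounds are in nanoseconds
// so recorded time.Duration values can be observed directly as float64s.
var exampleLatencyBuckets = []float64{
	float64(500 * time.Microsecond), // 500µs
	float64(1 * time.Millisecond),   // 1ms
	float64(5 * time.Millisecond),   // 5ms
	float64(25 * time.Millisecond),  // 25ms
	float64(100 * time.Millisecond), // 100ms
	float64(1 * time.Second),        // 1s
	float64(10 * time.Second),       // 10s
}
```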
We can move existing metrics to HistogramV2 easily, assuming we are not
concerned with existing prometheus scrapers getting tripped up by the
changes in bucket boundaries. I assume this is not a big deal in
practice as long as it doesn't happen "all the time". In fact, I would
strongly suggest we move all metrics wholesale and remove the
hdrhistogram-based implementation. If this is not acceptable for some
reason, we ought to at least deprecate it.
We also slightly improve the existing Histogram code by unifying how
the windowed histograms are ticked and by making explicit where their
quantiles are recorded (this dependency was previously hidden via a
local interface assertion).
Resolves #10015.
Resolves #64962.
Alternative to dhartunian@eac3d06
Release justification: low risk, high benefit changes