metric: migrate all histograms to use prometheus-backed version #86671
Conversation
force-pushed from 9b4984b to ce8c203
None of the comments are blocking but I think they are substantial to avoid future slip-ups, so consider a follow-up PR if necessary.
I didn't review the changes line by line (only the bucket boundaries), but I checked a few and this seems largely mechanical and hard to get wrong.
536699575188.601318, // 8m56.699575188s
1012173589826.278687, // 16m52.173589826s
1908880541934.094238, // 31m48.880541934s
3599999999999.998535, // 59m59.999999999s
Does this make sense? Doesn't this record things like jobs which can run for hours, days, weeks? If this isn't the intended use case, consider a rename to make the 1h limit more obvious.
pkg/util/metric/histogram_buckets.go
Outdated
// CountBuckets are prometheus histogram buckets suitable for a histogram that
// records a quantity that is a count (unit-less) in which most measurements are
// in the 1 to ~1000 range during normal operation.
var CountBuckets = []float64{
Here too it might make sense to encode the 1k upper limit in the name, Count1KBuckets, to avoid accidentally using this in an unsuitable context.
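For context, bucket sets like this are usually generated with exponential spacing. A minimal sketch (plain Go mirroring the semantics of prometheus.ExponentialBuckets; the function name and the factor-2 layout are illustrative assumptions, not the actual CockroachDB definition):

```go
package main

import (
	"fmt"
	"math"
)

// exponentialBuckets returns `count` upper bounds starting at `start`,
// each `factor` times the previous one (same shape as
// prometheus.ExponentialBuckets).
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start * math.Pow(factor, float64(i))
	}
	return buckets
}

func main() {
	// A hypothetical Count1KBuckets: 1 to ~1000 via 11 factor-2 buckets.
	fmt.Println(exponentialBuckets(1, 2, 11)) // [1 2 4 8 16 32 64 128 256 512 1024]
}
```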
pkg/util/metric/histogram_buckets.go
Outdated
// PercentBuckets are prometheus histogram buckets suitable for a histogram that
// records a percent quantity [0,100]
var PercentBuckets = []float64{
Percent100Buckets could avoid the check that otherwise all readers will do about whether this is [0.0,1.0] or [0,100].
pkg/util/metric/histogram_buckets.go
Outdated
// DataSizeBuckets are prometheus histogram buckets suitable for a histogram that
// records a quantity that is a size (byte-denominated) in which most measurements are
// in the kB to MB range during normal operation.
var DataSizeBuckets = []float64{
ditto about encoding 16mb in name
pkg/util/metric/histogram_buckets.go
Outdated
// MemoryUsageBuckets are prometheus histogram buckets suitable for a histogram that
// records memory usage (in Bytes)
var MemoryUsageBuckets = []float64{
Ditto; also, an upper limit of 19kb seems very arbitrary and not sufficient for many future memory monitoring asks. For example, in the raft transport we're dealing with entries that are SSTs, typically of size 16mb, so there we'd want at least the 16mb range to be resolved properly.
force-pushed from ce8c203 to 4d00ddc
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna and @tbg)
pkg/util/metric/histogram_buckets.go
line 100 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Does this make sense? Doesn't this record things like jobs which can run for hours, days, weeks? If this isn't the intended use case, consider a rename to make the 1h limit more obvious.
Noted, I will revisit the naming for this one.
pkg/util/metric/histogram_buckets.go
line 106 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Here too it might make sense to encode the 1k upper limit in the name, Count1KBuckets, to avoid accidentally using this in an unsuitable context.
ack
pkg/util/metric/histogram_buckets.go
line 123 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Percent100Buckets could avoid the check that otherwise all readers will do about whether this is [0.0,1.0] or [0,100].
ack
pkg/util/metric/histogram_buckets.go
line 140 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
ditto about encoding 16mb in name
ack
pkg/util/metric/histogram_buckets.go
line 161 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
ditto an upper limit of 19kb seems very arbitrary and not sufficient for many future memory monitoring asks. For example, in the raft transport we're dealing with entries that are SSTs, typically of size 16mb, so there we'd want at least the 16mb range to be resolved properly.
This came from what seemed to be a very common upper limit for many buckets across the codebase: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/mem_metrics.go#L76
Thanks for providing the context on what our memory usage looks like. I will update these buckets!
force-pushed from 3d4b86d to 6b2c224
Thanks for taking the time to go through all the histograms! Just a few questions from me.
Reviewed 15 of 32 files at r1, 19 of 19 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, @stevendanna, and @tbg)
-- commits
line 6 at r3:
Can you add some text about the methodology for conversion? Which calls to NewLatency get IO buckets vs Network etc.
pkg/ccl/sqlproxyccl/metrics.go
line 230 at r3 (raw file):
metaConnMigrationAttemptedCount, base.DefaultHistogramWindowInterval(), metric.IOLatencyBuckets,
why is connection latency using IO buckets and not Network?
pkg/util/metric/histogram_buckets.go
line 161 at r1 (raw file):
Previously, aadityasondhi (Aaditya Sondhi) wrote…
This came from what seemed to be a very common upper limit for many buckets across the codebase: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/mem_metrics.go#L76
Thanks for providing the context on what our memory usage looks like. I will update these buckets!
I think we should have a bucket beyond 16 if we want to provide resolution at 16, right? Otherwise, it's unclear if it's at 16 or 512, say, in the max bucket.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @stevendanna, and @tbg)
Previously, dhartunian (David Hartunian) wrote…
Can you add some text about the methodology for conversion? Which calls to NewLatency get IO buckets vs Network etc.
I followed a very simplistic approach when doing this conversion. Since NewLatency used a MaxLatency of 10s for all histograms created using it, and IOLatencyBuckets has the same max, I used IO for all. I will revisit this and change to Network where it is more appropriate for some of the histograms.
pkg/ccl/sqlproxyccl/metrics.go
line 230 at r3 (raw file):
Previously, dhartunian (David Hartunian) wrote…
why is connection latency using IO buckets and not Network?
see comment above
pkg/util/metric/histogram_buckets.go
line 161 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
I think we should have a bucket beyond 16 if we want to provide resolution at 16, right? Otherwise, it's unclear if it's at 16 or 512, say, in the max bucket.
We will still be able to resolve at 16 since Prometheus adds an automatic Inf bucket for catch-all greater than the largest bucket. But I will still go ahead and make this 32 MB.
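The implicit +Inf catch-all described above can be illustrated with a small simulation (plain Go, not the Prometheus client; the `observe` helper and the bucket bounds are made up for illustration):

```go
package main

import "fmt"

// observe returns per-bucket counts for the given upper bounds, plus an
// implicit +Inf catch-all in the final slot, mimicking what Prometheus
// adds automatically beyond the largest explicit bucket.
func observe(bounds []float64, samples []float64) []int {
	counts := make([]int, len(bounds)+1) // last slot is +Inf
	for _, s := range samples {
		placed := false
		for i, ub := range bounds {
			if s <= ub {
				counts[i]++
				placed = true
				break
			}
		}
		if !placed {
			counts[len(bounds)]++ // overflowed every explicit bucket
		}
	}
	return counts
}

func main() {
	const mb = 1 << 20
	bounds := []float64{1 * mb, 4 * mb, 16 * mb} // largest explicit bucket: 16MB
	samples := []float64{2 * mb, 16 * mb, 64 * mb}
	fmt.Println(observe(bounds, samples)) // [0 1 1 1]: 64MB lands in +Inf
}
```

So a 16MB observation is still resolved exactly, but anything above the last bound is only known to be "bigger than 16MB", which is the resolution concern behind extending the range to 32MB.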
force-pushed from f5739ec to ddbd1cd
not a dealbreaker, but would be nice if _status/vars output contained the buckets in numerical order instead of lexicographic.
Reviewed 4 of 5 files at r4, 11 of 11 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna and @tbg)
force-pushed from ddbd1cd to 26e7f21
In a previous change, a new prometheus-backed histogram library was introduced to help standardize histogram buckets across the codebase. This change migrates all existing histograms to use the new library.

In this change, `NewLatency()` is removed in favor of explicitly defining which buckets to use between `NetworkLatencyBuckets` and `IOLatencyBuckets` when calling `NewHistogram()`. For all histograms that were previously created using the `NewLatency()` func, I tried to place them in appropriate buckets with the new library. For cases where it was unclear, I chose `IOLatencyBuckets` as it allows for a larger range of values.

related: cockroachdb#85990

Release justification: low risk, high benefit

Release note (ops change): This change introduces a new histogram implementation that will reduce the total number of buckets and standardize them across all usage. This should help increase the usability of histograms when exported to a UI (i.e. Grafana) and reduce the storage overhead. After applying this patch it is expected to see fewer buckets in prometheus/grafana, but still have similar values for histogram percentiles due to the use of interpolated values by Prometheus.
force-pushed from 26e7f21 to a82aa82
bors r+
Build failed (retrying...)
Build succeeded
88446: metric: front-port a regression test r=aadityasondhi a=tbg

Front-ports some testing improvements from #88331. Note that 22.2 and master don't exhibit the bug fixed in that PR since we switched to using prometheus histograms in #86671.

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
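The claim that percentiles stay similar despite fewer buckets rests on how `histogram_quantile` linearly interpolates within the bucket containing the target rank. A simplified sketch of that estimate (plain Go; the `quantile` helper is illustrative and ignores the +Inf edge cases Prometheus handles):

```go
package main

import "fmt"

// quantile mimics the core of Prometheus's histogram_quantile: find the
// bucket whose cumulative count reaches rank = q * total, then linearly
// interpolate between that bucket's bounds.
func quantile(q float64, bounds []float64, cumulative []float64) float64 {
	total := cumulative[len(cumulative)-1]
	rank := q * total
	for i, c := range cumulative {
		if c >= rank {
			lower, countBelow := 0.0, 0.0
			if i > 0 {
				lower = bounds[i-1]
				countBelow = cumulative[i-1]
			}
			upper := bounds[i]
			return lower + (upper-lower)*(rank-countBelow)/(c-countBelow)
		}
	}
	return bounds[len(bounds)-1]
}

func main() {
	bounds := []float64{0.1, 0.5, 1, 5}      // latency bucket bounds, seconds
	cumulative := []float64{50, 90, 99, 100} // cumulative observation counts
	// p95 falls in the (0.5, 1] bucket and is interpolated within it.
	fmt.Printf("%.3f\n", quantile(0.95, bounds, cumulative)) // 0.778
}
```

Because the estimate is interpolated, a coarser but well-placed bucket layout can report nearly the same percentiles as a fine-grained one, at a fraction of the storage cost.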