
metric: migrate all histograms to use prometheus-backed version #86671

Merged

Conversation

aadityasondhi
Collaborator

@aadityasondhi aadityasondhi commented Aug 23, 2022

In a previous change, a new prometheus-backed histogram library was
introduced to help standardize histogram buckets across the codebase.
This change migrates all existing histograms to use the new library.

In this change, NewLatency() is removed in favor of explicitly defining
which buckets to use between NetworkLatencyBuckets and
IOLatencyBuckets when calling NewHistogram(). For all histograms
that were previously created using the NewLatency() func, I tried to
place them in appropriate buckets with the new library. For cases where
it was unclear, I chose IOLatencyBuckets as it allows for a larger
range of values.

related: #85990

Release justification: low risk, high benefit

Release note (ops change): This change introduces a new histogram
implementation that will reduce the total number of buckets and
standardize them across all usage. This should improve the
usability of histograms when exported to a UI (e.g. Grafana) and reduce
the storage overhead.

After applying this patch, expect to see fewer buckets in
Prometheus/Grafana but similar values for histogram percentiles, since
Prometheus interpolates within buckets.
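The claim that percentiles stay similar rests on Prometheus interpolating linearly within the bucket a quantile falls into. A minimal stdlib-only sketch of that interpolation follows; the `bucket` type and `quantile` function are illustrative, not Prometheus's actual implementation (Prometheus additionally assumes a lower bound of 0 for the lowest bucket, which this sketch mirrors):

```go
package main

import "fmt"

// bucket mirrors one cumulative histogram bucket as Prometheus stores it:
// le is the upper bound, count is the cumulative number of observations <= le.
type bucket struct {
	le    float64
	count float64
}

// quantile approximates histogram_quantile-style estimation: find the bucket
// the target rank falls into, then linearly interpolate within its bounds.
// The lowest bucket is assumed to start at 0.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lower, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			// Interpolate proportionally to where the rank sits in this bucket.
			return lower + (b.le-lower)*(rank-prevCount)/(b.count-prevCount)
		}
		lower, prevCount = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// Cumulative counts for buckets with upper bounds 0.1s, 1s, 10s.
	h := []bucket{{0.1, 50}, {1, 90}, {10, 100}}
	fmt.Println(quantile(0.5, h)) // median lands in the first bucket
}
```

With coarser buckets the estimate moves within a bucket rather than disappearing, which is why fewer buckets still yield similar percentile values.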

@aadityasondhi aadityasondhi added the do-not-merge bors won't merge a PR with this label. label Aug 23, 2022
@cockroach-teamcity
Member

This change is Reviewable

@aadityasondhi aadityasondhi force-pushed the aadityas/histogram-upgrade-v2 branch 15 times, most recently from 9b4984b to ce8c203 on August 26, 2022 19:38
@aadityasondhi aadityasondhi removed the do-not-merge bors won't merge a PR with this label. label Aug 26, 2022
@aadityasondhi aadityasondhi marked this pull request as ready for review August 26, 2022 19:39
@aadityasondhi aadityasondhi requested a review from a team as a code owner August 26, 2022 19:39
@aadityasondhi aadityasondhi requested a review from a team August 26, 2022 19:39
@aadityasondhi aadityasondhi requested review from a team as code owners August 26, 2022 19:39
@aadityasondhi aadityasondhi requested a review from a team August 26, 2022 19:39
@aadityasondhi aadityasondhi requested a review from a team as a code owner August 26, 2022 19:39
@aadityasondhi aadityasondhi requested review from stevendanna and removed request for a team August 26, 2022 19:39
Member

@tbg tbg left a comment


None of the comments are blocking but I think they are substantial to avoid future slip-ups, so consider a follow-up PR if necessary.

I didn't review the changes line by line (only the bucket boundaries), but I checked a few and this seems largely mechanical and hard to get wrong.

536699575188.601318, // 8m56.699575188s
1012173589826.278687, // 16m52.173589826s
1908880541934.094238, // 31m48.880541934s
3599999999999.998535, // 59m59.999999999s
Member


Does this make sense? Doesn't this record things like jobs which can run for hours, days, weeks? If this isn't the intended use case, consider a rename to make the 1h limit more obvious.
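The bounds quoted above are nanosecond values spaced exponentially up to roughly one hour. A stdlib-only sketch of how such a ladder can be generated, similar in spirit to `prometheus.ExponentialBucketsRange` (this helper is illustrative, not the code used in the PR):

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// exponentialBucketsRange returns count bucket upper bounds spaced
// exponentially between min and max (count must be >= 2, min > 0).
// The last bound equals max up to floating-point rounding, which is why
// the quoted values end just shy of exactly one hour in nanoseconds.
func exponentialBucketsRange(min, max float64, count int) []float64 {
	growth := math.Pow(max/min, 1/float64(count-1))
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = min * math.Pow(growth, float64(i))
	}
	return buckets
}

func main() {
	// Bounds in nanoseconds from 500µs up to one hour (counts chosen for the demo).
	for _, b := range exponentialBucketsRange(500e3, float64(time.Hour.Nanoseconds()), 10) {
		fmt.Printf("%.0f // %s\n", b, time.Duration(b))
	}
}
```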

// CountBuckets are prometheus histogram buckets suitable for a histogram that
// records a quantity that is a count (unit-less) in which most measurements are
// in the 1 to ~1000 range during normal operation.
var CountBuckets = []float64{
Member


Here too it might make sense to encode the 1k upper limit in the name, Count1KBuckets, to avoid accidentally using this in an unsuitable context.


// PercentBuckets are prometheus histogram buckets suitable for a histogram that
// records a percent quantity [0,100]
var PercentBuckets = []float64{
Member


Percent100Buckets could avoid the check that otherwise all readers will do about whether this is [0.0,1.0] or [0,100].

// DataSizeBuckets are prometheus histogram buckets suitable for a histogram that
// records a quantity that is a size (byte-denominated) in which most measurements are
// in the kB to MB range during normal operation.
var DataSizeBuckets = []float64{
Member


ditto about encoding 16mb in name


// MemoryUsageBuckets are prometheus histogram buckets suitable for a histogram that
// records memory usage (in Bytes)
var MemoryUsageBuckets = []float64{
Member


Ditto; an upper limit of 19kb seems very arbitrary and not sufficient for many future memory-monitoring needs. For example, in the raft transport we're dealing with entries that are SSTs, typically of size 16mb, so there we'd want at least the 16mb range to be resolved properly.

Collaborator Author

@aadityasondhi aadityasondhi left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna and @tbg)


pkg/util/metric/histogram_buckets.go line 100 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Does this make sense? Doesn't this record things like jobs which can run for hours, days, weeks? If this isn't the intended use case, consider a rename to make the 1h limit more obvious.

Noted, I will revisit the naming for this one.


pkg/util/metric/histogram_buckets.go line 106 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Here too it might make sense to encode the 1k upper limit in the name, Count1KBuckets, to avoid accidentally using this in an unsuitable context.

ack


pkg/util/metric/histogram_buckets.go line 123 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Percent100Buckets could avoid the check that otherwise all readers will do about whether this is [0.0,1.0] or [0,100].

ack


pkg/util/metric/histogram_buckets.go line 140 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

ditto about encoding 16mb in name

ack


pkg/util/metric/histogram_buckets.go line 161 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

ditto an upper limit of 19kb seems very arbitrary and not sufficient for many future memory monitoring asks. For example, in the raft transport we're dealing with entries that are SSTs, typically of size 16mb, so there we'd want at least the 16mb range to be resolved properly.

This came from what seemed to be a very common upper limit for many buckets across the codebase: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/mem_metrics.go#L76

Thanks for providing the context on what our memory usage looks like. I will update these buckets!

@aadityasondhi aadityasondhi force-pushed the aadityas/histogram-upgrade-v2 branch 4 times, most recently from 3d4b86d to 6b2c224 on August 31, 2022 16:10
Collaborator

@dhartunian dhartunian left a comment


Thanks for taking the time to go through all the histograms! Just a few questions from me.

Reviewed 15 of 32 files at r1, 19 of 19 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, @stevendanna, and @tbg)


-- commits line 6 at r3:
Can you add some text about the methodology for conversion? Which calls to NewLatency get IO buckets vs Network etc.


pkg/ccl/sqlproxyccl/metrics.go line 230 at r3 (raw file):

			metaConnMigrationAttemptedCount,
			base.DefaultHistogramWindowInterval(),
			metric.IOLatencyBuckets,

why is connection latency using IO buckets and not Network?


pkg/util/metric/histogram_buckets.go line 161 at r1 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

This came from what seemed to be a very common upper limit for many buckets across the codebase: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/mem_metrics.go#L76

Thanks for providing the context on what our memory usage looks like. I will update these buckets!

I think we should have a bucket beyond 16 if we want to provide resolution at 16, right? Otherwise, it's unclear if it's at 16 or 512, say, in the max bucket.

Collaborator Author

@aadityasondhi aadityasondhi left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @stevendanna, and @tbg)


-- commits line 6 at r3:

Previously, dhartunian (David Hartunian) wrote…

Can you add some text about the methodology for conversion? Which calls to NewLatency get IO buckets vs Network etc.

I followed a very simplistic approach when doing this conversion. Since NewLatency used a MaxLatency of 10s for all histograms created using it and IOLatencyBuckets has the same max, I used IO for all. I will revisit this and change if Network is more appropriate for some of the Histograms.


pkg/ccl/sqlproxyccl/metrics.go line 230 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

why is connection latency using IO buckets and not Network?

see comment above


pkg/util/metric/histogram_buckets.go line 161 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I think we should have a bucket beyond 16 if we want to provide resolution at 16, right? Otherwise, it's unclear if it's at 16 or 512, say, in the max bucket.

We will still be able to resolve at 16 since Prometheus adds an automatic Inf bucket for catch-all greater than the largest bucket. But I will still go ahead and make this 32 MB.

@aadityasondhi aadityasondhi force-pushed the aadityas/histogram-upgrade-v2 branch 2 times, most recently from f5739ec to ddbd1cd on August 31, 2022 18:45
Collaborator

@dhartunian dhartunian left a comment


:lgtm: Not a dealbreaker, but it would be nice if the _status/vars output listed the buckets in numerical order instead of lexicographic order.

Reviewed 4 of 5 files at r4, 11 of 11 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna and @tbg)

@aadityasondhi
Collaborator Author

bors r+

@craig
Contributor

craig bot commented Sep 1, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Sep 1, 2022

Build succeeded:

@craig craig bot merged commit 5aebc22 into cockroachdb:master Sep 1, 2022
@aadityasondhi aadityasondhi deleted the aadityas/histogram-upgrade-v2 branch September 1, 2022 17:18
tbg added a commit to tbg/cockroach that referenced this pull request Sep 22, 2022
Front-ports some testing improvements from cockroachdb#88331.

Note that 22.2 and master don't exhibit the bug fixed in that
PR since we switched to using prometheus histograms in cockroachdb#86671.

Release note: None
craig bot pushed a commit that referenced this pull request Sep 23, 2022
88446: metric: front-port a regression test r=aadityasondhi a=tbg

Front-ports some testing improvements from #88331.

Note that 22.2 and master don't exhibit the bug fixed in that
PR since we switched to using prometheus histograms in #86671.

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>