
metric: migrate all histograms to use prometheus-backed version #86671

Merged

Conversation

aadityasondhi
Collaborator

@aadityasondhi aadityasondhi commented Aug 23, 2022

In a previous change, a new prometheus-backed histogram library was
introduced to help standardize histogram buckets across the codebase.
This change migrates all existing histograms to use the new library.

In this change, NewLatency() is removed in favor of explicitly defining
which buckets to use between NetworkLatencyBuckets and
IOLatencyBuckets when calling NewHistogram(). For all histograms
that were previously created using the NewLatency() func, I tried to
place them in appropriate buckets with the new library. For cases where
it was unclear, I chose IOLatencyBuckets as it allows for a larger
range of values.

related: #85990

Release justification: low risk, high benefit

Release note (ops change): This change introduces a new histogram
implementation that will reduce the total number of buckets and
standardize them across all usage. This should improve the
usability of histograms when exported to a UI (e.g. Grafana) and reduce
the storage overhead.

After applying this patch, expect to see fewer buckets in
Prometheus/Grafana but similar values for histogram percentiles, since
Prometheus interpolates within buckets.
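The claim that percentiles stay similar rests on Prometheus interpolating linearly within the bucket a quantile falls into. A minimal stdlib-only sketch of that interpolation follows; the `bucket` type and `quantile` function are illustrative, not Prometheus's actual implementation (Prometheus additionally assumes a lower bound of 0 for the lowest bucket, which this sketch mirrors):

```go
package main

import "fmt"

// bucket mirrors one cumulative histogram bucket as Prometheus stores it:
// le is the upper bound, count is the cumulative number of observations <= le.
type bucket struct {
	le    float64
	count float64
}

// quantile approximates histogram_quantile-style estimation: find the bucket
// the target rank falls into, then linearly interpolate within its bounds.
// The lowest bucket is assumed to start at 0.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lower, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			// Interpolate proportionally to where the rank sits in this bucket.
			return lower + (b.le-lower)*(rank-prevCount)/(b.count-prevCount)
		}
		lower, prevCount = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// Cumulative counts for buckets with upper bounds 0.1s, 1s, 10s.
	h := []bucket{{0.1, 50}, {1, 90}, {10, 100}}
	fmt.Println(quantile(0.5, h)) // median lands in the first bucket
}
```

With coarser buckets the estimate moves within a bucket rather than disappearing, which is why fewer buckets still yield similar percentile values.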

@aadityasondhi aadityasondhi added the do-not-merge bors won't merge a PR with this label. label Aug 23, 2022
@cockroach-teamcity
Member

This change is Reviewable

@aadityasondhi aadityasondhi force-pushed the aadityas/histogram-upgrade-v2 branch 15 times, most recently from 9b4984b to ce8c203 on August 26, 2022 19:38
@aadityasondhi aadityasondhi removed the do-not-merge bors won't merge a PR with this label. label Aug 26, 2022
@aadityasondhi aadityasondhi marked this pull request as ready for review August 26, 2022 19:39
@aadityasondhi aadityasondhi requested a review from a team as a code owner August 26, 2022 19:39
@aadityasondhi aadityasondhi requested a review from a team August 26, 2022 19:39
@aadityasondhi aadityasondhi requested review from a team as code owners August 26, 2022 19:39
@aadityasondhi aadityasondhi requested a review from a team August 26, 2022 19:39
@aadityasondhi aadityasondhi requested a review from a team as a code owner August 26, 2022 19:39
@aadityasondhi aadityasondhi requested review from stevendanna and removed request for a team August 26, 2022 19:39
Member

@tbg tbg left a comment


None of the comments are blocking but I think they are substantial to avoid future slip-ups, so consider a follow-up PR if necessary.

I didn't review the changes line by line (only the bucket boundaries), but I checked a few and this seems largely mechanical and hard to get wrong.

536699575188.601318, // 8m56.699575188s
1012173589826.278687, // 16m52.173589826s
1908880541934.094238, // 31m48.880541934s
3599999999999.998535, // 59m59.999999999s
Member


Does this make sense? Doesn't this record things like jobs which can run for hours, days, weeks? If this isn't the intended use case, consider a rename to make the 1h limit more obvious.
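The bounds quoted above are nanosecond values spaced exponentially up to roughly one hour. A stdlib-only sketch of how such a ladder can be generated, similar in spirit to `prometheus.ExponentialBucketsRange` (this helper is illustrative, not the code used in the PR):

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// exponentialBucketsRange returns count bucket upper bounds spaced
// exponentially between min and max (count must be >= 2, min > 0).
// The last bound equals max up to floating-point rounding, which is why
// the quoted values end just shy of exactly one hour in nanoseconds.
func exponentialBucketsRange(min, max float64, count int) []float64 {
	growth := math.Pow(max/min, 1/float64(count-1))
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = min * math.Pow(growth, float64(i))
	}
	return buckets
}

func main() {
	// Bounds in nanoseconds from 500µs up to one hour (counts chosen for the demo).
	for _, b := range exponentialBucketsRange(500e3, float64(time.Hour.Nanoseconds()), 10) {
		fmt.Printf("%.0f // %s\n", b, time.Duration(b))
	}
}
```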

// CountBuckets are prometheus histogram buckets suitable for a histogram that
// records a quantity that is a count (unit-less) in which most measurements are
// in the 1 to ~1000 range during normal operation.
var CountBuckets = []float64{
Member


Here too it might make sense to encode the 1k upper limit in the name, Count1KBuckets, to avoid accidentally using this in an unsuitable context.


// PercentBuckets are prometheus histogram buckets suitable for a histogram that
// records a percent quantity [0,100]
var PercentBuckets = []float64{
Member


Percent100Buckets could avoid the check that otherwise all readers will do about whether this is [0.0,1.0] or [0,100].

// DataSizeBuckets are prometheus histogram buckets suitable for a histogram that
// records a quantity that is a size (byte-denominated) in which most measurements are
// in the kB to MB range during normal operation.
var DataSizeBuckets = []float64{
Member


ditto about encoding 16mb in name


// MemoryUsageBuckets are prometheus histogram buckets suitable for a histogram that
// records memory usage (in Bytes)
var MemoryUsageBuckets = []float64{
Member


Ditto; an upper limit of 19kb seems very arbitrary and not sufficient for many future memory-monitoring needs. For example, in the raft transport we're dealing with entries that are SSTs, typically of size 16mb, so there we'd want at least the 16mb range to be resolved properly.

Collaborator Author

@aadityasondhi aadityasondhi left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna and @tbg)


pkg/util/metric/histogram_buckets.go line 100 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Does this make sense? Doesn't this record things like jobs which can run for hours, days, weeks? If this isn't the intended use case, consider a rename to make the 1h limit more obvious.

Noted, I will revisit the naming for this one.


pkg/util/metric/histogram_buckets.go line 106 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Here too it might make sense to encode the 1k upper limit in the name, Count1KBuckets, to avoid accidentally using this in an unsuitable context.

ack


pkg/util/metric/histogram_buckets.go line 123 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Percent100Buckets could avoid the check that otherwise all readers will do about whether this is [0.0,1.0] or [0,100].

ack


pkg/util/metric/histogram_buckets.go line 140 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

ditto about encoding 16mb in name

ack


pkg/util/metric/histogram_buckets.go line 161 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

ditto an upper limit of 19kb seems very arbitrary and not sufficient for many future memory monitoring asks. For example, in the raft transport we're dealing with entries that are SSTs, typically of size 16mb, so there we'd want at least the 16mb range to be resolved properly.

This came from what seemed to be a very common upper limit for many buckets across the codebase: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/mem_metrics.go#L76

Thanks for providing the context on what our memory usage looks like. I will update these buckets!

@aadityasondhi aadityasondhi force-pushed the aadityas/histogram-upgrade-v2 branch 4 times, most recently from 3d4b86d to 6b2c224 on August 31, 2022 16:10
Collaborator

@dhartunian dhartunian left a comment


Thanks for taking the time to go through all the histograms! Just a few questions from me.

Reviewed 15 of 32 files at r1, 19 of 19 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, @stevendanna, and @tbg)


-- commits line 6 at r3:
Can you add some text about the methodology for conversion? Which calls to NewLatency get IO buckets vs Network etc.


pkg/ccl/sqlproxyccl/metrics.go line 230 at r3 (raw file):

			metaConnMigrationAttemptedCount,
			base.DefaultHistogramWindowInterval(),
			metric.IOLatencyBuckets,

why is connection latency using IO buckets and not Network?


pkg/util/metric/histogram_buckets.go line 161 at r1 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

This came from what seemed to be a very common upper limit for many buckets across the codebase: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/mem_metrics.go#L76

Thanks for providing the context on what our memory usage looks like. I will update these buckets!

I think we should have a bucket beyond 16 if we want to provide resolution at 16, right? Otherwise, it's unclear if it's at 16 or 512, say, in the max bucket.

Collaborator Author

@aadityasondhi aadityasondhi left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @stevendanna, and @tbg)


-- commits line 6 at r3:

Previously, dhartunian (David Hartunian) wrote…

Can you add some text about the methodology for conversion? Which calls to NewLatency get IO buckets vs Network etc.

I followed a very simplistic approach when doing this conversion. Since NewLatency used a MaxLatency of 10s for all histograms created using it and IOLatencyBuckets has the same max, I used IO for all. I will revisit this and change if Network is more appropriate for some of the Histograms.


pkg/ccl/sqlproxyccl/metrics.go line 230 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

why is connection latency using IO buckets and not Network?

see comment above


pkg/util/metric/histogram_buckets.go line 161 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I think we should have a bucket beyond 16 if we want to provide resolution at 16, right? Otherwise, it's unclear if it's at 16 or 512, say, in the max bucket.

We will still be able to resolve at 16 since Prometheus adds an automatic Inf bucket for catch-all greater than the largest bucket. But I will still go ahead and make this 32 MB.

@aadityasondhi aadityasondhi force-pushed the aadityas/histogram-upgrade-v2 branch 2 times, most recently from f5739ec to ddbd1cd on August 31, 2022 18:45
Collaborator

@dhartunian dhartunian left a comment


:lgtm: Not a dealbreaker, but it would be nice if the _status/vars output listed the buckets in numerical order instead of lexicographic order.

Reviewed 4 of 5 files at r4, 11 of 11 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna and @tbg)

@aadityasondhi
Collaborator Author

bors r+

@craig
Contributor

craig bot commented Sep 1, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Sep 1, 2022

Build succeeded:

@craig craig bot merged commit 5aebc22 into cockroachdb:master Sep 1, 2022
@aadityasondhi aadityasondhi deleted the aadityas/histogram-upgrade-v2 branch September 1, 2022 17:18
tbg added a commit to tbg/cockroach that referenced this pull request Sep 22, 2022
Front-ports some testing improvements from cockroachdb#88331.

Note that 22.2 and master don't exhibit the bug fixed in that
PR since we switched to using prometheus histograms in cockroachdb#86671.

Release note: None
craig bot pushed a commit that referenced this pull request Sep 23, 2022
88446: metric: front-port a regression test r=aadityasondhi a=tbg

Front-ports some testing improvements from #88331.

Note that 22.2 and master don't exhibit the bug fixed in that
PR since we switched to using prometheus histograms in #86671.

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>