
util/metric: NetworkLatencyBuckets ceiling too low #104017

Closed
erikgrinaker opened this issue May 28, 2023 · 4 comments · Fixed by #106193
Assignees
Labels
A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

erikgrinaker (Contributor) commented May 28, 2023

The NetworkLatencyBuckets histogram buckets max out at 1 second. This is too low for most of their current users: latencies can easily reach several seconds when clusters are struggling, and we'd want to see the actual latencies -- in particular:

  • liveness.heartbeatlatency
  • leases.requests.latency
  • kv.prober.read.latency
  • kv.prober.write.latency

@dhartunian What would you recommend for these? Should we change them to IOLatencyBuckets, or increase the range on NetworkLatencyBuckets to 10 seconds? Is it safe to do so from a backwards compatibility perspective?

Jira issue: CRDB-28314

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability-inf A-observability-inf labels May 28, 2023
@ericharmeling ericharmeling self-assigned this Jul 5, 2023
ericharmeling commented

> Should we change them to IOLatencyBuckets

This option looks to be the least disruptive; I think we want to avoid losing fidelity in the histograms that the current NetworkLatencyBuckets fit well.

#97144 tracks finding a more data-driven solution to fitting bucket sizes and limits to individual metric histograms.

erikgrinaker (Contributor, Author) commented

Yeah, we can do that. Will it cause any problems to simply flip them?

> I think we want to avoid losing fidelity in the histograms for which the current NetworkLatencyBuckets fit well.

I think the problem is that the current buckets aren't appropriate for any metrics. Latencies can easily exceed a second when clusters are struggling, and we need to capture that.

ericharmeling commented Jul 5, 2023

> Will it cause any problems to simply flip them?

AFAIK, these buckets are only used in the calculation of the quantile values stored in tsdb, so we should be good to change the buckets used and then backport (noting that there will be some expected changes in the recorded p-values following an upgrade).

> I think the problem is that the current buckets aren't appropriate for any metrics. Latencies can easily exceed a second when clusters are struggling, and we need to capture that.

A cursory look at the corresponding Prometheus metrics in centmon supports your claim. Besides the metrics listed in the issue description, it looks like we only use NetworkLatencyBuckets for proxy.conn_migration.attempted.latency.

I'm in favor of switching all of them over to IOLatencyBuckets, especially since the bucket sizes increase logarithmically, retaining a good amount of fidelity at the lower end. I'll open a PR.

dhartunian (Collaborator) commented

> AFAIK, these buckets are only used in the calculation of the quantile values stored in tsdb, so we should be good to change the buckets used and then backport (noting that there will be some expected changes in the p-values recorded following an upgrade).

@ericharmeling these buckets are also scraped directly via _status/vars by any Prometheus instance we or a customer runs. In general, I think changing the bucket boundaries is acceptable, especially for increased accuracy, since customers typically compute percentiles from them as well.
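Those percentiles are estimated from the cumulative bucket counts by linear interpolation inside the bucket containing the target rank (this is what PromQL's histogram_quantile does), which is why bucket boundaries directly affect reported p-values. A hypothetical sketch with invented counts:

```go
package main

import "fmt"

// bucket holds one cumulative histogram bucket: count is the number of
// observations with value <= le (seconds), as exported by Prometheus.
type bucket struct {
	le    float64
	count uint64
}

// quantile estimates the q-quantile (0 < q < 1) from cumulative buckets
// by locating the bucket containing the target rank and interpolating
// linearly between its lower and upper bounds.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * float64(total)
	var lower float64
	var prev uint64
	for _, b := range buckets {
		if float64(b.count) >= rank {
			frac := (rank - float64(prev)) / float64(b.count-prev)
			return lower + (b.le-lower)*frac
		}
		lower, prev = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// Invented counts: 50 obs <= 100ms, 80 <= 500ms, 95 <= 1s, 100 <= 10s.
	buckets := []bucket{{0.1, 50}, {0.5, 80}, {1, 95}, {10, 100}}
	// rank 90 falls in the (0.5s, 1s] bucket, 2/3 of the way through it.
	fmt.Printf("p90 estimate: %.2fs\n", quantile(0.9, buckets))
}
```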

craig bot pushed a commit that referenced this issue Jul 10, 2023
106193: util/metric: change preset buckets from NetworkLatencyBuckets to IOLatencyBuckets r=ericharmeling a=ericharmeling

This commit replaces NetworkLatencyBuckets with IOLatencyBuckets as the preset buckets used for the following metrics' histograms:
- `liveness.heartbeatlatency`
- `leases.requests.latency`
- `kv.prober.read.latency`
- `kv.prober.write.latency`
- `proxy.conn_migration.attempted.latency`

The upper limit on NetworkLatencyBuckets (1s) is too low for these metrics. Bucket size for all preset buckets increases logarithmically (see `prometheus.ExponentialBucketsRange`), retaining fidelity at the lower-end of buckets.

Fixes #104017.

Release note: None

106481: changefeedccl: Update previous row builder when version changes Parquet r=miretskiy a=miretskiy

Parquet writer incorrectly cached "value builder" state, even
when row version changed.

Epic: None

Release note: None

106536: row: Avoid allocations when using ConsumeKVProvider r=miretskiy a=miretskiy

There is no need to allocate new KVFetcher when calling
ConsumeKVProvider repeatedly.

Issues: None
Epic: None
Release note: None

Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
@craig craig bot closed this as completed in 831f979 Jul 10, 2023
ericharmeling pushed a commit to ericharmeling/cockroach that referenced this issue Jul 18, 2023
This commit removes the generated NetworkLatencyBuckets and
replaces their usage with IOLatencyBuckets as the preset
buckets used for the following metrics' histograms:
- `liveness.heartbeatlatency`
- `leases.requests.latency`
- `kv.prober.read.latency`
- `kv.prober.write.latency`
- `proxy.conn_migration.attempted.latency`

The upper limit on NetworkLatencyBuckets (1s) is too low for
all metrics that currently use it. Bucket size for all buckets
generated with `prometheus.ExponentialBucketsRange` (including
IOLatencyBuckets) increases logarithmically, retaining fidelity
at the lower-end of buckets.

Fixes cockroachdb#104017.

Release note: None
ericharmeling pushed a commit to ericharmeling/cockroach that referenced this issue Jul 18, 2023