
util/metric: NetworkLatencyBuckets ceiling too low #104017

Closed
erikgrinaker opened this issue May 28, 2023 · 4 comments · Fixed by #106193
Assignees
Labels
A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

erikgrinaker (Contributor) commented May 28, 2023

The NetworkLatencyBuckets histogram buckets max out at 1 second. This is too low for most of their current users: latencies can easily reach several seconds when clusters are struggling, and we'd want to see the actual latencies -- in particular:

  • liveness.heartbeatlatency
  • leases.requests.latency
  • kv.prober.read.latency
  • kv.prober.write.latency

@dhartunian What would you recommend for these? Should we change them to IOLatencyBuckets, or increase the range on NetworkLatencyBuckets to 10 seconds? Is it safe to do so from a backwards compatibility perspective?

Jira issue: CRDB-28314

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-observability-inf A-observability-inf labels May 28, 2023
@ericharmeling ericharmeling self-assigned this Jul 5, 2023
ericharmeling commented

> Should we change them to IOLatencyBuckets

This option looks to be the least disruptive; I think we want to avoid losing fidelity in the histograms that the current NetworkLatencyBuckets fit well.

#97144 tracks finding a more data-driven solution to fitting bucket sizes and limits to individual metric histograms.

erikgrinaker (Contributor, Author) commented

Yeah, we can do that. Will it cause any problems to simply flip them?

> I think we want to avoid losing fidelity in the histograms for which the current NetworkLatencyBuckets fit well.

I think the problem is that the current buckets aren't appropriate for any metrics. Latencies can easily exceed a second when clusters are struggling, and we need to capture that.

ericharmeling commented Jul 5, 2023

> Will it cause any problems to simply flip them?

AFAIK, these buckets are only used in the calculation of the quantile values stored in tsdb, so we should be good to change the buckets used and then backport (noting that there will be some expected changes in the recorded p-values following an upgrade).

> I think the problem is that the current buckets aren't appropriate for any metrics. Latencies can easily exceed a second when clusters are struggling, and we need to capture that.

A cursory look at the corresponding Prometheus metrics in centmon supports your claim. Besides the metrics listed in the issue description, it looks like we only use NetworkLatencyBuckets for proxy.conn_migration.attempted.latency.

I'm in favor of switching all of them over to IOLatencyBuckets, especially since the bucket sizes increase logarithmically, retaining a good amount of fidelity at the lower end. I'll open a PR.

dhartunian (Collaborator) commented

> AFAIK, these buckets are only used in the calculation of the quantile values stored in tsdb, so we should be good to change the buckets used and then backport (noting that there will be some expected changes in the p-values recorded following an upgrade).

@ericharmeling these buckets are also scraped directly via _status/vars by any Prometheus instance we or a customer runs. In general, I think changing the bucket boundaries is acceptable, especially for increased accuracy, since customers typically compute percentiles from them as well.
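Those percentiles are estimated from the cumulative bucket counts by linear interpolation inside the bucket containing the target rank (this is what PromQL's histogram_quantile does), which is why bucket boundaries directly affect reported p-values. A hypothetical sketch with invented counts:

```go
package main

import "fmt"

// bucket holds one cumulative histogram bucket: count is the number of
// observations with value <= le (seconds), as exported by Prometheus.
type bucket struct {
	le    float64
	count uint64
}

// quantile estimates the q-quantile (0 < q < 1) from cumulative buckets
// by locating the bucket containing the target rank and interpolating
// linearly between its lower and upper bounds.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * float64(total)
	var lower float64
	var prev uint64
	for _, b := range buckets {
		if float64(b.count) >= rank {
			frac := (rank - float64(prev)) / float64(b.count-prev)
			return lower + (b.le-lower)*frac
		}
		lower, prev = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// Invented counts: 50 obs <= 100ms, 80 <= 500ms, 95 <= 1s, 100 <= 10s.
	buckets := []bucket{{0.1, 50}, {0.5, 80}, {1, 95}, {10, 100}}
	// rank 90 falls in the (0.5s, 1s] bucket, 2/3 of the way through it.
	fmt.Printf("p90 estimate: %.2fs\n", quantile(0.9, buckets))
}
```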

craig bot pushed a commit that referenced this issue Jul 10, 2023
106193: util/metric: change preset buckets from NetworkLatencyBuckets to IOLatencyBuckets r=ericharmeling a=ericharmeling

This commit replaces NetworkLatencyBuckets with IOLatencyBuckets as the preset buckets used for the following metrics' histograms:
- `liveness.heartbeatlatency`
- `leases.requests.latency`
- `kv.prober.read.latency`
- `kv.prober.write.latency`
- `proxy.conn_migration.attempted.latency`

The upper limit on NetworkLatencyBuckets (1s) is too low for these metrics. Bucket size for all preset buckets increases logarithmically (see `prometheus.ExponentialBucketsRange`), retaining fidelity at the lower-end of buckets.

Fixes #104017.

Release note: None

106481: changefeedccl: Update previous row builder when version changes Parquet r=miretskiy a=miretskiy

Parquet writer incorrectly cached "value builder" state, even
when row version changed.

Epic: None

Release note: None

106536: row: Avoid allocations when using ConsumeKVProvider r=miretskiy a=miretskiy

There is no need to allocate new KVFetcher when calling
ConsumeKVProvider repeatedly.

Issues: None
Epic: None
Release note: None

Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
@craig craig bot closed this as completed in 831f979 Jul 10, 2023
ericharmeling pushed a commit to ericharmeling/cockroach that referenced this issue Jul 18, 2023
This commit removes the generated NetworkLatencyBuckets and
replaces their usage with IOLatencyBuckets as the preset
buckets used for the following metrics' histograms:
- `liveness.heartbeatlatency`
- `leases.requests.latency`
- `kv.prober.read.latency`
- `kv.prober.write.latency`
- `proxy.conn_migration.attempted.latency`

The upper limit on NetworkLatencyBuckets (1s) is too low for
all metrics that currently use it. Bucket size for all buckets
generated with `prometheus.ExponentialBucketsRange` (including
IOLatencyBuckets) increases logarithmically, retaining fidelity
at the lower-end of buckets.

Fixes cockroachdb#104017.

Release note: None
ericharmeling pushed a commit to ericharmeling/cockroach that referenced this issue Jul 18, 2023