util/metric: NetworkLatencyBuckets ceiling too low #104017
Comments
This option looks to be the least disruptive, as I think we want to avoid losing fidelity in the histograms, for which #97144 tracks finding a more data-driven solution to fitting bucket sizes and limits to individual metric histograms.
Yeah, we can do that. Will it cause any problems to simply flip them?
I think the problem is that the current buckets aren't appropriate for any of the metrics that use them. Latencies can easily exceed a second when clusters are struggling, and we need to capture that.
AFAIK, these buckets are only used in the calculation of the quantile values stored in tsdb, so we should be good to change the buckets used and then backport (noting that there will be some expected changes in the p-values recorded following an upgrade).
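A minimal sketch of why the bucket boundaries affect the recorded p-values, assuming Prometheus-style cumulative buckets with linear interpolation inside a bucket; this is not CockroachDB's actual tsdb quantile code, and the bucket values are made up for illustration:

```go
package main

import "fmt"

// bucket is a cumulative histogram bucket: count of observations <= upperBound.
type bucket struct {
	upperBound float64 // seconds
	count      uint64
}

// estimateQuantile linearly interpolates the q-th quantile (0 < q <= 1) from
// cumulative buckets, Prometheus-style. The estimate can never exceed the
// largest finite bucket boundary, which is why a 1s ceiling caps every
// recorded p-value at 1s.
func estimateQuantile(q float64, buckets []bucket) float64 {
	total := float64(buckets[len(buckets)-1].count)
	rank := q * total
	for i, b := range buckets {
		if float64(b.count) >= rank {
			lowerBound, lowerCount := 0.0, 0.0
			if i > 0 {
				lowerBound = buckets[i-1].upperBound
				lowerCount = float64(buckets[i-1].count)
			}
			// Assume observations are spread uniformly within the bucket.
			frac := (rank - lowerCount) / (float64(b.count) - lowerCount)
			return lowerBound + frac*(b.upperBound-lowerBound)
		}
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	// Hypothetical buckets with a 1s ceiling.
	buckets := []bucket{{0.1, 10}, {0.5, 40}, {1.0, 100}}
	fmt.Printf("p99 estimate: %.3fs\n", estimateQuantile(0.99, buckets))
}
```

Changing the bucket boundaries shifts where the interpolation lands, so the same underlying latency distribution produces slightly different p-values after the switch.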
A cursory look at the corresponding prometheus metrics in centmon supports your claim. In addition to the metrics you listed in the issue description, it looks like we are really only using […]. I'm in favor of switching all over to IOLatencyBuckets.
@ericharmeling these buckets are also scraped by any prometheus instance we or the customer uses directly via […].
106193: util/metric: change preset buckets from NetworkLatencyBuckets to IOLatencyBuckets r=ericharmeling a=ericharmeling

This commit replaces NetworkLatencyBuckets with IOLatencyBuckets as the preset buckets used for the following metrics' histograms:

- `liveness.heartbeatlatency`
- `leases.requests.latency`
- `kv.prober.read.latency`
- `kv.prober.write.latency`
- `proxy.conn_migration.attempted.latency`

The upper limit on NetworkLatencyBuckets (1s) is too low for these metrics. Bucket size for all preset buckets increases logarithmically (see `prometheus.ExponentialBucketsRange`), retaining fidelity at the lower end of buckets.

Fixes #104017.

Release note: None

106481: changefeedccl: Update previous row builder when version changes Parquet r=miretskiy a=miretskiy

The Parquet writer incorrectly cached "value builder" state, even when the row version changed.

Epic: None

Release note: None

106536: row: Avoid allocations when using ConsumeKVProvider r=miretskiy a=miretskiy

There is no need to allocate a new KVFetcher when calling ConsumeKVProvider repeatedly.

Issues: None

Epic: None

Release note: None

Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
This commit removes the generated NetworkLatencyBuckets and replaces their usage with IOLatencyBuckets as the preset buckets used for the following metrics' histograms:

- `liveness.heartbeatlatency`
- `leases.requests.latency`
- `kv.prober.read.latency`
- `kv.prober.write.latency`
- `proxy.conn_migration.attempted.latency`

The upper limit on NetworkLatencyBuckets (1s) is too low for all metrics that currently use it. Bucket size for all buckets generated with `prometheus.ExponentialBucketsRange` (including IOLatencyBuckets) increases logarithmically, retaining fidelity at the lower end of buckets.

Fixes cockroachdb#104017.

Release note: None
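For reference, a sketch of how logarithmically spaced buckets are generated with `prometheus.ExponentialBucketsRange`; the min/max/count values below are illustrative assumptions, not the exact parameters CockroachDB uses for IOLatencyBuckets:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// 60 buckets spanning 10µs .. 10s (illustrative values). Each upper bound
	// grows by a constant factor, so fidelity at the low end is preserved
	// while the ceiling stays high enough to distinguish multi-second
	// latencies on a struggling cluster.
	buckets := prometheus.ExponentialBucketsRange(10e-6, 10, 60)
	fmt.Println(len(buckets), buckets[0], buckets[len(buckets)-1])
}
```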
The `NetworkLatencyBuckets` histogram buckets max out at 1 second. This is too low for most of the current users, where latencies can easily be several seconds when clusters are struggling, and we'd want to see the actual latencies -- in particular:

- `liveness.heartbeatlatency`
- `leases.requests.latency`
- `kv.prober.read.latency`
- `kv.prober.write.latency`

@dhartunian What would you recommend for these? Should we change them to `IOLatencyBuckets`, or increase the range on `NetworkLatencyBuckets` to 10 seconds? Is it safe to do so from a backwards compatibility perspective?

Jira issue: CRDB-28314
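A minimal sketch of the fidelity loss described above, using plain `client_golang`; the metric name and bucket boundaries are illustrative (only the 1s ceiling is taken from the issue), not CockroachDB's actual NetworkLatencyBuckets definition:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

func main() {
	// Buckets with a 1s ceiling, mimicking the problem described above.
	h := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "example_heartbeat_latency_seconds", // hypothetical metric name
		Buckets: []float64{0.001, 0.01, 0.1, 0.5, 1.0},
	})

	h.Observe(1.2)  // a slow heartbeat on a struggling cluster
	h.Observe(30.0) // a catastrophically slow heartbeat

	var m dto.Metric
	if err := h.Write(&m); err != nil {
		panic(err)
	}
	for _, b := range m.GetHistogram().GetBucket() {
		fmt.Printf("<= %gs: %d\n", b.GetUpperBound(), b.GetCumulativeCount())
	}
	// Every finite bucket is empty: both observations fell past the 1s
	// ceiling, so 1.2s and 30s are indistinguishable in the histogram.
	fmt.Printf("count=%d sum=%gs\n",
		m.GetHistogram().GetSampleCount(), m.GetHistogram().GetSampleSum())
}
```

Raising the ceiling (whether by widening `NetworkLatencyBuckets` or switching to `IOLatencyBuckets`) spreads those slow observations across real buckets again, at the cost of shifting the recorded quantiles after the change.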