metric: add prometheus-based histogram #85990
Conversation
@sravotto would love your feedback on this as you've worked on the metrics translator code before. Our hope is that this work can help folks use our histograms effectively out of the box. @aadityasondhi can you paste some examples of histogram output in
@dhartunian they'll have bucket boundaries like these:

Of course this occurs only for histograms that actually use these bucket bounds. It's not enough to just land this PR - we also need to convert all histograms to HistogramV2 with sane buckets, and then need to also phase out Histogram (at which point we might as well rename HistogramV2 back to Histogram), losing the hdrhistogram dependency.
Reviewed 6 of 10 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @sravotto)
pkg/util/metric/metric.go
line 337 at r1 (raw file):
// resemble those of typical disk latencies, i.e. which are in the micro- and
// millisecond range during normal operation.
var IOLatencyBuckets = []float64{
In the metrics-exporter sidecar we used log10 linear buckets (for instance, for the 1-1000 interval: [1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100 200 300 400 500 600 700 800 900 1000...]). We did have many more buckets (around 40-50) per histogram.
I don't have any specific issues with using exponential buckets, assuming we can have different configurations in addition to IOLatencyBuckets and NetworkLatencyBuckets, depending on what we are measuring and how accurately we want to estimate quantiles.
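For reference, a minimal sketch of the log10-linear scheme described above; the function name and the assumption that min and max are powers of ten are illustrative, not the metrics-exporter's actual code:

```go
package main

import "fmt"

// log10LinearBuckets returns bucket bounds of the form
// [1 2 ... 9 10 20 ... 90 100 200 ...] spanning min to max,
// assuming min and max are powers of ten.
func log10LinearBuckets(min, max float64) []float64 {
	var buckets []float64
	for decade := min; decade < max; decade *= 10 {
		for step := 1.0; step <= 9.0; step++ {
			buckets = append(buckets, decade*step)
		}
	}
	return append(buckets, max)
}

func main() {
	// Reproduces the 1-1000 example from the comment above.
	fmt.Println(log10LinearBuckets(1, 1000))
}
```

This yields 28 bounds for the 1-1000 range; the 40-50 buckets mentioned above correspond to spanning more decades.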
@dhartunian This PR changes one of the histograms ( Prev:
Cur:
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @sravotto)
pkg/util/metric/metric.go
line 360 at r3 (raw file):
// behave like network latencies, i.e. most measurements are in the ms to sub-second
// range during normal operation.
var NetworkLatencyBuckets = []float64{
nit: We should provide some comment or docs around here to roughly outline the process for defining new bucket presets, where we advise that the obs-infra team is involved in the review process to make sure the new preset is sound.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @sravotto)
pkg/util/metric/metric.go
line 337 at r1 (raw file):
Previously, sravotto (silvano) wrote…
In the metrics-exporter sidecar we used log10 linear buckets (for instance, for the 1-1000 interval: [1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100 200 300 400 500 600 700 800 900 1000...]). We did have many more buckets (around 40-50) per histogram.
I don't have any specific issues with using exponential buckets, assuming we can have different configurations in addition to IOLatencyBuckets and NetworkLatencyBuckets, depending on what we are measuring and how accurately we want to estimate quantiles.
Yes, new buckets can be generated using TestHistogramBuckets.
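As a rough illustration of what generating a preset via such a test could look like (the actual TestHistogramBuckets may differ; prometheus.ExponentialBucketsRange is real client_golang API, the range and count here are hypothetical):

```go
package metric

import (
	"fmt"
	"strings"
	"testing"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// TestHistogramBuckets is sketched here as a generator that logs a
// copy-pasteable preset; the real test's ranges and counts may differ.
func TestHistogramBuckets(t *testing.T) {
	verify := func(name string, buckets []float64) {
		var sb strings.Builder
		fmt.Fprintf(&sb, "var %s = []float64{\n", name)
		for _, b := range buckets {
			// Annotate each bound with its human-readable duration,
			// assuming values are recorded in nanoseconds.
			fmt.Fprintf(&sb, "\t%f, // %s\n", b, time.Duration(b))
		}
		sb.WriteString("}")
		t.Log(sb.String())
	}
	// Hypothetical range/count; adjust per preset.
	verify("IOLatencyBuckets", prometheus.ExponentialBucketsRange(
		float64(10*time.Microsecond), float64(10*time.Second), 60))
}
```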
pkg/util/metric/metric.go
line 360 at r3 (raw file):
Previously, abarganier (Alex Barganier) wrote…
nit: We should provide some comment or docs around here to roughly outline the process for defining new bucket presets, where we advise that the obs-infra team is involved in the review process to make sure the new preset is sound.
I added a line in HistogramV2 and added some comments to TestHistogramBuckets indicating that we should generate new buckets using it and also involve the obs-infra team for reviews.
In a previous change, a new prometheus-backed histogram library was introduced to help standardize histogram buckets across the codebase. This change migrates all existing histograms to use the new library.

related: cockroachdb#85990

Release justification: low risk, high benefit

Release note (ops change): This change introduces a new histogram implementation that will reduce the total number of buckets and standardize them across all usage. This should help increase the usability of histograms when exported to a UI (i.e. Grafana).
This PR should be ready for review and merging. I created a separate PR (#86671) for migrating all of the existing histograms to V2.
modulo one tiny nit to remove a stale TODO - nice work on this!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @sravotto)
pkg/util/metric/metric.go
line 477 at r9 (raw file):
m := &prometheusgo.Metric{}
if err := h.cum.Write(m); err != nil {
	panic(err) // TODD
}
nit: remove TODO comment here and in ToPrometheusMetricWindowed.
I think it's acceptable to panic here - we're passing an externally sourced data structure (the prometheus.Histogram) to an externally sourced function. If that's not doing what we expect, something is certainly very wrong. Combine that with the fact that this has been run extensively on roachprod already - I feel safe about this.
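For context, the pattern under discussion boils down to something like this sketch; prometheus.Histogram's Write method is real client_golang API, while the helper name is made up (the PR aliases the protobuf package as prometheusgo; dto is the conventional alias for the same package):

```go
package metric

import (
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// snapshot serializes a histogram's current state into a protobuf
// metric. Write only fails if the state cannot be encoded, which would
// indicate a programmer error rather than a runtime condition, hence
// the panic.
func snapshot(h prometheus.Histogram) *dto.Metric {
	m := &dto.Metric{}
	if err := h.Write(m); err != nil {
		panic(err)
	}
	return m
}
```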
Code quote:

// TODD
This change builds on the previous one and adds a function to export quantiles from the Prometheus-based histogram. This functionality is used to store histogram data in the internal timeseries database. The hdr library came with a function to do this, while Prometheus does not have a public API for exporting quantiles. The function implemented here is very similar to the one found internally in Prometheus, using linear interpolation to calculate values at a given quantile. This commit also includes some additional testing and general refactoring of the metrics code.

Release note: None

Release justification: low risk, high benefit changes
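The interpolation this commit message describes can be sketched as follows; this is illustrative, modeled on Prometheus's internal quantile logic, not the PR's exact code:

```go
package main

import "fmt"

type bucket struct {
	upperBound float64 // inclusive upper edge of the bucket
	cumCount   uint64  // cumulative count of samples <= upperBound
}

// quantile estimates the q-th quantile (0 <= q <= 1) from cumulative
// bucket counts, assuming samples are spread uniformly within a bucket.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].cumCount
	rank := q * float64(total)
	for i, b := range buckets {
		if float64(b.cumCount) >= rank {
			lower, below := 0.0, uint64(0)
			if i > 0 {
				lower = buckets[i-1].upperBound
				below = buckets[i-1].cumCount
			}
			inBucket := float64(b.cumCount - below)
			if inBucket == 0 {
				return b.upperBound
			}
			// Linear interpolation between the bucket's edges.
			return lower + (b.upperBound-lower)*(rank-float64(below))/inBucket
		}
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	bs := []bucket{{10, 4}, {20, 14}, {50, 19}, {100, 20}}
	fmt.Println(quantile(0.5, bs)) // rank 10 falls in (10, 20]: prints 16
}
```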
bors r+
Build failed (retrying...):
Build succeeded:
In a previous change, a new prometheus-backed histogram library was introduced to help standardize histogram buckets across the codebase. This change migrates all existing histograms to use the new library.

related: cockroachdb#85990

Release justification: low risk, high benefit

Release note (ops change): This change introduces a new histogram implementation that will reduce the total number of buckets and standardize them across all usage. This should help increase the usability of histograms when exported to a UI (i.e. Grafana) and reduce the storage overhead. After applying this patch it is expected to see fewer buckets in prometheus/grafana, but still have similar values for histogram percentiles due to the use of interpolated values by Prometheus.
In a previous change, a new prometheus-backed histogram library was introduced to help standardize histogram buckets across the codebase. This change migrates all existing histograms to use the new library.

In this change, `NewLatency()` is removed in favor of explicitly defining which buckets to use between `NetworkLatencyBuckets` and `IOLatencyBuckets` when calling `NewHistogram()`. For all histograms that were previously created using the `NewLatency()` func, I tried to place them in appropriate buckets with the new library. For cases where it was unclear, I chose `IOLatencyBuckets` as it allows for a larger range of values.

related: cockroachdb#85990

Release justification: low risk, high benefit

Release note (ops change): This change introduces a new histogram implementation that will reduce the total number of buckets and standardize them across all usage. This should help increase the usability of histograms when exported to a UI (i.e. Grafana) and reduce the storage overhead. After applying this patch it is expected to see fewer buckets in prometheus/grafana, but still have similar values for histogram percentiles due to the use of interpolated values by Prometheus.
86671: metric: migrate all histograms to use prometheus-backed version r=aadityasondhi a=aadityasondhi

In a previous change, a new prometheus-backed histogram library was introduced to help standardize histogram buckets across the codebase. This change migrates all existing histograms to use the new library.

In this change, `NewLatency()` is removed in favor of explicitly defining which buckets to use between `NetworkLatencyBuckets` and `IOLatencyBuckets` when calling `NewHistogram()`. For all histograms that were previously created using the `NewLatency()` func, I tried to place them in appropriate buckets with the new library. For cases where it was unclear, I chose `IOLatencyBuckets` as it allows for a larger range of values.

related: #85990

Release justification: low risk, high benefit

Release note (ops change): This change introduces a new histogram implementation that will reduce the total number of buckets and standardize them across all usage. This should help increase the usability of histograms when exported to a UI (i.e. Grafana) and reduce the storage overhead. After applying this patch it is expected to see fewer buckets in prometheus/grafana, but still have similar values for histogram percentiles due to the use of interpolated values by Prometheus.

87084: roachtest: introduce multi tenant TPCH test r=yuzefovich a=yuzefovich

**roachtest: sort tests lexicographically**

This commit unexports a couple of registration methods of the roachtests and then orders all such methods lexicographically in the registry.

Release justification: test-only change.

Release note: None

**roachtest: introduce multi tenant TPCH test**

This commit introduces a new roachtest that runs the TPCH benchmark on a single node in a single-tenant deployment first, followed by another run in a multi-tenant deployment with a single SQL instance. It refactors the tpchvec roachtest a bit to extract a couple of helper methods for performing the latency analysis. In particular, it removes the `queriesToRun` parameter since all variants now run all 22 queries.

I ran into some difficulties around managing the certificates in different deployment models, so the test is structured so that first the single-node cluster is used in a single-tenant deployment model, and then - after running all 22 queries - a tenant with a single SQL instance is created and all queries are run within the tenant.

Here is the comparison of averages over 10 runs of single-tenant vs multi-tenant:
```
Q1:   6.34s  13.35s  110.39%
Q2:   0.22s   0.49s  122.07%
Q3:   4.88s   8.04s   64.67%
Q4:   1.51s   4.75s  213.67%
Q5:   2.33s  11.69s  400.90%
Q6:   5.51s  35.49s  543.89%
Q7:   5.75s  24.08s  318.42%
Q8:   0.85s   3.82s  348.47%
Q9:   7.34s  28.37s  286.25%
Q10:  1.99s   5.00s  150.93%
Q11:  0.55s   1.92s  249.00%
Q12:  6.02s  34.76s  477.05%
Q13:  1.88s   3.76s  100.00%
Q14:  0.64s   1.10s   73.11%
Q15:  3.33s  16.80s  404.23%
Q16:  0.88s   1.29s   47.66%
Q17:  0.24s   0.60s  145.49%
Q18:  7.75s  30.13s  288.48%
Q19:  5.20s  13.08s  151.69%
Q20: 12.66s  55.30s  336.92%
Q21:  6.98s  24.50s  250.77%
Q22:  0.62s   1.17s   90.26%
```

Fixes: #71528.

Release justification: test-only change.

Release note: None

Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Drive-by comment: this is very cool work!
From #81181:
Our current histogram is based on hdrhistogram. This tends to create
lots of buckets and is inflexible w.r.t. the bucket layout. In hindsight,
it was a poor choice of default implementation (I can say that since I
introduced it) and its cost is disincentivizing the introduction of
histograms that would be useful.
This commit introduces a histogram that is based on a completely vanilla
prometheus.Histogram. The only reason we need to wrap it is because we
want to export quantiles to CockroachDB's internal timeseries (it does
not support histograms) and this requires maintaining an internal windowed
histogram (on top of the cumulative histogram).
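A minimal sketch of the cumulative-plus-windowed arrangement described here; the type and method names are illustrative, not the PR's actual API:

```go
package metric

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// windowedHistogram pairs a cumulative histogram (what Prometheus
// scrapes) with a windowed one that is discarded on each tick, so that
// quantiles over the most recent window can be fed to the internal
// timeseries.
type windowedHistogram struct {
	mu   sync.Mutex
	opts prometheus.HistogramOpts
	cum  prometheus.Histogram // lifetime observations
	cur  prometheus.Histogram // current window only
}

func newWindowedHistogram(opts prometheus.HistogramOpts) *windowedHistogram {
	return &windowedHistogram{
		opts: opts,
		cum:  prometheus.NewHistogram(opts),
		cur:  prometheus.NewHistogram(opts),
	}
}

// RecordValue observes v in both the cumulative and windowed histograms.
func (h *windowedHistogram) RecordValue(v float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.cum.Observe(v)
	h.cur.Observe(v)
}

// Tick starts a new window, discarding the previous one.
func (h *windowedHistogram) Tick() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.cur = prometheus.NewHistogram(h.opts)
}
```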
With this done, we can now introduce metrics with any kind of buckets we
want. Helpfully, we introduce two common kinds of buckets, suitable for
IO-type and RPC-type latencies. These are defined in a human-readable
format by explicitly listing out the buckets.
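For example, a human-readable preset could be spelled out like this; the bounds below are illustrative, not the actual IOLatencyBuckets or NetworkLatencyBuckets values:

```go
package metric

import "time"

// exampleLatencyBuckets is illustrative only; bounds are in nanoseconds
// so recorded time.Duration values can be observed directly as float64s.
var exampleLatencyBuckets = []float64{
	float64(500 * time.Microsecond), // 500µs
	float64(1 * time.Millisecond),   // 1ms
	float64(5 * time.Millisecond),   // 5ms
	float64(25 * time.Millisecond),  // 25ms
	float64(100 * time.Millisecond), // 100ms
	float64(1 * time.Second),        // 1s
	float64(10 * time.Second),       // 10s
}
```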
We can move existing metrics to HistogramV2 easily, assuming we are not
concerned with existing prometheus scrapers getting tripped up by the
changes in bucket boundaries. I assume this is not a big deal in
practice as long as it doesn't happen "all the time". In fact, I would
strongly suggest we move all metrics wholesale and remove the
hdrhistogram-based implementation. If this is not acceptable for some
reason, we ought to at least deprecate it.
We also slightly improve the existing Histogram code by unifying how
the windowed histograms are ticked and by making explicit where their
quantiles are recorded (this dependency was previously hidden via a
local interface assertion).
Resolves #10015.
Resolves #64962.
Alternative to dhartunian@eac3d06
Release justification: low risk, high benefit changes