
Adding grpc_code label to prometheus grpc_server_handling_seconds metric #606

Open · jon-whit opened this issue Jul 28, 2023 · 7 comments

@jon-whit commented Jul 28, 2023

I'm using the prometheus middleware to measure the RPC endpoint latency for my gRPC service. I've enabled the ServerHandlingTimeHistogram.

I noticed that the grpc_code label is not included in the observed _bucket metrics (see this line of code). Consequently, I cannot effectively measure my p99 request latency for requests served successfully (e.g. where grpc_code="OK"). I'd like to set SLOs based on percentiles of successful requests, but as far as I can tell this middleware doesn't currently allow that. Was this design choice intentional, and if so, why?
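
For reference, my setup is roughly the following (a minimal sketch assuming the v2 providers/prometheus API; exact option names may vary). The resulting grpc_server_handling_seconds_bucket series carry grpc_type, grpc_service, and grpc_method labels, but no grpc_code:

import (
	grpcprom "github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus"
	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

func newServer(reg *prometheus.Registry) *grpc.Server {
	// Enable the handling-time histogram on the server metrics.
	srvMetrics := grpcprom.NewServerMetrics(
		grpcprom.WithServerHandlingTimeHistogram(),
	)
	reg.MustRegister(srvMetrics)

	// Register the interceptors so every RPC is observed.
	return grpc.NewServer(
		grpc.ChainUnaryInterceptor(srvMetrics.UnaryServerInterceptor()),
		grpc.ChainStreamInterceptor(srvMetrics.StreamServerInterceptor()),
	)
}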

Would you be open to me submitting a PR to add it?

@jon-whit (Author)

Related to grpc-ecosystem/go-grpc-prometheus#75, but reviving since that repository is now deprecated.

@johanbrandhorst (Collaborator)

Sounds like the consensus in the other thread is that adding grpc_code would be too much, but that having a success/failure signal would be nice. I don't know enough about the prometheus library to say whether this is possible; perhaps @bwplotka can weigh in here too.

@bwplotka (Collaborator)

Yea, I think with native histograms it would be an easier decision. I wonder if it makes sense to be an early adopter of native histograms here for this reason. cc @beorn7

Otherwise we could add an opt-in grpc_code label; that's also fine.

@jon-whit (Author) commented Aug 2, 2023

> Sounds like the consensus in the other thread is that adding grpc_code would be too much, but that having a success/failure signal would be nice. I don't know enough about the prometheus library to say whether this is possible; perhaps @bwplotka can weigh in here too.

@johanbrandhorst I'm not sure I agree with that conclusion. I wouldn't say any consensus was reached; there are still community members requesting it, and the "consensus" was merely the maintainers' response. In any case, why not allow the client to choose whether to include it? For me, the increase in cardinality is well worth the value I get from it. I should be able to enable this label if I want, understanding that there is a tradeoff. For example, I should be able to write:

var serverMetrics *grpc_prometheus.ServerMetrics

serverMetrics.EnableHandlingTimeHistogram(
    ...,
    WithStatusCodeLabel(),
)  

If the WithStatusCodeLabel option is provided to the handling-time histogram, then the grpc_code label is emitted and a value is recorded for it in each observation. The default would be to omit the grpc_code label, which keeps behavior equivalent to today's standard, but if you want the label and are willing to accept the increased cardinality, that should be the developer's choice.
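
As a rough sketch of what the opt-in could record internally (the names and wiring here are hypothetical, only to illustrate the shape of the change): the histogram gains a grpc_code label, and the interceptor observes the handling time with the final status code once the handler returns.

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc/status"
)

// Hypothetical: the same histogram, but with grpc_code in the label set.
var handlingSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "grpc_server_handling_seconds",
	Help:    "Histogram of gRPC handling time, labeled by final status code.",
	Buckets: prometheus.DefBuckets,
}, []string{"grpc_type", "grpc_service", "grpc_method", "grpc_code"})

// observe would run in the interceptor after the handler returns.
func observe(start time.Time, service, method string, err error) {
	code := status.Code(err) // codes.OK when err == nil
	handlingSeconds.
		WithLabelValues("unary", service, method, code.String()).
		Observe(time.Since(start).Seconds())
}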

This work ☝️ could be done in addition to switching to a native histogram, which would further reduce the impact/burden of enabling the grpc_code label.

If I submit a PR, will you guys consider merging it? If so, I'd be happy to submit one this week.

@beorn7 commented Aug 8, 2023

It would be great to offer native histogram support, but I would recommend also supporting classic histograms in parallel, so that scrapers that aren't ready for native histograms yet still get a classic histogram.
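
For illustration, a sketch using client_golang directly (wiring this into the middleware's histogram options is a separate question): setting both classic Buckets and NativeHistogramBucketFactor makes the client track the classic buckets alongside the native histogram, so scrapers that aren't native-histogram-ready still get the classic series.

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Sketch: classic and native representations configured in parallel.
var handlingHist = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:                            "grpc_server_handling_seconds",
	Help:                            "Histogram of gRPC handling time.",
	Buckets:                         prometheus.DefBuckets, // classic buckets
	NativeHistogramBucketFactor:     1.1,                   // enables the native histogram
	NativeHistogramMaxBucketNumber:  100,
	NativeHistogramMinResetDuration: time.Hour,
})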

@johanbrandhorst (Collaborator)

There still seems to be interest from the community in adding this (see #709). I'm not sure there was a conclusion from this discussion; I'll defer to you on this, @bwplotka.

@Tatskaari commented Jul 16, 2024

Hey! I'd also find this metric useful because we're trying to measure our SLO burn rate and would like to set a success criterion based on both success/failure and latency. If we used two separate counters, we'd have to double-count requests that are both late and errored.

I think it would be best to leave it up to the end users to decide whether they want to keep this label: anyone who doesn't can drop it at scrape time.
