kminion fails to export metrics when a broker is stopping, starting or restarting #253
Comments
Hey, are you using the endToEnd reporting? If so, I believe I'm already looking into this (have not yet found a proper solution though, here's the draft PR: #252). |
We are not using end-to-end reporting. We only scrape the consumer lag metrics, using adminAPI/offsetsTopic scraping. The behavior is the same with both scraping modes. |
Okay I need more details then, ideally instructions to reproduce this. This should definitely not happen. Any suspicious log messages? |
So far, we haven't seen any suspicious log messages. This can be reproduced by,
This can be reproduced by just restarting the kafka process too. |
A few more details from the kminion logs captured while the Kafka service on one of the brokers was stopped and started. The scraping mode is
# kminion logs while broker a1 was being stopped @15:19
# kminion logs after broker a1 was started @15:23
|
Further debugging shows that kminion does expose the metrics every time we query it. While a broker is being stopped, however, kminion takes longer to generate them, most likely because of the failing connection to that broker. The typical response time of 6-7 seconds increases to 15-20 seconds, which exceeds the scrape timeout on our Prometheus servers, and that is what causes the missing metrics. The question is whether the failing broker connections can be handled in a way that does not increase the response time. |
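To illustrate the kind of handling asked about here, below is a minimal Go sketch of one general approach: give each per-broker request its own deadline so that a single unresponsive broker cannot stretch the whole scrape past the Prometheus scrape timeout. This is not kminion's actual code. The names `scrapeAll`, `fetchFunc`, `demoFetch`, `runDemo`, and the timeout values are all hypothetical, and the broker request is simulated with a timer rather than a real Kafka client.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// brokerResult holds the outcome of one per-broker request.
type brokerResult struct {
	Broker string
	Value  int
	Err    error
}

// fetchFunc is a hypothetical stand-in for a per-broker request.
// It must honor cancellation of ctx for the deadline to take effect.
type fetchFunc func(ctx context.Context, broker string) (int, error)

// scrapeAll queries all brokers concurrently. Every request gets its own
// deadline, so a slow broker yields an error for that broker only and the
// total duration stays close to perBroker instead of growing with the
// slowest connection.
func scrapeAll(ctx context.Context, brokers []string, fetch fetchFunc, perBroker time.Duration) []brokerResult {
	results := make([]brokerResult, len(brokers))
	var wg sync.WaitGroup
	for i, b := range brokers {
		wg.Add(1)
		go func(i int, b string) {
			defer wg.Done()
			reqCtx, cancel := context.WithTimeout(ctx, perBroker)
			defer cancel()
			v, err := fetch(reqCtx, b)
			results[i] = brokerResult{Broker: b, Value: v, Err: err}
		}(i, b)
	}
	wg.Wait()
	return results
}

// Assumed values for the demo only; real values would depend on the
// configured Prometheus scrape timeout.
const (
	demoTimeout = 100 * time.Millisecond
	demoBudget  = 2 * time.Second
)

// demoFetch simulates brokers: "a1" behaves like a broker that is shutting
// down and never answers in time, all others answer quickly.
func demoFetch(ctx context.Context, broker string) (int, error) {
	delay := 10 * time.Millisecond
	if broker == "a1" {
		delay = 5 * time.Second
	}
	select {
	case <-time.After(delay):
		return 42, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}

// runDemo scrapes three simulated brokers and reports how long it took.
func runDemo() ([]brokerResult, time.Duration) {
	start := time.Now()
	res := scrapeAll(context.Background(), []string{"a1", "a2", "a3"}, demoFetch, demoTimeout)
	return res, time.Since(start)
}

func main() {
	results, elapsed := runDemo()
	for _, r := range results {
		switch {
		case errors.Is(r.Err, context.DeadlineExceeded):
			fmt.Printf("%s: skipped (timed out)\n", r.Broker)
		case r.Err != nil:
			fmt.Printf("%s: error: %v\n", r.Broker, r.Err)
		default:
			fmt.Printf("%s: value=%d\n", r.Broker, r.Value)
		}
	}
	fmt.Printf("finished within budget of %v: %v\n", demoBudget, elapsed < demoBudget)
}
```

The trade-off in this sketch is that metrics depending on the unreachable broker are missing from that scrape, while the rest are still served in time. Whether partial results are acceptable, and where such a deadline would belong in kminion itself, is for the maintainers to judge.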
We have observed that kminion fails to report metrics while a Kafka broker is starting, stopping, or restarting.
Once the Kafka process has finished stopping or starting, kminion reports the metrics again as expected.