Metrics to identify stuck jobs #567
Comments
@ns-gsa Do you think a mixture of https://javadoc.jenkins.io/hudson/model/Job.html#isLogUpdated() and https://javadoc.jenkins-ci.org/hudson/model/Executor.html#isLikelyStuck() could give you what you need?
@Waschndolos - The method names and their short descriptions sound promising, but I am not sure after looking at
@ns-gsa I think I could only provide metrics with these methods. Maybe also a metric like "this job takes longer than usual" or something similar, but that might not even be feasible for Jenkins instances with a huge number of jobs. Should I provide that?
@Waschndolos - Apologies for the delay in getting back. Yes, can we have metric(s) for the average time taken by a job, as well as the time the currently running job has taken so far, so that we can write all sorts of alert expressions like
Also, can we have a metric using isLogUpdated() and isLikelyStuck() for identifying stuck jobs? I can provide feedback on how this works after observing the metric's behavior for a while in our production setup.
@ns-gsa I'll test the PR tomorrow in my company's test Jenkins if I get the time.
Memo to me:
What feature do you want to see added?
We would like a metric or metrics that would help us identify a job that is stuck.
We use Jenkins jobs to run Ansible playbooks over a large number of hosts. At some sites a job might take a long time to run because the underlying hosts are slow, and sometimes a job is simply stuck.
So the way we identify this today is by looking at the job console output log to see if there is any recent progress in the log. If there are no new log lines updated in the last 10 minutes or so, then we know that we need to diagnose that job further.
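The manual check described here could be sketched roughly as follows. This is only an illustration, assuming the console log is a file on disk whose modification time advances whenever a line is appended. The function names and the exact 600-second threshold are hypothetical and are not part of Jenkins or the plugin.

```python
import os
import time

# Assumption: "10 minutes or so" from the description, expressed in seconds.
STALE_AFTER_SECONDS = 10 * 60


def is_stale(last_update_epoch: float, now_epoch: float,
             threshold_seconds: float = STALE_AFTER_SECONDS) -> bool:
    """Return True if more than threshold_seconds passed since the last update."""
    return (now_epoch - last_update_epoch) > threshold_seconds


def log_looks_stale(log_path: str,
                    threshold_seconds: float = STALE_AFTER_SECONDS) -> bool:
    """Apply is_stale to the modification time of a log file (hypothetical helper)."""
    return is_stale(os.path.getmtime(log_path), time.time(), threshold_seconds)
```

A stale log is only a hint that the job needs a closer look: a step that legitimately runs for a long time without printing anything would also trip this check.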
Since we have a very large and ever-increasing number of sites and jobs, it is not always possible to eyeball job logs. We need some form of metric to identify a stuck job, so that alerting integrations can notify engineers.
We can use default_jenkins_builds_duration_milliseconds_summary_count and default_jenkins_builds_duration_milliseconds_summary_sum to find the average runtime of a job and compare it with default_jenkins_builds_running_build_duration_milliseconds to know whether a job has exceeded the average time, but that won't necessarily mean that the job is stuck.
Upstream changes
No response
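For reference, the average-runtime comparison mentioned in the feature description amounts to the arithmetic below. This is a sketch only: the inputs are assumed to be values already read from the three metrics named in the description, and the function names and the `factor` parameter are hypothetical.

```python
from typing import Optional


def average_duration_ms(summary_sum_ms: float, summary_count: float) -> Optional[float]:
    """Average build duration: ..._summary_sum divided by ..._summary_count.

    Returns None when no completed builds have been recorded yet.
    """
    if summary_count <= 0:
        return None
    return summary_sum_ms / summary_count


def exceeds_average(running_ms: float, summary_sum_ms: float,
                    summary_count: float, factor: float = 1.0) -> bool:
    """True if the running build has taken longer than factor times the average.

    running_ms is assumed to come from
    default_jenkins_builds_running_build_duration_milliseconds.
    """
    avg = average_duration_ms(summary_sum_ms, summary_count)
    if avg is None:
        return False  # no history, so there is nothing to compare against
    return running_ms > factor * avg
```

As the description notes, exceeding the average does not by itself mean the job is stuck, so a result of True here is best treated as one signal among several.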