Metrics to identify stuck jobs #567

Closed
ns-gsa opened this issue Oct 6, 2023 · 6 comments · Fixed by #592

Comments

ns-gsa commented Oct 6, 2023

What feature do you want to see added?

We would like a metric or metrics that would help us identify a job that is stuck.

We use Jenkins jobs to run Ansible playbooks over a large number of hosts. At some sites a job might take a long time to run because the underlying hosts are slow, and sometimes a job is simply stuck.

Today we identify this by looking at the job's console output log to see whether there has been any recent progress. If no new log lines have appeared in the last 10 minutes or so, we know that we need to diagnose that job further.
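
As a rough illustration of the manual check described above, the following Python sketch flags a build whose console log has not changed within a threshold. The function name `is_log_stale`, its parameters, and the fixed timestamps are hypothetical and only for illustration; how the last-update time is obtained from Jenkins is left out.

```python
from datetime import datetime, timedelta, timezone


def is_log_stale(last_log_update, now, max_idle=timedelta(minutes=10)):
    """Return True if the log has not been updated within max_idle.

    last_log_update and now are timezone-aware datetimes. The 10 minute
    default mirrors the rule of thumb from the issue text; it is an
    assumption, not a value defined by Jenkins or the plugin.
    """
    return now - last_log_update > max_idle


# Hypothetical example values, fixed so the result does not depend on the clock.
now = datetime(2023, 10, 6, 12, 0, tzinfo=timezone.utc)
recent = now - timedelta(minutes=3)    # log written 3 minutes ago
silent = now - timedelta(minutes=25)   # no log output for 25 minutes

print(is_log_stale(recent, now))
print(is_log_stale(silent, now))
```

A stale log is only a hint that a build needs a closer look, since a long-running step can legitimately produce no output for a while.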

Since we have a huge and ever-increasing number of sites and jobs, it is not always possible to eyeball job logs. We need some form of metric to identify a stuck job, so that we can set up alerting integrations to notify engineers.

We can use default_jenkins_builds_duration_milliseconds_summary_count and default_jenkins_builds_duration_milliseconds_summary_sum to find the average runtime of a job and compare it with default_jenkins_builds_running_build_duration_milliseconds to see whether a job has exceeded its average time, but that won't necessarily mean that the job is stuck.

Upstream changes

No response

@Waschndolos

@ns-gsa I think I could only provide metrics based on these methods. I could maybe also add a metric like "this job takes longer than usual" or something similar, but that might not even be feasible for Jenkins instances with a huge number of jobs. Should I provide that?

ns-gsa commented Oct 17, 2023

@Waschndolos - Apologies for my delay in getting back.

Yes, can we have metric(s) that expose the average time taken for a job as well as the time the currently running build has taken, so that we can write all sorts of alert expressions, like:

  1. current build time more than avg time
  2. current build time more than avg time by X%, etc.
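
The two alert conditions listed above can be sketched in Python as follows. This assumes the inputs are the scraped values of the summary sum, the summary count, and the running build duration mentioned earlier in the issue; the function names and the `margin_percent` parameter are hypothetical and not part of the plugin.

```python
def average_duration_ms(duration_sum_ms, build_count):
    """Average build duration from a summary's sum and count.

    Returns None when no build has completed yet, so callers do not
    divide by zero.
    """
    if build_count <= 0:
        return None
    return duration_sum_ms / build_count


def exceeds_average(running_ms, duration_sum_ms, build_count, margin_percent=0.0):
    """Return True if the running build is slower than average by more than margin_percent.

    margin_percent=0 corresponds to condition 1 in the list above,
    margin_percent=X to condition 2.
    """
    avg = average_duration_ms(duration_sum_ms, build_count)
    if avg is None:
        return False  # no history yet, so nothing to compare against
    return running_ms > avg * (1 + margin_percent / 100.0)


# Hypothetical sample: 6 finished builds totalling 3,600,000 ms (average 600,000 ms).
print(exceeds_average(700_000, 3_600_000, 6))                     # condition 1
print(exceeds_average(700_000, 3_600_000, 6, margin_percent=20))  # condition 2, X = 20
```

As the original post notes, exceeding the average only shows that a build is slow, not that it is stuck, so a check like this works best alongside a signal about log activity.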

Also, can we have a metric using isLogUpdated() and isLikelyStuck() for identifying stuck jobs? I can provide feedback on how well this works after observing the metric's behavior for a while in our production setup.

Waschndolos linked a pull request Oct 25, 2023 that will close this issue

@Waschndolos

@ns-gsa I'll test the PR tomorrow in my company's test Jenkins if I get the time.

@Waschndolos

Memo to me:

  • Only provide default_jenkins_builds_job_log_updated for running jobs
  • Rename default_jenkins_likely_stuck and insert job name
