Metrics to identify stuck jobs #567

ns-gsa · 2023-10-06T06:02:02Z

What feature do you want to see added?

We would like a metric or metrics that would help us identify a job that is stuck.

We use Jenkins jobs to run Ansible playbooks over a large number of hosts. In some sites a job might take a long time to run because the underlying hosts are slow, and sometimes a job is simply stuck.

So the way we identify this today is by looking at the job console output log to see if there is any recent progress in the log. If there are no new log lines updated in the last 10 minutes or so, then we know that we need to diagnose that job further.
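For illustration, the manual check described above could be approximated by a small script. This is only a sketch under assumptions: it treats the last-modified time of a console log file as a stand-in for "new log lines", and the function names, the 10-minute default, and any log path passed in are hypothetical rather than anything provided by Jenkins or the plugin.

```python
import os
import time

# Assumed threshold, taken from the "10 minutes or so" rule of thumb above.
STALE_AFTER_SECONDS = 10 * 60


def seconds_since_last_write(log_path, now=None):
    """Return how many seconds ago the log file was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path)


def looks_stalled(log_path, threshold=STALE_AFTER_SECONDS, now=None):
    """True when the log has not been written to for longer than `threshold`."""
    return seconds_since_last_write(log_path, now) > threshold
```

A result of True only means the job deserves a closer look; a slow host can also leave the log quiet for a while.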

Since we operate a large and ever-increasing number of sites and jobs, it is not always possible to eyeball job logs. We need some form of metric that identifies a stuck job, so that we can set up alerting integrations to notify engineers.

We can use default_jenkins_builds_duration_milliseconds_summary_count and default_jenkins_builds_duration_milliseconds_summary_sum to find the average runtime of a job and compare it with default_jenkins_builds_running_build_duration_milliseconds to know whether a job has exceeded the average time, but that won't necessarily mean that the job is stuck.
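As a rough sketch of that comparison: the average is the summary sum divided by the summary count, and the running duration is checked against it. The function names below are hypothetical, and the inputs are assumed to be values already scraped from the three metrics named above for a single job.

```python
def average_duration_ms(summary_sum, summary_count):
    """Average build duration in milliseconds, or None if no builds finished."""
    if summary_count <= 0:
        return None
    return summary_sum / summary_count


def exceeds_average(running_ms, summary_sum, summary_count):
    """True when the running build has already taken longer than the average."""
    avg = average_duration_ms(summary_sum, summary_count)
    if avg is None:
        # No history yet, so there is nothing to compare against.
        return False
    return running_ms > avg
```

As noted, exceeding the average only shows that a build is slow, not that it is stuck.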

Upstream changes

No response


Waschndolos · 2023-10-08T06:06:57Z

@ns-gsa Do you think a mixture of https://javadoc.jenkins.io/hudson/model/Job.html#isLogUpdated() and https://javadoc.jenkins-ci.org/hudson/model/Executor.html#isLikelyStuck() could give you what you need?

ns-gsa · 2023-10-09T04:29:46Z

@Waschndolos - the method names and the short descriptions for those sound promising

But I am not sure after looking at

Waschndolos · 2023-10-09T09:25:39Z

@ns-gsa I think I could only provide metrics based on these methods. Maybe also a metric like "this job takes longer than usual" or something similar, but that might not even be feasible for Jenkins instances with a huge number of jobs. Should I provide that?

ns-gsa · 2023-10-17T21:43:16Z

@Waschndolos - Apologies for my delay in getting back.

Yes, can we have metric(s) that give the average time taken for a job as well as the time the currently running build has taken, so that we can write all sorts of alert expressions, like:

current build time more than avg time
current build time more than avg time by X%, etc.
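To make the "by X%" condition concrete, here is a minimal sketch of how such a rule could be evaluated once both values are available. The function names and the example percentage are hypothetical; in practice the same rule would more likely live in an alerting expression than in a script.

```python
def overrun_percent(current_ms, avg_ms):
    """How far the current build time is above the average, in percent.

    Negative values mean the build is still below the average.
    """
    if avg_ms <= 0:
        raise ValueError("average duration must be positive")
    return (current_ms - avg_ms) / avg_ms * 100.0


def exceeds_by(current_ms, avg_ms, x_percent):
    """True when the current build time is more than x_percent above average."""
    return overrun_percent(current_ms, avg_ms) > x_percent
```

For example, exceeds_by(current_ms, avg_ms, 0) covers the first alert condition, and a positive x_percent covers the second.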

Also, can we have a metric using isLogUpdated() and isLikelyStuck() for identifying stuck jobs? I can provide feedback on how this works after observing the metric's behavior for a while in our production setup.

Waschndolos · 2023-10-25T16:16:00Z

@ns-gsa I'll test the PR tomorrow in my company's test Jenkins if I get the time.

Waschndolos · 2023-10-26T06:01:11Z

Memo to me:

Only provide default_jenkins_builds_job_log_updated for running jobs
Rename default_jenkins_likely_stuck and insert job name

ns-gsa added the enhancement label Oct 6, 2023

Waschndolos mentioned this issue Oct 25, 2023

Feature/567 metrics to identify stuck jobs #582

Closed

3 tasks

Waschndolos linked a pull request Oct 25, 2023 that will close this issue

Feature/567 metrics to identify stuck jobs #582

Closed

3 tasks

Waschndolos removed a link to a pull request Nov 26, 2023

Feature/567 metrics to identify stuck jobs #582

Closed

3 tasks

Waschndolos mentioned this issue Nov 26, 2023

Creating Build Metrics for log is updated and build is likely stuck #592

Merged

3 tasks

Waschndolos linked a pull request Nov 26, 2023 that will close this issue

Creating Build Metrics for log is updated and build is likely stuck #592

Merged

3 tasks

Waschndolos closed this as completed in #592 Nov 26, 2023
