need a metric to see if overlap is being hit #18055

Open
shantanugadgil opened this issue Jul 25, 2023 · 2 comments
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/batch (Issues related to batch jobs and scheduling), theme/job-summary, theme/metrics, type/enhancement

Comments

@shantanugadgil
Contributor

Proposal

For periodic jobs that have prohibit_overlap set to true, we need a way to detect that the overlap threshold is being hit.
With such a metric, it would be easy to detect whether the configured schedule is too frequent relative to the runtime of the job.
Also, if the threshold is being hit often, it could indicate that something in the job itself has changed for the worse.
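
To make concrete what such a metric would count, here is a minimal Python sketch that simulates a fixed-interval schedule with overlap prohibited and tallies the launches that would be skipped because the previous run is still going. The function name, the fixed-interval model, and the "skip the tick" behaviour are assumptions for illustration only; they are not Nomad's actual dispatcher code or an existing metric.

```python
from typing import Sequence


def count_skipped_launches(
    interval_s: float, run_durations_s: Sequence[float], n_ticks: int
) -> int:
    """Count scheduled launches skipped because the previous run is still active.

    Hypothetical model: ticks fire at 0, interval_s, 2 * interval_s, ...
    A tick that fires while a run is active is skipped (overlap prohibited).
    Each tick that does launch consumes the next duration in run_durations_s.
    The simulation stops early if the durations run out.
    """
    skipped = 0
    busy_until = 0.0  # time at which the currently running instance finishes
    durations = iter(run_durations_s)
    for tick in range(n_ticks):
        now = tick * interval_s
        if now < busy_until:
            # Previous instance still running: this launch would be skipped.
            skipped += 1
            continue
        try:
            busy_until = now + next(durations)
        except StopIteration:
            break  # no more run durations to simulate
    return skipped
```

A count that keeps rising for one job would suggest that its schedule is too tight for its runtime, which is the signal the proposal asks for.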

Use-cases

If there were an explicit metric indicating that job FOO was hitting the threshold often, we could lengthen its schedule interval and set up an alert in a monitoring system such as DataDog.

Attempted Solutions

Currently we monitor pending allocations, but we cannot pinpoint whether an allocation is pending due to insufficient resources or because certain jobs are hitting their overlap threshold.

@jrasell
Member

jrasell commented Jul 25, 2023

Hi @shantanugadgil and thanks for taking the time to write up this issue. I think it would be a good addition, although we probably want to take a minute to think about exactly what labels to add, if any, to the data points.

I wonder if it would currently be possible to alert on periodic jobs which run for longer than expected, as a way to mitigate this pending any work on this request.

@jrasell added the theme/metrics, theme/job-summary, theme/batch, and stage/accepted labels on Jul 25, 2023
@shantanugadgil
Contributor Author

I hadn't given any thought to what the label(s) might be named, and nothing much comes to mind immediately. Something like overlap_threshold_crossed, maybe? A follow-on requirement could be to record the amount of time by which the job crossed the threshold (though I think that can be worked out from the timestamps themselves; not sure).
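
As a rough illustration of deriving the overrun from timestamps alone, the sketch below computes how far a run extended past the next scheduled launch. It assumes a fixed launch interval rather than an arbitrary cron expression, and that the run started at its scheduled time; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def overlap_overrun(
    started_at: datetime, finished_at: datetime, interval: timedelta
) -> timedelta:
    """Return how long a run continued past the next scheduled launch.

    Assumption: launches are spaced by a fixed interval and this run
    started exactly on schedule. Returns zero if it finished in time.
    """
    next_launch = started_at + interval
    return max(finished_at - next_launch, timedelta(0))
```

For a job launched every five minutes, a run that takes seven minutes would give an overrun of two minutes.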

> I wonder if it would currently be possible to alert on periodic jobs which run for longer than expected, as a way to mitigate this pending any work on this request.

This, I think, is a separate problem about batch jobs not having a timeout, correct?

We are currently mitigating this by running a loop that checks which periodic jobs "appear stuck":
#1782 (comment)
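
The loop referenced above is not reproduced here. As a minimal sketch of the general idea, the Python below flags running instances whose age exceeds a chosen maximum runtime. The RunningInstance record and find_stuck function are hypothetical; how the list of running instances is gathered (for example from the Nomad HTTP API or CLI) is left out and would need to be filled in.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass
class RunningInstance:
    """Hypothetical record describing one running periodic job instance."""
    job_id: str
    started_at: datetime  # timezone-aware start time


def find_stuck(
    instances: Iterable[RunningInstance],
    max_runtime: timedelta,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the IDs of instances that have run longer than max_runtime."""
    now = now or datetime.now(timezone.utc)
    return [i.job_id for i in instances if now - i.started_at > max_runtime]
```

Running this on a timer and alerting on a non-empty result approximates the "appear stuck" check, although it cannot tell a slow job apart from a hung one.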

Projects
Status: Needs Roadmapping