[feature] Timeout for batch jobs #1782
Comments
Hey Sheldon, you could accomplish this yourself by putting a little script in between Nomad and what you actually want to run: the script waits until either the task finishes or the timeout elapses, and then returns.
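For anyone looking for a concrete starting point, here is a minimal sketch of such a wrapper, assuming Python is available inside the task; the script name and the 124 exit code (borrowed from the GNU `timeout(1)` convention) are illustrative, not anything Nomad provides or interprets.

```python
#!/usr/bin/env python3
"""Run a command, but kill it if it exceeds a timeout.

Usage (illustrative): with_timeout.py <timeout_seconds> <command> [args...]
"""
import subprocess
import sys

def main() -> int:
    timeout_seconds = int(sys.argv[1])
    command = sys.argv[2:]
    try:
        # subprocess.run() kills the child itself when the timeout expires.
        return subprocess.run(command, timeout=timeout_seconds).returncode
    except subprocess.TimeoutExpired:
        print(f"task exceeded {timeout_seconds}s and was killed", file=sys.stderr)
        return 124  # same convention as GNU timeout(1)

if __name__ == "__main__":
    sys.exit(main())
```

The task in the job spec would then invoke this wrapper in place of the real command, so the allocation terminates even if the wrapped program hangs.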
That's how I'm handling it right now, but I was thinking it would be cool if Nomad could do it. I understand if it seems like bloat though :)
+1 - important feature for batch runs. Not so clean to handle this ourselves.
I think this function should be available not only for batch jobs but also for regular services; that would help us implement a "chaos monkey" function right inside Nomad. It would increase system stability, because the system would have to be ready for downtime of any service.
I think it would be better to add "max_lifetime" and allow specifying it as either a range or a concrete value. For example, 10h-20h would mean the daemon might be killed after 11h or after 19h, but never later than 20h. Implementing chaos monkey this way would be a great feature in my opinion, and you wouldn't need any 3rd-party apps =)
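To make the proposal concrete, here is a sketch of how such a range could be interpreted; the `max_lifetime` name and the `10h-20h` syntax come from the suggestion above and are not existing Nomad options.

```python
import random
import re

def pick_deadline_seconds(spec: str) -> int:
    """Turn a hypothetical lifetime spec like "10h-20h" (or just "15h") into a
    concrete kill deadline, chosen uniformly at random inside the range."""
    match = re.fullmatch(r"(\d+)h(?:-(\d+)h)?", spec)
    if not match:
        raise ValueError(f"unsupported lifetime spec: {spec!r}")
    low_h = int(match.group(1))
    high_h = int(match.group(2)) if match.group(2) else low_h
    return random.randint(low_h * 3600, high_h * 3600)

# pick_deadline_seconds("10h-20h") -> somewhere between 36000 and 72000 seconds
```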
If a timeout function is implemented, it can be used to mimic classic HPC schedulers like PBS, TORQUE, SGE, etc. Having it as a first-class feature would indeed be useful for many folks, including me! Thanks and regards.
Just adding a use case here. Let's say I have an app and implement a timeout inside the app. Assume there's a bug in this app that causes it to hang occasionally under certain conditions, so it never reaches its hard-coded "timeout" because it has essentially stopped responding. We should have a way in the infrastructure to automate a relatively simple, drop-dead deadline where the scheduler will kill a task that's unresponsive. Nomad is better equipped to provide that failsafe protection at the infrastructure level than rewriting timeout code in every app, simply because it doesn't rely on the app's runtime to perform the kill.
I agree, but as a temporary workaround, how about a timeout wrapper combined with the kill_timeout parameter?
+1 for this, it's basic functionality for a job scheduler. Amazing this doesn't exist. @mlehner616 is obviously correct about why having the timeout checker inside the container itself is a boneheaded recommendation. We got bit by 3 hung jobs out of 100,000 that prevented our elastic infrastructure from scaling back down, costing a nice chunk of change.
@Miserlou as mentioned earlier in this thread, a workaround would be to wrap your app in a timeout script. There is an example of how you can do it above. That might save your bacon in the scenario you described.
Timeout for batch jobs is an important safeguard. We can't rely on jobs' good behaviour... A job without a time limit is effectively a service, hence a timeout is crucial to constrain buggy tasks that might run for too long...
I'd also very much like to see Nomad implement this, for the use case where Nomad's parameterized jobs are used as a bulk task processing system, similar to the workflow described here: https://www.hashicorp.com/blog/replacing-queues-with-nomad-dispatch

There are major advantages for us in this workflow, as it takes advantage of infrastructure already in place to handle autoscaling, rather than having to set up a new system using Celery or similar task queues. The lack of a built-in timeout mechanism for batch jobs makes the infrastructure required for this fairly common (afaik) use case quite a bit more complex.

Handling the timeout in the tasks themselves is not a safe approach, for the reasons mentioned above, and it would also increase the complexity of individual tasks, which is not ideal. Therefore the dispatcher must manage the timeout itself and kill batch jobs once it has been reached. This makes it inconvenient to manage jobs that need different timeouts from a single bulk task management system, as the configuration for these needs to be stored centrally, separate from the actual job specification. There are workarounds for this, but it would be very nice to see Nomad itself handle timeouts, both for safety and to simplify using Nomad.
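For reference, the dispatcher-side workaround described above boils down to something like the sketch below, using the documented dispatch and deregister endpoints of the HTTP API. The address, deadline bookkeeping, and function names are purely illustrative.

```python
import time
import requests  # assumes the requests library is available

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumed local Nomad API address

def dispatch_with_deadline(parameterized_job: str, meta: dict, timeout_s: float):
    """Dispatch a parameterized batch job and return (dispatched job ID, deadline)."""
    resp = requests.post(f"{NOMAD_ADDR}/v1/job/{parameterized_job}/dispatch",
                         json={"Meta": meta})
    resp.raise_for_status()
    return resp.json()["DispatchedJobID"], time.time() + timeout_s

def stop_if_expired(dispatched_id: str, deadline: float) -> bool:
    """Stop (deregister) the dispatched job if it has outlived its deadline."""
    if time.time() < deadline:
        return False
    requests.delete(f"{NOMAD_ADDR}/v1/job/{dispatched_id}").raise_for_status()
    return True
```

Which illustrates the point of the comment above: all of this state lives outside the job specification, which is exactly what a native timeout would avoid.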
@magical-chicken - I strongly, strongly recommend you avoid using Nomad for that purpose. There are claims made in that blog post which are simply untrue, and many people are being duped by it. See more here:
@Miserlou Thanks for the heads up, that is a pretty serious bug in nomad, and is pretty concerning since we have a large amount of infrastructure managed by it. The volume of dispatches we are handling currently isn't too high, so I'm hoping nomad will be ok to use here in the short term, but long term I will definitely consider switching this system over to a dedicated task queue.
+1. Are there any plans to include this feature any time soon? Seems pretty important. Wrapping tasks in a timeout script is a bit hacky.
+1
+1
A job run limit is an essential feature of a batch scheduler. All major batch schedulers (PBS, Slurm, LSF, etc.) have this capability. I've seen growing interest in a tool like Nomad, something that combines many of the features of a traditional batch scheduler with Kubernetes. But without a run-time limit feature, integration into a traditional batch environment would be next to impossible. Is there any timeline on adding this feature to Nomad?
+1
@karlem you should add a +1 to the first post rather than in a separate message. If you know more folks who might be interested in this, you should encourage them to do so as well! 😉
The absence of this feature just killed my cluster. A curl periodic job piled up to 600+ pending instances and tens running. This caused very high disk I/O from Nomad and effectively rendered the affected nodes totally unresponsive. Then Consul decided to stop working as well, because of I/O timeouts from the other nodes. Of course you could argue that curl has built-in timeout options; the point is that if a task scheduler does not provide this feature, there is no simple and unified way to keep all jobs organised and safe when they can decide on their own how long they want to run.
GitHub Actions self-hosted runners with autoscaling are another good example, I think. It's very much possible to run runners on Nomad as batch jobs and autoscale them using parameterized batch jobs, so tasks can be easily dispatched and can be triggered by GitHub webhooks upon receiving queued events. Having max-lifetime support would be a great safeguard for such dynamic job scheduling integrated with third-party systems.
+1
While a safety catch (timeout) is definitely a gap in the product, I don't think it captures the use case I'm looking for in #15011. I am looking for a stanza to run my type=service job M-F, from 08:00-20:00, with a user-defined stop command when a driver supports it.
Today I hit this for Docker jobs. Our system is full of Docker cron jobs, and one job was stuck for 20 (twenty) days. 🥴 Without a timeout parameter, could there be some other "systematic" way to detect stuck jobs (or jobs running for too long)?
We have some Python scripts that hit the allocations API to get this type of information and trigger alerts / remediations in our monitoring system. This may be a little different using the Docker driver / your implementation, but we just reverse-sort allocations by CreateTime and look for a specific prefix (xxx-periodic), where xxx is the job name. This tells us when the last allocation happened.
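A condensed sketch of that kind of check, assuming the standard Nomad HTTP API and the `requests` library; the job prefix and the age threshold are placeholders for whatever your jobs use.

```python
import time
import requests  # assumes the requests library is available

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumed local Nomad API address
JOB_PREFIX = "xxx-periodic"           # placeholder: <job name>-periodic, as described above
MAX_AGE_SECONDS = 6 * 3600            # placeholder alert threshold

def stuck_allocations():
    """Return running allocations of the periodic job that look stuck (too old)."""
    resp = requests.get(f"{NOMAD_ADDR}/v1/allocations")
    resp.raise_for_status()
    allocs = [a for a in resp.json() if a["JobID"].startswith(JOB_PREFIX)]
    allocs.sort(key=lambda a: a["CreateTime"], reverse=True)  # newest first
    now = time.time()
    stuck = []
    for alloc in allocs:
        age_s = now - alloc["CreateTime"] / 1e9  # CreateTime is reported in nanoseconds
        if alloc["ClientStatus"] == "running" and age_s > MAX_AGE_SECONDS:
            stuck.append({"id": alloc["ID"], "job": alloc["JobID"], "age_s": int(age_s)})
    return stuck
```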
@jxgriffiths thanks for the idea ... Since your post we have been putting together a standalone Nomad
The allocation-based search was easy enough using a combination of curl, jq, date and bash. (wanted to avoid
We also ended up putting together a jobs endpoint query for figuring out
The subsequent question was how to individually tune the
What we have done for this is to add a job level
In case one has multiple groups/tasks in a batch job, one could also move the
I'm currently running a handful of periodic batch jobs. It's great that Nomad doesn't schedule another one if a current one is still running. However, I think it would be helpful to stop a batch job on timeout if it's running beyond a set time. Maybe a script could be run on timeout, or the script itself would just have to handle the signal.
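If such a timeout ever delivers a signal to the task, the batch script's side of it is small; a sketch, assuming SIGTERM is what would arrive:

```python
import signal
import sys

def handle_term(signum, frame):
    # Checkpoint or clean up here before exiting; illustrative only.
    print("received SIGTERM, shutting down", file=sys.stderr)
    sys.exit(143)  # conventional exit status for termination by SIGTERM

signal.signal(signal.SIGTERM, handle_term)

# ... long-running batch work would continue below ...
```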