
[feature] Timeout for batch jobs #1782

Open
sheldonkwok opened this issue Oct 3, 2016 · 28 comments · May be fixed by #18456

Comments

@sheldonkwok
Contributor

I'm currently running a handful of periodic batch jobs. It's great that Nomad doesn't schedule another one while a current one is still running. However, I think it would be helpful to stop a batch job if it runs beyond a set timeout. Maybe a script could be run on timeout, or the job's script would just have to handle the signal.

@dadgar
Contributor

dadgar commented Oct 4, 2016

Hey Sheldon,

You could accomplish this yourself by putting a little wrapper script in front of what you actually want to run: it waits until either the task finishes or the timeout expires, and then exits 1.
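
For illustration, a sketch of such a wrapper might look like the following (the script name, TIMEOUT_SECS and the one-second poll are arbitrary placeholders, not anything Nomad provides):

#!/usr/bin/env bash
# with-timeout.sh -- run the real task with a deadline; exit 1 on overrun.
# Usage: with-timeout.sh /path/to/real/task [args...]
TIMEOUT_SECS="${TIMEOUT_SECS:-3600}"

"$@" &                               # start the real task in the background
task_pid=$!

for ((elapsed = 0; elapsed < TIMEOUT_SECS; elapsed++)); do
  if ! kill -0 "$task_pid" 2>/dev/null; then
    wait "$task_pid"                 # task finished on its own
    exit $?                          # propagate its exit code to Nomad
  fi
  sleep 1
done

echo "task exceeded ${TIMEOUT_SECS}s, killing it" >&2
kill -TERM "$task_pid" 2>/dev/null
exit 1

In practice the coreutils timeout command mentioned further down the thread does the same job with less code.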

@sheldonkwok
Contributor Author

That's how I'm handling it right now but I was thinking it would be cool if Nomad could do it. I understand if it seems like bloat though :)

@OferE

OferE commented Feb 23, 2017

+1 - this is an important feature for batch runs. It's not so clean to have to handle it ourselves.

@dadgar dadgar added this to the near-term milestone Feb 25, 2017
@schmichael schmichael removed this from the near-term milestone Jul 31, 2017
@alxark

alxark commented Sep 1, 2017

I think this function should be available not only for batch jobs but also for regular services; that would let us implement a "chaos monkey" function right inside Nomad. It would increase system stability, because the system would always be prepared for any service going down.

@jippi
Contributor

jippi commented Sep 1, 2017

As mentioned in Gitter chat, the timeout binary in coreutils can do this inside the container if you need a fix right now.

timeout 5 /path/to/slow/command with options

@alxark

alxark commented Sep 1, 2017

I think it would be better to add "max_lifetime", with the ability to specify it either as a range or as a concrete value. For example, 10h-20h would mean the daemon might be killed after 11h or after 19h, but the maximum lifetime would be 20h. Implementing chaos monkey this way would be a great feature in my opinion, and you wouldn't need any 3rd-party apps =)
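
Until something like that exists, a wrapper can emulate the range behaviour itself; a rough sketch (MIN_SECS, MAX_SECS and the daemon path are placeholders):

MIN_SECS=36000                                       # 10h
MAX_SECS=72000                                       # 20h
LIFETIME=$(shuf -i "${MIN_SECS}-${MAX_SECS}" -n 1)   # random lifetime within the range
exec timeout --signal=TERM "${LIFETIME}s" /path/to/daemon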

@shantanugadgil
Contributor

If a timeout function is implemented, it could be used to mimic the classic HPC schedulers like PBS, TORQUE, SGE, etc.

Having it as a first-class feature would indeed be useful for many folks, including me!
I hope this does get implemented.

Thanks and Regards,
Shantanu

@mlehner616

Just adding a use case here. Let's say I have an app and implement a timeout inside the app. Let's assume there's a bug in this app which causes it to hang occasionally under certain conditions, so it never reaches its hard-coded "timeout" because it has essentially stopped responding. We should have a way in the infrastructure to automate a relatively simple, drop-dead deadline where the scheduler will kill a task that's unresponsive.

Nomad is better equipped to provide that failsafe at the infrastructure level than timeout code rewritten in every app, simply because it doesn't rely on the app's runtime to perform the kill.

@shantanugadgil
Contributor

I agree, but as a temporary workaround, how about timeout with the kill timeout parameter?

http://man7.org/linux/man-pages/man1/timeout.1.html
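
For example (the durations are placeholders):

timeout --signal=TERM --kill-after=30s 1h /path/to/slow/command with options

This sends SIGTERM after one hour and escalates to SIGKILL if the command is still alive 30 seconds after that.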

@Miserlou

Miserlou commented Aug 17, 2018

+1 for this, it's basic functionality for a job scheduler. Amazing this doesn't exist. @mlehner616 is obviously correct about why having the timeout checker inside the container itself is a boneheaded recommendation. We got bit by 3 hung jobs out of 100,000 that prevented our elastic infrastructure from scaling back down, costing a nice chunk of change.

@AndrewSav

@Miserlou as mentioned earlier in this thread, a workaround would be to wrap your app in a timeout script. There is an example of how you can do it above. That might save your bacon in the scenario you described.

@onlyjob
Contributor

onlyjob commented Aug 29, 2018

Timeout for batch jobs is an important safeguard. We can't rely on jobs' good behaviour... A job without a time limit is effectively a service, hence a timeout is crucial to constrain buggy tasks that might run for too long...

@wiedenmeier

I'd also very much like to see nomad implement this, for the use case where nomad's parameterized jobs are used as a bulk task processing system, similar to the workflow described here: https://www.hashicorp.com/blog/replacing-queues-with-nomad-dispatch

There are major advantages for us in using this workflow, as it takes advantage of infrastructure already in place to handle autoscaling, rather than having to set up a new system using Celery or similar task queues. The lack of a built-in timeout mechanism for batch jobs makes the infrastructure required for this fairly common (afaik) use case quite a bit more complex.

Handling the timeout in the tasks themselves is not a safe approach, for the reasons mentioned above, and would also increase the complexity of individual tasks, which is not ideal. Therefore the dispatcher must manage the timeout itself and kill batch jobs once it has been reached. This makes it inconvenient to manage jobs which need different timeouts using a single bulk task management system, as configuration for these needs to be stored centrally, separate from the actual job specification.

There are workarounds for this, but it would be very nice to see nomad itself handle timeouts, both for safety and to simplify using nomad.
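
As a hedged sketch of one such workaround (the job name and meta key below are made up, and the job's parameterized block would have to allow the key): the per-job limit can at least travel with the dispatch itself rather than sit in a central store, since nomad job dispatch accepts -meta key=value pairs and Nomad exposes them to the task as NOMAD_META_* environment variables:

nomad job dispatch -meta timeout_secs=1800 my-batch-job

A wrapper inside the task could then read NOMAD_META_timeout_secs and enforce the limit, along the lines of the wrapper script earlier in the thread.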

@Miserlou

@magical-chicken - I strongly, strongly recommend you avoid using Nomad for that purpose. There are claims made in that blog post which are simply untrue, and many people are being duped by it.

See more here:
#4323 (comment)

@wiedenmeier

@Miserlou Thanks for the heads up, that is a pretty serious bug in nomad, and is pretty concerning since we have a large amount of infrastructure managed by it. The volume of dispatches we are handling currently isn't too high, so I'm hoping nomad will be ok to use here in the short term, but long term I will definitely consider switching this system over to a dedicated task queue.

Nomad will crash with out of memory
Hopefully HashiCorp intends to fix this; maybe they could add a configuration option for servers to use a memory-mapped file to store state rather than risking an OOM kill, or even have servers start rejecting additional job registrations when they're running out of memory. There's really no case where it is acceptable for servers to crash completely, or for secondary servers to fail to elect a new leader after the leader is lost.

@jxgriffiths

+1 - are there any plans to include this feature any time soon? It seems pretty important. Wrapping tasks in a timeout script is a bit hacky.

@epetrovich

+1

@grainnemcknight

grainnemcknight commented May 26, 2019

+1

@sabbene

sabbene commented Nov 7, 2019

A job run-time limit is an essential feature of a batch scheduler. All major batch schedulers (PBS, Slurm, LSF, etc.) have this capability. I've seen a growing interest in a tool like Nomad, something that combines many of the features of a traditional batch scheduler with K8s. But without a run-time limit feature, integration into a traditional batch environment would be next to impossible. Is there any timeline for adding this feature to Nomad?

@karlem

karlem commented Jan 31, 2020

+1

@shantanugadgil
Contributor

@karlem you should add a +1 to the first post rather than a separate message.
That's how they track demand for a feature.

If you know more folks who might be interested in this, you should encourage them to do so as well! 😉

@BirkhoffLee

The absence of this feature just killed my cluster. A periodic curl job piled up to 600+ pending allocations and tens of running ones. This caused very high disk I/O usage from Nomad and effectively rendered the affected nodes totally unresponsive. Then Consul stopped working as well, because of I/O timeouts from other nodes.

Of course you could argue that curl has built-in timeout options, but the point is that if a task scheduler does not provide this feature, there is no simple and unified way to keep all jobs organised and safe when they can decide on their own how long they want to run.

@smaeda-ks

smaeda-ks commented Apr 8, 2022

Autoscaled GitHub Actions self-hosted runners are another good example, I thought. It's very much possible to run runners on Nomad as batch jobs and autoscale them using parameterized batch jobs, so tasks can be easily dispatched and triggered by GitHub webhooks upon receiving the queued events. Having max-lifetime support would be a great safeguard for that kind of dynamic job scheduling integrated with third-party systems.

@mikenomitch mikenomitch added the theme/batch Issues related to batch jobs and scheduling label Aug 11, 2022
@danielnegri

+1

@schmichael schmichael mentioned this issue Oct 22, 2022
@NickJLange

While a safety catch (timeout) is definitely a gap in the product, I don't think it captures the use case I'm looking for in #15011. I am looking for a stanza to run my type=service job M-F from 08:00-20:00, with a user-defined stop command when a driver supports it.

@shantanugadgil
Contributor

Today I hit this for Docker jobs. Our system is full of Docker cron jobs, and one job was stuck for 20 (twenty) days. 🥴

Without a timeout parameter, could there be some other "systematic" way to detect stuck jobs (or jobs running for too long)?

@jxgriffiths

jxgriffiths commented Oct 27, 2022 via email

@shantanugadgil
Contributor

@jxgriffiths thanks for the idea ...

Since your post we have been putting together a standalone Nomad checker job which goes through all batch jobs to figure out "stuck" allocations.

The allocation-based search was easy enough using a combination of curl, jq, date and bash. (We wanted to avoid Python as much as possible.)

We also ended up putting together a jobs endpoint query for figuring out pending jobs too, but I think that is easily discoverable via metrics.

The subsequent question was how to individually tune the timeout for each job.

What we have done for this is to add a job level meta parameter, which the checker job will use as the configuration parameter to eventually kill the particular job.

In case one has multiple groups/tasks in a batch job, one could also move the meta down into the groups or tasks as per requirement.
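
For anyone interested, a condensed sketch of that kind of checker is below (hedged: it assumes the job-level meta key is named timeout_secs, uses the /v1/allocations, /v1/job/:job_id and /v1/allocation/:alloc_id/stop endpoints, and ignores ACL tokens and namespaces):

#!/usr/bin/env bash
# Stop allocations that have outlived their job's meta timeout.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
now=$(date +%s)

# Walk all running allocations.
curl -s "${NOMAD_ADDR}/v1/allocations" |
  jq -r '.[] | select(.ClientStatus == "running") | "\(.ID) \(.JobID) \(.CreateTime)"' |
  while read -r alloc_id job_id create_ns; do
    limit=$(curl -s "${NOMAD_ADDR}/v1/job/${job_id}" | jq -r '.Meta.timeout_secs // empty')
    [ -z "$limit" ] && continue                  # job has no timeout meta: leave it alone

    age=$(( now - create_ns / 1000000000 ))      # CreateTime is in nanoseconds
    if [ "$age" -gt "$limit" ]; then
      echo "alloc ${alloc_id} of ${job_id} ran ${age}s (> ${limit}s), stopping it"
      curl -s -X POST "${NOMAD_ADDR}/v1/allocation/${alloc_id}/stop" > /dev/null
    fi
  done

Something like this can itself run as a short-interval periodic batch job.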
