Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding timeouts to group and task #18456

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Adding timeouts to group and task #18456

wants to merge 1 commit into from

Conversation

mikenomitch
Copy link
Contributor

@mikenomitch mikenomitch commented Sep 12, 2023

During a hack week I took a stab at an initial implementation of timeouts for tasks. Pushing up this draft PR as a starting point for discussion. (Also note to anybody external to HashiCorp reading this, I'm a PM, not a full engineer, so don't get too excited just yet 😄)

What this does:

  • This creates a new taskrunner hook and new allocrunner hook for timeouts.
  • In the jobspec, if a new "timeout" block is set, the runners start a goroutine that waits for the deadline to pass. The allocation or task (depending on the block) is then killed at the deadline.

Open Product Questions:

  • I am currently failing the allocations if the group timeout happens, and failing them at the task level if FailOnTimeout is set. It kind of feels like there should be a top level "timed out" status instead? If a task fails due to timeout, should a postStop task get run? If a preStart tails times out, should a main task get run? Do people need customization of this behavior?
  • Right now this depends on a Nomad client running. Is that an issue for people?
  • I think timezones need to be handled. I think you could just pass in a TZ at the jobspec level and us that? If not, default to the client's TZ? To UTC? Could we just not do this initially and always go UTC?
  • This includes the functionality on service and system jobs... would that ever make sense?

Open Technical Questions:

  • This is saving information about the timeout in Nomad Variables. Its sort of cool that this is inspectable by the user, but I think this should probably be "its own thing" in the Nomad state store. Any disagreement there? What's the "right way" to store this?
  • Is there a better way to wait in the goroutine? I feel like this probably checks in too often and is wasting cycles?

I'm sure there are more open questions/issues (and unfortunately I've forgotten a lot since I did this over a month ago), but wanted to get the conversation going!

Fixes #1782

@tgross tgross added the stage/needs-rebase This PR needs to be rebased on main before it can be backported to pick up new BPA workflows label May 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stage/needs-rebase This PR needs to be rebased on main before it can be backported to pick up new BPA workflows
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[feature] Timeout for batch jobs
2 participants