Nomad job restart #11533

shishir-a412ed · 2021-11-18T22:04:00Z

@schmichael
Thank you for writing the nomad-job-restart-rfc. This was super helpful!

This PR is not ready for review. I am opening it so that I can write down my understanding of the flow and ask some questions.
We can use this PR as a brainstorming ground to make progress on the RFC and eventually get it into shape to be merged.

Based on nomad/checklist-command
This PR only has changes for the Code section. No Docs changes are included right now.
The PR does not yet implement the monitor (and detach) mode, where job restart starts in monitor mode by default and the user can pass -detach to run the command in detach mode.

In phase 1, we'll just implement:

nomad job restart -batch-size n -batch-wait d <job_id>
nomad job restart -attach <restart_id>
nomad job restart -cancel <restart_id>
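A minimal sketch of how -batch-size might translate into groups of allocations to restart. The helper name splitIntoBatches and the short alloc IDs are hypothetical and not part of this PR; the sketch only illustrates the batching idea.

```go
package main

import "fmt"

// splitIntoBatches groups allocation IDs into consecutive batches of at most
// batchSize entries. A batchSize below 1 is treated as 1.
// Hypothetical helper for illustration; not taken from the Nomad code base.
func splitIntoBatches(allocIDs []string, batchSize int) [][]string {
	if batchSize < 1 {
		batchSize = 1
	}
	var batches [][]string
	for start := 0; start < len(allocIDs); start += batchSize {
		end := start + batchSize
		if end > len(allocIDs) {
			end = len(allocIDs)
		}
		batches = append(batches, allocIDs[start:end])
	}
	return batches
}

func main() {
	// Placeholder alloc IDs standing in for the job's allocations.
	allocs := []string{"a1", "a2", "a3", "a4", "a5"}
	for i, batch := range splitIntoBatches(allocs, 2) {
		fmt.Printf("batch %d: %v\n", i+1, batch)
	}
}
```

With a batch size of 2 and five allocations, the last batch holds the single remaining allocation. The wait between batches (-batch-wait) would sit between iterations of the loop in main.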

Pause/resume can come in Phase 2.

My understanding of the changes is as follows:

New CLI Command: nomad job restart
From the CLI, you jump to the HTTP API client code in api/jobs.go.
From the HTTP API client, you call Restart, which calls into the actual HTTP API defined in command/agent/job_endpoint.go.
The jobRestart function is called in command/agent/job_endpoint.go. It decodes the HTTP request into the API struct api.JobRestartRequest, which is then mapped into the RPC struct structs.JobRestartRequest and sent as an RPC.
- The following structs are defined:
  - API request struct: api.JobRestartRequest (I haven't defined the response yet. I'm still considering what the API response struct needs. Do we just send the UUID back to the user?)
  - RPC request struct: structs.JobRestartRequest
  - RPC response struct: structs.JobRestartResponse
The Restart RPC is called. It checks permissions on the auth token, gets a snapshot from the state store, gets the allocations for the job, and restarts all the allocations (restarting allocations is not coded yet). Every time we restart an allocation, we update the modify index and commit this into raft.
A new raft message type is defined: JobRestartRequestType
Raft apply here
The raft log message is applied by the FSM in nomad/fsm.go.
This calls applyRestartJob, which decodes the raft message into the RPC struct and calls into the state store: n.state.UpdateJobRestart
State store logic for committing the transaction to boltDB: https://github.com/hashicorp/nomad/blob/334470746e11a395147f4d60252f017321f072af/nomad/state/state_store.go#L969-L989
New boltDB table schema defined: https://github.com/hashicorp/nomad/blob/334470746e11a395147f4d60252f017321f072af/nomad/state/schema.go#L101-L120
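
The flow above can be sketched end to end in a few lines. This is a simplified, self-contained illustration and not the code in this PR: the lowercase type names, the one-byte message type prefix, the numeric value of the message type, and the use of JSON as the wire encoding are all assumptions made to keep the example runnable. The field names are taken from the debug output further down.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// messageType tags each log entry so the FSM knows how to decode it.
type messageType uint8

// jobRestartRequestType uses a made-up value for this sketch.
const jobRestartRequestType messageType = 1

// apiJobRestartRequest stands in for api.JobRestartRequest (HTTP layer).
type apiJobRestartRequest struct {
	JobID     string
	BatchSize int
	BatchWait int
}

// rpcJobRestartRequest stands in for structs.JobRestartRequest (RPC layer).
type rpcJobRestartRequest struct {
	ID        string
	JobID     string
	BatchSize int
	BatchWait int
	Status    string
}

// toRPC maps the API struct onto the RPC struct, as the HTTP handler would.
func toRPC(id string, in apiJobRestartRequest) rpcJobRestartRequest {
	return rpcJobRestartRequest{
		ID:        id,
		JobID:     in.JobID,
		BatchSize: in.BatchSize,
		BatchWait: in.BatchWait,
		Status:    "running",
	}
}

// encodeLog prefixes the encoded request with its message type.
func encodeLog(t messageType, v interface{}) ([]byte, error) {
	payload, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	return append([]byte{byte(t)}, payload...), nil
}

// fsm is a toy state machine; restarts plays the role of the state store table.
type fsm struct {
	restarts map[string]rpcJobRestartRequest
}

// apply dispatches on the message type and writes the decoded request to state.
func (f *fsm) apply(entry []byte) error {
	if len(entry) == 0 {
		return errors.New("empty log entry")
	}
	switch messageType(entry[0]) {
	case jobRestartRequestType:
		var req rpcJobRestartRequest
		if err := json.Unmarshal(entry[1:], &req); err != nil {
			return err
		}
		f.restarts[req.ID] = req
		return nil
	default:
		return fmt.Errorf("unknown message type %d", entry[0])
	}
}

func main() {
	// 1. The HTTP layer decodes the request body into the API struct.
	var apiReq apiJobRestartRequest
	body := []byte(`{"JobID":"count","BatchSize":5,"BatchWait":10}`)
	if err := json.Unmarshal(body, &apiReq); err != nil {
		panic(err)
	}
	// 2. The RPC layer builds a log entry; 3. the FSM applies it to state.
	entry, err := encodeLog(jobRestartRequestType, toRPC("restart-1", apiReq))
	if err != nil {
		panic(err)
	}
	f := &fsm{restarts: map[string]rpcJobRestartRequest{}}
	if err := f.apply(entry); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", f.restarts["restart-1"])
}
```

In the real server the entry would go through raft replication between the encode and apply steps; here the two are called back to back only to show how the pieces connect.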

As I mentioned above, this patch is not ready for review yet. It still has a number of debug print statements left in. I added them only to test the flow and check that the data is being passed around correctly. For example, when I run this on a test job with 5 allocations, I can print some info like:

Nov 18 20:08:10 linux nomad[1955]:     2021-11-18T20:08:10.489Z [INFO]  client: started client: node_id=3757b276-95bb-1ed4-4fd7-c3169f6d919d
Nov 18 20:08:18 linux nomad[1955]:     2021-11-18T20:08:18.984Z [INFO]  client: node registration complete
Nov 18 20:10:06 linux nomad[1955]: HELLO HELLO: nomad/job_endpoint.go JobRestartRequest: &{ID:58d503f8-2c54-d3f5-4ca3-8484d123206d JobID:count BatchSize:5 BatchWait:10 Status:running RestartedAllocs:[] StartedAt:2021-11-18 20:10:06.373692499 +0000 UTC m=+122.230232254 UpdatedAt:2021-11-18 20:10:06.373692568 +0000 UTC m=+122.230232321 CreateIndex:0 ModifyIndex:0 WriteRequest:{Region:global Namespace:default AuthToken: IdempotencyToken: InternalRpcInfo:{Forwarded:false}}}
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: 4089ce63-50e3-c370-54fe-6ab8a90d647e
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: 910ed5bc-20dc-c027-1a40-e3d9c5d1eb2a
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: c8d77bff-95bb-b4cb-b1d7-1dc2e8f294ef
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: df98f5ce-b716-8e6e-cd48-55c37cd2d1f1
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: fd9661e3-b167-6916-0350-84da97cae0a2
Nov 18 20:10:06 linux nomad[1955]: HELLO nomad/fsm.go: applyRestartJob
Nov 18 20:10:06 linux nomad[1955]: Hello JobRestartRequest object: {ID:58d503f8-2c54-d3f5-4ca3-8484d123206d JobID:count BatchSize:5 BatchWait:10 Status:running RestartedAllocs:[] StartedAt:2021-11-18 20:10:06.373692499 +0000 UTC UpdatedAt:2021-11-18 20:10:06.373692568 +0000 UTC CreateIndex:0 ModifyIndex:0 WriteRequest:{Region:global Namespace:default AuthToken: IdempotencyToken: InternalRpcInfo:{Forwarded:false}}}
Nov 18 20:10:06 linux nomad[1955]: HELLO state/state_store.go: UpdateJobRestart
Nov 18 20:10:06 linux nomad[1955]: HELLO: state/state_store.go: updateJobRestartImpl

Questions:

What would be the best way to restart allocations in the RPC code? Is there existing RPC code I could use as a reference?
Should we signal the restart right before raftApply?
That is, for each alloc that we restart, we update the modify index and apply a new log message into raft. My understanding is that there are two indexes, CreateIndex and ModifyIndex. The first time we create the JobRestartRequest object in the server, we set the CreateIndex, and from that point onwards, every time we restart an alloc, ModifyIndex is updated.
I am still not 100% clear on the boltDB table schema, since it only has a key, e.g. id in my table. Does that mean we can just store the entire JobRestartRequest object against that key? Could you take a look at the state store code where I am committing the transaction and let me know if I am missing something?
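
To make the last two questions concrete, here is a minimal sketch of a table keyed only by ID that stores the whole object and maintains the two indexes. It uses a plain Go map rather than the real state store, and the names jobRestart, restartTable, and upsert are hypothetical. It assumes the convention that CreateIndex is set once on first insert and ModifyIndex is overwritten on every write.

```go
package main

import "fmt"

// jobRestart is a trimmed-down stand-in for the stored restart object.
type jobRestart struct {
	ID              string
	JobID           string
	RestartedAllocs []string
	CreateIndex     uint64
	ModifyIndex     uint64
}

// restartTable stores whole objects under a single key, the restart ID.
type restartTable struct {
	rows map[string]*jobRestart
}

func newRestartTable() *restartTable {
	return &restartTable{rows: make(map[string]*jobRestart)}
}

// upsert writes r under its ID. index is the index of the log entry being
// applied. An existing row keeps its CreateIndex; ModifyIndex always moves.
func (t *restartTable) upsert(index uint64, r jobRestart) {
	if existing, ok := t.rows[r.ID]; ok {
		r.CreateIndex = existing.CreateIndex
	} else {
		r.CreateIndex = index
	}
	r.ModifyIndex = index
	t.rows[r.ID] = &r
}

func main() {
	tbl := newRestartTable()

	// First write: the restart object is created.
	tbl.upsert(10, jobRestart{ID: "r1", JobID: "count"})

	// Later write: one alloc has been restarted, so the object is updated.
	tbl.upsert(12, jobRestart{ID: "r1", JobID: "count", RestartedAllocs: []string{"a1"}})

	row := tbl.rows["r1"]
	fmt.Printf("create=%d modify=%d restarted=%v\n",
		row.CreateIndex, row.ModifyIndex, row.RestartedAllocs)
}
```

Under these assumptions a single id key is enough, because every update replaces the full object and the indexes record when it was first and last written.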

This is probably a lot to read, so if anything doesn't make sense, let me know. We can also carry some of this conversation back to our internal Slack. I just wanted to put the patch out so it's easier to look at some code and go from there.

References: Issue #698

lgfa29 · 2021-12-21T18:32:57Z

Thanks @shishir-a412ed! As you mentioned, this is still a work in progress, so I will convert it to Draft until it's ready for review. I think you will be able to switch it back yourself, but if you can't, just let us know 🙂

mikenomitch · 2022-02-01T19:20:43Z

Hey @shishir-a412ed, just wanted to check in and see how this was going? Anything blocking?

mikenomitch · 2022-10-19T18:35:36Z

FYI, I spoke with @shishir-a412ed, who has made a great start on this PR but probably won't be able to get around to finishing it.

If anybody, internal to the Nomad team or external, wants to take a swing at finishing this off, feel free. Please let us know if you are doing so, so that multiple people don't attempt it at the same time!

lgfa29 · 2023-03-01T01:10:10Z

Closing this in favour of #16278.

Thank you so much for the great start @shishir-a412ed!

zhixinwen · 2024-06-21T21:54:45Z

This command surprised me.
When I ran nomad job restart $JOB_ID, I expected each alloc in the job to restart one by one without causing issues in our running service. I expected restart to work similarly to update.

But the service was brought down quickly. It looks like the command restarts the next alloc without waiting for the restarted one to come back online, and users see errors during the restart. Is there any way for me to make the restart graceful?

lgfa29 · 2024-06-21T23:18:13Z

Hi @zhixinwen 👋

This is the expected and documented behaviour of this command. From the docs:

The command waits until the new allocations have client status ready before proceeding with the remaining batches. Services health checks are not taken into account.

I thought there was an open issue to track this feature request, but I can't seem to find it. If you don't mind, it could be useful to create one. Implementing it would be a little tricky, though, since this command runs purely on the machine that called it, which may not have access to Consul.

Is there any way for me to make the restart graceful?

The -batch-wait flag can help you. If you have a sense of how long it usually takes for the service to become healthy, you could set a fixed wait time. A more complex solution could involve using -batch-wait=ask and monitoring the service health out-of-band, then sending a response to the job restart command's stdin once the service is healthy.

shishir-a412ed force-pushed the job_restart branch from 3344707 to 84b1861 Compare November 19, 2021 22:11

shishir-a412ed force-pushed the job_restart branch from 84b1861 to 4af0d6e Compare November 19, 2021 22:11

schmichael self-requested a review December 21, 2021 17:33

lgfa29 marked this pull request as draft December 21, 2021 18:33

shishir-a412ed force-pushed the job_restart branch from 4af0d6e to c03c7c1 Compare January 4, 2022 18:29

shishir-a412ed force-pushed the job_restart branch from c03c7c1 to 3fa0b0a Compare January 6, 2022 18:21

shishir-a412ed force-pushed the job_restart branch from 8cd262a to 64575f8 Compare January 21, 2022 20:35

shishir-a412ed force-pushed the job_restart branch from 64575f8 to 306b67c Compare February 1, 2022 14:53

lgfa29 added the theme/api HTTP API and SDK issues label Feb 4, 2022

shishir-a412ed force-pushed the job_restart branch from 306b67c to 65d2883 Compare February 18, 2022 20:49

shishir-a412ed mentioned this pull request Apr 6, 2022

Rolling restart of a nomad job #12490

Closed

shishir-a412ed force-pushed the job_restart branch from 136d7cb to 12e373b Compare April 8, 2022 17:56

shishir-a412ed force-pushed the job_restart branch from 12e373b to 1164d80 Compare April 13, 2022 17:59

shishir-a412ed force-pushed the job_restart branch from 1164d80 to 09cc0d4 Compare April 22, 2022 21:32

shishir-a412ed force-pushed the job_restart branch from 09cc0d4 to e6e88be Compare May 3, 2022 00:28

shishir-a412ed force-pushed the job_restart branch from e6e88be to 36e65d0 Compare May 16, 2022 18:07

shishir-a412ed force-pushed the job_restart branch from 36e65d0 to 537cc03 Compare May 18, 2022 17:20

shishir-a412ed added 5 commits June 27, 2022 11:24

Job restart command.

aa83895

Signed-off-by: Shishir Mahajan <[email protected]>

Nomad job restart: updates.

ccf5f38

Signed-off-by: Shishir Mahajan <[email protected]>

Nomad job restart: fsm and state store updates.

aac627a

Signed-off-by: Shishir Mahajan <[email protected]>

Nomad job restart: Updates.

005bbfa

Signed-off-by: Shishir Mahajan <[email protected]>

More updates.

10eb6c9

Signed-off-by: Shishir Mahajan <[email protected]>

shishir-a412ed force-pushed the job_restart branch from 537cc03 to 10eb6c9 Compare June 27, 2022 18:24

mikenomitch added the theme/cli label Sep 20, 2022

mikenomitch added the help-wanted We encourage community PRs for these issues! label Oct 19, 2022

lgfa29 self-assigned this Feb 10, 2023

lgfa29 closed this Mar 1, 2023

lgfa29 mentioned this pull request Jul 24, 2023

cannot restart jobs with attached CSI volumes #17756

Closed
