-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nomad job restart #11533
Nomad job restart #11533
Conversation
3344707
to
84b1861
Compare
84b1861
to
4af0d6e
Compare
Thanks @shishir-a412ed! As you mentioned, this is still a work in progress so I will convert it to |
4af0d6e
to
c03c7c1
Compare
c03c7c1
to
3fa0b0a
Compare
8cd262a
to
64575f8
Compare
64575f8
to
306b67c
Compare
Hey @shishir-a412ed, just wanted to check in and see how this was going? Anything blocking? |
306b67c
to
65d2883
Compare
12e373b
to
1164d80
Compare
1164d80
to
09cc0d4
Compare
09cc0d4
to
e6e88be
Compare
e6e88be
to
36e65d0
Compare
36e65d0
to
537cc03
Compare
Signed-off-by: Shishir Mahajan <[email protected]>
Signed-off-by: Shishir Mahajan <[email protected]>
Signed-off-by: Shishir Mahajan <[email protected]>
Signed-off-by: Shishir Mahajan <[email protected]>
Signed-off-by: Shishir Mahajan <[email protected]>
537cc03
to
10eb6c9
Compare
FYI, I spoke with @shishir-a412ed, who has a great start on this PR and he probably won't be able to get around to finishing it. If anybody internal to Nomad team or external wants to take a swing at finishing this off, feel free. Please let us know if you are doing so, so multiple people don't attempt it at the same time! |
Closing this in favour of #16278. Thank you so much for the great start @shishir-a412ed! |
This command caused some surprise to me. But I was seeing the service was brought down quickly, looks like it restart the alloc without waiting the restarted one back online. And users see error during the restart. Is there anyway for me to make restart graceful? |
Hi @zhixinwen 👋 This is the expected and documented behaviour of this command. From the docs:
I thought there was an open issue to track this feature request, but I can't seem to find it. If you don't mind it could be useful to create on. But implementing it would be a little tricky, since this command runs purely on the machine that called the command, which may not have access to Consul.
The |
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
@schmichael
Thank you for writing the
nomad-job-restart-rfc
. This was super helpful!This PR is not ready for review. I am just opening this, so I can put in my understanding of the flow, and ask some questions.
We can use this PR as a brainstorming ground to progress on the RFC, and eventually get it in shape to be merged.
Based on
nomad/checklist-command
This PR only has changes on
Code
section. NoDoc
changes are included right now.The PR doesn't have implemented the
monitor
(anddetach
) mode where thejob restart
by default will start inmonitor
mode. And user can pass-detach
to run the command indetach
mode.In phase 1, we ll just implement:
nomad job restart -batch-size n -batch-wait d <job_id>
nomad job restart -attach <restart_id>
nomad job restart -cancel <restart_id>
Pause/resume
can come in Phase 2.My understanding around the changes are as follows:
New CLI Command:
nomad job restart
From the CLI, you jump to HTTP API client code in
api/jobs.go
.From the HTTP API client, you call
Restart
which calls into the actual HTTP API defined incommand/agent/job_endpoint.go
.jobRestart
function is called incommand/agent/job_endpoint.go
. This decodes the HTTP request into API structapi.JobRestartRequest
which can then be used to map into the RPC structstructs.JobRestartRequest
and send an RPC.api.JobRestartRequest
structs.JobRestartRequest
structs.JobRestartResponse
Restart RPC
is called. This will check permissions on the auth-token, get the snapshot from the state store, get the allocations for the job, Restart all the allocations (Restarting of allocations is not coded yet). Everytime we restart an allocation, we update the modify index and commit this into raft.New raft message type defined:
JobRestartRequestType
Raft apply
here
raft log message applied by FSM in
nomad/fsm.go
This will call
applyRestartJob
. This will decode the raft message into RPC struct and call into the state store:n.state.UpdateJobRestart
State store logic for committing the transaction to boltDB: https://github.com/hashicorp/nomad/blob/334470746e11a395147f4d60252f017321f072af/nomad/state/state_store.go#L969-L989
New boltDB table schema defined: https://github.com/hashicorp/nomad/blob/334470746e11a395147f4d60252f017321f072af/nomad/state/schema.go#L101-L120
Like I mentioned above, this patch might not be ready for review yet. It has a bunch of print statement left out. I added those just to test the flow and see if the data is being passed around correctly. E.g When I run this on a test job with 5 allocations, I can print some info like:
Questions:
What would be the best way to restart allocations in the RPC code? (Any code references in the RPC code I could refer to?).
Should we signal the restart right before
raftApply
i.e. for each alloc that we restart, we update the modify index and apply a new log message into raft. My understanding of the indexes is there are two indexes
CreateIndex
andModifyIndex
. The first time, we ll create theJobRestartObject
in the server, we ll update theCreateIndex
, and from that point onwards everytime we restart an alloc,ModifyIndex
will be updated.I am still not 100% clear on the boltDB table schema since it only has a
key
e.g.id
in my table. Does that mean we can just store the entireJobRestartRequest
Object against that key. If you can take a look at the state store code where I am committing the transaction and let me know if I am missing something.This is probably a lot to read, and if this doesn't make a lot of sense, let me know. We can also carry some of this conversation back to our internal slack. I just wanted to put the patch out so it's easier to look at some code and go from there.
References: Issue #698