cannot restart jobs with attached CSI volumes #17756
Okay, since the OP, I've terminated all my Elasticsearch instances and brought up new ones - this is obviously bad. I attempted to reschedule the job again:
To which, I see this:
Last time I pressed "Y". This time I pressed "n". Initially it seemed like it did nothing, but it seems that one instance did allocate successfully on a new node. Uh... Yay? Why does this "failure" prompt not attempt to roll the job back? Now I'm in a pretty weird state... the job is sorta half deployed. Two of the three instances are still on their old host and not migrated.
After all of that, I re-ran the exact same command and it rotated the other 5 instances, seemingly just fine:
Seems like I've determined that if I individually… EDIT: this might not be true. Seems like it worked once or twice, but now I see it hanging again.
Thanks for the report @josh-m-sharpe. I'm still not sure what's going on, but given that you once had success on re-running the command, I wonder if something in the allocation stop procedure is not properly waiting for the volume to detach before reschedule. Do you happen to have any server or client logs from around the time this problem happened? If there is any sensitive data you can email them to [email protected] with the issue ID in the subject.
@lgfa29 In retrospect I suspect I never had success running

I can get plenty of logs for you, but I think the important thing is the lack of logs that should indicate that Nomad is attempting to unmount the volume. Those logs don't exist. I can find logs all day long of the volume being attached when it should be, but I only see errors on the new hosts about unavailable volumes and nothing about even attempting to unmount the volumes from the old hosts. Those logs are symptoms of the real problem. So, I suspect Nomad isn't even trying to unmount the volumes in this case.

I'd be more than happy to pair on this if you'd be down. I work ET hours usually.
The volumes are unmounted and unpublished here: nomad/client/allocrunner/csi_hook.go Lines 161 to 191 in e53955b
There isn't a lot to log here, but you can see these messages at the nomad/client/allocrunner/alloc_runner_hooks.go Lines 231 to 261 in e53955b
Your CSI plugin may also have some log information, in both the node and controller plugins, but each CSI plugin behaves a little differently.
Okay, maybe found the issue (full trace below)
...but I don't think that's true? I have a
Context is:
State of the cluster. This all looks correct to me: the allocation was moved to the correct new node, and the aws-ebs-nodes are all up and running. Full output from the moment of
Thanks for the extra info @josh-m-sharpe. So it seems like the Nomad client is not able to communicate with the CSI controller plugin. CSI has lots of moving parts, so let me try to explain what appears to be going on.

When an allocation is stopped, the Nomad client that was running it tries to contact the CSI controller to detach the volume from itself, meaning that the EBS volume that was in use becomes available to be attached to another Nomad client. The CSI node plugin (which is what you showed in the

From your logs the problem is that the Nomad client is failing to communicate with the CSI controller plugin, so the EBS volume stays attached to the current client, and when the new allocation tries to run on a different client it fails to attach the volume.

So a good next step would be to make sure your CSI controller is running properly and that all your Nomad clients are able to communicate with it. You can check the logs for the CSI controller allocation and look for calls to
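(For reference, a hedged way to check on the controller from the CLI; the plugin ID `aws-ebs0` and the allocation ID are placeholders, not taken from this issue.)

```shell
# Is the controller plugin registered and healthy?
nomad plugin status aws-ebs0

# Follow the controller allocation's logs and watch for detach/unpublish
# activity around the time the old allocation is stopped.
nomad alloc logs -f <controller-alloc-id>
```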
@lgfa29 Unfortunately, I'm not seeing any errors from the controller (or node) allocations. I've got a lot for you here: four log dumps down below, and a quick summary:
So, the ebs-node reflects that the volume has been detached. It actually has been detached. Yet the controller never got the message? No logs, nothing. :(

1. controller logs when job is started
2. restart -reschedule (only 3 nodes present)
3. Launching 3 replacement nodes
4.
So stepping back to what I'm attempting: I'm trying to replace the Nomad hosts by standing up new ones (in the same set of availability zones) and then rotating allocations onto them.

When I start the new nodes, there are two ebs-node allocations in the same availability zone (one on each host). I'm starting to wonder (without evidence) if that's a problem. They are both responsible for the same EBS volumes. Is the controller maybe getting confused as to which ebs-node allocation is supposed to be doing what?

This is my ebs-node job - is there something missing?
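(For context, a minimal sketch of a typical aws-ebs node plugin job; the plugin ID, image tag, and resource values are assumptions, not the author's actual job.)

```hcl
job "plugin-aws-ebs-nodes" {
  datacenters = ["dc1"]
  type        = "system" # one node plugin allocation per client

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        # Image tag is illustrative; use whatever version you run.
        image = "public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.20.0"
        args = [
          "node",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
        privileged = true # node plugins need privileged access to mount devices
      }

      csi_plugin {
        id        = "aws-ebs0"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}
```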
Yup! That's the one. Do you see them just once though? They should be called periodically by all Nomad clients:
That's the weird part; I would expect there to be RPC calls to detach the volume 🤔 Is the plugin reported as healthy?

The specific error that is happening is coming from here: nomad/nomad/client_csi_endpoint.go Lines 281 to 296 in 1e7726c
So either:
So before you issue the
I'm not too familiar with the internals of the EBS plugin, so I can't comment much about it. Your deployment seems OK to me. There are some newer image versions available, so it may be worth trying to update the job. And what does your controller plugin job look like?
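(A hedged set of checks along these lines may help narrow things down before re-running the restart; the plugin, volume, and node identifiers are placeholders.)

```shell
# Are the controller and node plugins registered and healthy, with the
# expected counts?
nomad plugin status aws-ebs0

# Where does Nomad think the volume is currently claimed?
nomad volume status elasticsearch-data

# Which clients are eligible, and which one runs the controller allocation?
nomad node status -verbose
```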
Before I go do all of that, I take issue with this:
I am explicitly setting the node where the volume is currently attached (and where an allocation is actually using it) to ineligible before executing

To be crystal clear: there is absolutely a different node, marked as eligible, with available resources, in the same availability zone, that should absolutely take the volume and allocation. If you've gleaned from the codez that ALL nodes must be eligible, then maybe we have our problem?
Oooh that's a good point. Would you be able to leave the node running the CSI controller plugin as eligible and see if that fixes the problem? I don't think eligibility needs to be taken into account here, so maybe that's what we need to fix.
Yes, but the new nodes don't have the CSI controller; that's the key component that is missing. The client that is running the controller is now ineligible, and I think Nomad is incorrectly ignoring it. So if leaving the controller eligible fixes the problem, we may indeed have found our problem.
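(A hedged illustration of that experiment, with placeholder node and job names: keep the client running the controller plugin eligible until the volumes have actually detached.)

```shell
# Mark only the old clients that do NOT run the controller as ineligible.
nomad node eligibility -disable <old-node-without-controller>

# Re-enable the client running the controller if it was already marked ineligible.
nomad node eligibility -enable <old-node-running-controller>

# Then try the reschedule again.
nomad job restart -reschedule elasticsearch
```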
Okay, well, the controller has to be moved in this situation, so if I

This looks like it worked. Confirmed the controller alloc moved to a new host.
Okay, let's reschedule elasticsearch:
So the 6th, non-restarted alloc was "moved" but is pending on the new host, so let's just try restarting everything again:
This is not true. It did not restart successfully. One of the allocations is still pending. These things have health checks. The last alloc is NOT healthy. How is that a success? According to
The OLD node where the volume was released looks good:
The NEW node where the volume should have been mounted shows no evidence of mounting the volume for this allocation.
Interestingly, when I shut down the old node, this "bad" allocation fixed itself. Here are logs from the controller leading up to when I shut down the old node:
And unfortunately, I lost any logs on the OLD host's ebs-node when this happened. That said, per my previous comment, the volume was successfully detached.
Ugh, so setting all that up and running it a 2nd and 3rd time worked, both times. Good sign. Let me script this whole thing up and run it 15 more times to see what happens. There are still some fishy things happening here.
What does this even mean:
When that happens, why is my cluster in this state?

So... there's an error, Nomad can't find a place for the new allocation, but it still pulled down the allocation it was rescheduling? It just hangs here.
This is SUPER ambiguous. Feels like there's no good option. Rerunning
Do we think #17996 impacts this issue?
Hi @josh-m-sharpe, I had a reply draft here but then I got roped into some 1.6.0 release and post-release work. Yes, we believe #17996 will also help your situation here. In #15415 we've learned that the EBS plugin is not as reliable as the CSI spec expects, so #17996 makes some improvements and also increases visibility when things don't go as expected (#17996 (comment)). For this specific issue, looking at the placement issues I see a few different reasons:
So I think we need to try a more methodical approach:
Screenshots and command outputs help paint the overall picture, but in order to actually debug problems we do need log information.
Honestly, I did all that, including all the validation - and it all worked that first time above. I think I'll wait for v1.6.1 w/ that fix before trying again. I think there are some tangential issues here that maybe deserve their own issue threads:
I've been testing 1.5.8 most of today. Results are mixed. It seems the issue may be resolved when using

So, trying another angle: adding an arbitrary variable to the job file and using

The 3rd approach was using drains, which seems like the appropriate solution anyway, and actually seems to play nice.
I'm glad to hear 1.5.8 improved things. Just to pull a bit on the last threads to check if there's any follow-up work needed.
Would you mind expanding on what "gracefully" means here, or which behaviour you were expecting? The
We know this is not perfect, and we had initially planned something more robust, but we found that approach to be a lot more complex than anticipated and decided to start with this simpler version. If there's a need to wait before proceeding to the next restart you could use the
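(As a hedged illustration, recent `nomad job restart` versions support pacing flags along these lines; the exact flag names and values here are assumptions, not quoted from this thread, so check `nomad job restart -help` on your version.)

```shell
# Restart one allocation at a time and wait between batches, rescheduling
# onto new clients; the job name is a placeholder.
nomad job restart -reschedule -batch-size=1 -batch-wait=30s elasticsearch
```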
This is indeed strange, but I wonder if the problem may have been which field was changed. Some changes are in-place updates, where only the tasks may be recreated. If that happened, and the CSI volume got detached, the task may fail to start again on the old host.
So this approach was just about draining the old nodes?
Worth recalling my goal of finding some way to replace the nomad hosts running elasticsearch. ES has to be zero-downtime for us, so this is all just planning for the day the nomad hosts have to be switched out for whatever reason. Responses inline.
"Gracefully" just means zero-downtime to us. I need Nomad to correctly pull services in and out of Consul so that Traefik routes requests to healthy allocations.
Okay, so my initial comment was right after I set up 1.5.8 with the fixed CSI issue. At that point in this here journey, I tried the hacky variable-in-job approach (more on that below), but nothing was learned there.

Next I retried using drain to move the allocations and that worked much better - and knowing drain is well-defined to be controlled by the migrate block, this seemed directionally the right thing to do. In doing so I uncovered some additional attributes that I needed specifically for Elasticsearch. Specifically I added a

I just set things up again (for you) to re-run
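(A hedged sketch of the kind of migrate settings one might add for a drain-based rotation; the job name, count, and values are illustrative, not the author's actual configuration.)

```hcl
job "elasticsearch" {
  datacenters = ["dc1"]

  group "es" {
    count = 3

    migrate {
      max_parallel     = 1          # move one allocation at a time
      health_check     = "checks"   # wait on service checks, not just task state
      min_healthy_time = "30s"      # allocation must stay healthy this long
      healthy_deadline = "10m"      # give ES time to rejoin the cluster
    }

    # ... tasks, volumes, and services omitted for brevity
  }
}
```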
Yea, strange for sure, but this had "hack" written all over it. I'm not super interested in re-inventing this scenario to test.
Yes, truth be told, in retrospect I probably only ever moved on to
Looking at this a bit more this morning. I think there definitely might be something not quite right with

Running
So, my best guess is that restart isn't honoring the health of the service. Just because the alloc is "running" doesn't mean it is healthy. I think? Notably there's no configuration option in

Maybe my expectation here is wrong? Come to think of it, restart says

Here's the restart output:
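(A hedged way to check whether an allocation is actually healthy rather than merely "running"; the job name and allocation ID are placeholders.)

```shell
# Deployment and allocation summary for the job.
nomad job status elasticsearch

# Task states and recent events for a specific allocation.
nomad alloc status -verbose <alloc-id>
```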
Generally, it's peculiar to me that an allocation goes into "pending" status when it's in fact shutting down.
@lgfa29 I think I'm done blowing you up for now. Thanks for caring! :)
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Output from `nomad version`: 1.5.6
Operating system and Environment details
amazonlinux
Issue
I started a job that had attached CSI volumes. It started successfully, did its thing for a long while, and generally was OK. I started testing our process to update the Nomad client host machines, which is roughly: double the number of client nodes, lock the old nodes, `nomad job restart` the relevant jobs, then drain and shut down the old nodes. This failed at the restart point because the CSI volumes were never (correctly?) released and made available to the new nodes.
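(A hedged sketch of that rotation workflow, using placeholder job and node IDs; bringing up the replacement clients themselves happens outside Nomad, e.g. by resizing an ASG.)

```shell
# 1. Bring up the new clients, then mark the old clients ineligible so nothing
#    new lands on them.
nomad node eligibility -disable <old-node-id>

# 2. Reschedule the stateful jobs so their allocations (and CSI volumes) move.
nomad job restart -reschedule elasticsearch

# 3. Drain and shut down the old clients once everything is healthy.
nomad node drain -enable <old-node-id>
```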
From the original host's /var/log/messages:
So then I tried a variety of ways of forcibly detaching the volume:
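(The exact commands tried aren't shown above; a typical forced-detach attempt might look like the following, with placeholder volume and node IDs.)

```shell
# Ask the controller plugin to detach the volume from the old client.
nomad volume detach <volume-id> <old-node-id>

# If the claim is stuck, deregister the volume as a last resort and re-register it.
nomad volume deregister -force <volume-id>
```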
Along the way, after trying to start the job again, one of the new hosts looks like it's allocating a container:

But this is not remotely accurate. Docker isn't doing anything, there's seemingly no active process running on the host, or even attempting to be run. Allocation logs aren't showing movement.
So... then I drained that node, and when that didn't clear the pending allocation, I force drained it, and it's still there. It's a liar because it says it's done and it's not:
I've filed a similar bug before, which claimed to be a UI issue. It seems to me there's a missing piece of logic somewhere that asserts there are in fact no allocations running before a drain is considered done. This seems like more than a UI bug.
Reproduction steps
Expected Result
Actual Result
Job file (if appropriate)