-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
process not managed by Nomad killed on restart #23969
Comments
Hi @yigit-erkoca and thanks for raising this issue. Nomad should certainly not be terminating processes it does not control, however, having looked at the client logs nothing stands out initially. I'll mark this for further investigation and add it to our backlog. If you do get any additional information that might be useful, please add it to this issue. |
Hello @jrasell, thank you for your attention! Indeed I'm extending the logs to find out the root cause. I have a setup that I can reproduce this issue quite easily. I'll add more details to the issue and let you know. I appreciate the help. |
It looks like the relevant portion of the logs is the following, which suggests that the spuriously killed process is getting killed while we're trying to restore the process for the previously-running task.
What we see here is that the client starts up, looks in its local state, and tries to restore handles to the tasks that were running before the host shutdown (ref We then call While we have saved the PID of the previously-running task in the local state, I don't see any code path where this PID is actually being made available for the driver to kill. @yigit-erkoca you say you have a reliable reproduction, right? Can you confirm that the PID of the process getting incorrectly killed the same as the PID of the previously-running task or its executor process? |
I did a little more testing and what's particularly weird about all of this is that the executor process is the one that actually sends the kill signal to the child process (or the Windows equivalent which is |
@tgross, thank you for looking into this! We have indeed a quite reliable reproduction environment. We have reproduced it again today to provide you more information with an example. Nomad is installed as windows service, and starts normally, starts all the executors, tasks, etc. All the executors listed below; please note the highlighted instance (which will be the one we will be interested in) After restarting/rebooting nomad will try to restore the tasks, and one of these occasions if there is any other process with the same PID above "5652" in this case notepad.exe gets the "5652". Nomad during the restore action, it will kill the notepad.exe. Which can be seen in the windows event viewer below. I hope this clarifies your question. From this, it looks like the PID is the executor's PID, not the task's PID. If you need any further information, or things that we can try, please let us we can use our reproduction environment. |
Thanks @yigit-erkoca, that's super helpful! That cuts the size of the problem in half and points me towards the go-plugin interface. I'll investigate further and report back. |
I did a little bit of parallel investigation on Linux, and using |
Looks like I found the culprit, and it totally can happen on other platforms. It's this line in go-plugin:
I'm having a chat with colleagues about the best approach to fix this. The obvious fix is to simply delete that line, but I assume there's a reason it exists. 😁 (Edit: this is an extremely old behavior hashicorp/go-plugin#4 which makes it even trickier) |
During reattachment, we look to see if the process corresponding to the stored PID is running. If so, we try to connect to that process. If that fails, we kill the process under the presumption it's not working, and return ErrProcessNotFound. But during reattachment we don't know that the PID we have is still valid. Which means that the process we're trying to attach to may have exited and a different process has spawned with the same PID. This results in some unrelated process getting silently killed. This impacts Nomad when running the `rawexec` or `exec` task drivers, because the Nomad agent spawns an "executor" process via go-plugin to control the workloads, and these executors are left running when Nomad exits. If the executors die in the meantime (or the host is rebooted), then we can potentially kill a random process on the host. Because there's no way for go-plugin to know whether the process is a go-plugin server without connecting, this kill is never really safe. Remove it. Ref: hashicorp/nomad#23969 Ref: https://hashicorp.atlassian.net/browse/NET-11233
Dependency update is here: #24265 |
During allocation recovery, the go-plugin reattachment workflow checks to see if the PID we stored in client state is still running. If so, we try to connect to that process. If that fails, we kill the process under the presumption it's not working. But during reattachment we don't know that the PID we have is still valid. Which means that the process we're trying to attach to may have exited and a different process has spawned with the same PID. This results in some unrelated process getting silently killed. Update go-plugin to 1.6.2, which includes a bug fix so we do not kill the process. Fixes: #23969
@tgross Hello. We are running the fixed version, but it seems our processes are still being killed, this time likely from here. We see "plugin failed to exit gracefully" in nomad logs just as our process ends unexpectedly. We are not able to reliably reproduce the issue at the moment (seems to occur under different circumstances than the original). |
@peter-harmann-tfs just to confirm, you mean an unrelated process is getting killed? That is, one not managed by Nomad? |
@tgross Yes, it is an unrelated process. Maybe to clarify, I am colleague of yigit-erkoca, so I am talking about the same deployment as the original issue. |
Ok, thanks. I'll reopen and get this on the queue for another in-depth investigation. |
Hi, I am colleague of yigit-erkoca and peter-harmann-tfs. We conducted a more detailed investigation regarding this issue, in order to provide steps to reproduce it. We extended the logs in the reattach() and Kill() functions in the client.go file. This allowed us to log the reattach PID and reattach address information for the process that is being terminated. The issue is finally reproduced, and observed that it occurs when the following sequence of events takes place:
I hope the information above is sufficient to identify the root cause of the problem. If more details are needed, we would be happy to provide them. |
Thanks @cenk-saldirim. That scenario is exactly what hashicorp/go-plugin#320 was supposed to have fixed, which is particularly strange. Also, just a heads up that my colleague @mismithhisler is taking over the investigation from our end. |
Thanks for the information @tgross. I believe, the only difference between previous issue and the current one is the reattach address of the executor process which has exited is actively being used by another process. If that reattach address:port is not actively listening, go-plugin does not attempt to kill any process. |
Ah! Sorry, I completely missed that... that complicates things quite a bit! |
Hi @cenk-saldirim, thanks for all the information. Do you have any more logs you could attach here showing this most recent issue? The "plugin failed to exit gracefully" is helpful, but we might be able to narrow this search with more. Also, can you confirm the clients that are having this issue are only running raw_exec tasks? And not running any external task drivers (i.e. Podman etc)? |
@cenk-saldirim Just following up to my previous comment to let you know we can reproduce this and are working on a fix. Thanks again for the info you provided. |
Hi @mismithhisler, thank you very much for your finding and glad to hear that you can reproduce this. I can share the logs we collected while observing this behavior, if still needed. |
I think we should be good on logs for now, thank you though! |
Nomad version
Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85
Operating system and Environment details
Windows 10 Enterprise 2016 LTSB
Issue
Nomad instance is killing another process on the system that is not deployed by nomad.
I see in the application event viewer log the following log from source "Process Exit Monitor"
2024-09-16 23:20:06 The process 'C:\simulator\BoardSimulator.exe' was terminated by the process 'C:\Program Files\Infra\Nomad\nomad.exe' with termination code 1. The creation time for the exiting process was 0x01db07ac4767577e.
Reproduction steps
On system reboot, some applications like simulator executables start by other windows services, as well as nomad instance.
This is a sporadic issue, does not happen at every reboot, but maybe estimated to be once in every ~100.
Expected Result
Nomad not killing/interfering any other process on the system that is not deployed by using nomad jobs.
Actual Result
Nomad is terminating other processes that are running in the system.
Job file (if appropriate)
There is only one job using raw_exec. We are just setting it up.
NomadJob.txt
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
nomad.log
The timestamp of the process termination according to the eventlog: 2024-09-16 23:20:06
The text was updated successfully, but these errors were encountered: