pdsh spins after all children are done #85
Comments
I don't think this is a known issue. Is the I haven't looked at this code in a while, so I'll have to peek and see how and when the ssh module calls waitpid on its children. What surprises me is that the dsh threads are still active, as I would think the fds open to the ssh processes would be closed when the processes exit.
Something to look at would be Also, does this reproduce when you don't run
Yes.
Was the previous information useful in any way? Or is there anything else I can do to gather more information?
Unfortunately I can't figure out why the socket pair for stdout/err would still be open when the ssh processes have exited, unless there is a bug where the ssh side of the socketpairs is not closed in pdsh after fork... It might help to try to strace a pdsh process that is in this state, to see if it is blocked or continuously waking up from
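To illustrate the suspicion above, here is a minimal Python sketch (not pdsh's C code) of why a parent that forgets to close the child's end of a socketpair after fork never sees EOF, even once the child is gone. The function name `eof_seen_after_child_exit` and the timeout values are made up for this illustration.

```python
import os
import select
import socket


def eof_seen_after_child_exit(close_child_end_in_parent, timeout=1.0):
    """Fork a child that writes to its end of a socketpair and exits.

    Returns True if the parent observes EOF on its own end within
    `timeout` seconds of silence, False otherwise.
    (Hypothetical helper, for illustration only.)
    """
    parent_end, child_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: keep only its own end, write, and exit right away.
        parent_end.close()
        child_end.sendall(b"hello\n")
        os._exit(0)

    if close_child_end_in_parent:
        # The step suspected to be missing: drop the parent's copy
        # of the child's end so the child holds the last reference.
        child_end.close()

    os.waitpid(pid, 0)  # the child has definitely exited at this point

    saw_eof = False
    while True:
        ready, _, _ = select.select([parent_end], [], [], timeout)
        if not ready:
            break  # no data and no EOF: a blocking reader would hang here
        if parent_end.recv(4096) == b"":
            saw_eof = True
            break

    parent_end.close()
    child_end.close()  # harmless if already closed
    return saw_eof
```

With the child's end closed in the parent, the reader gets EOF as soon as the child exits. With it left open, the parent itself keeps the socket alive, so the reader only ever times out.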
According to the backtrace pasted earlier, the dsh() is doing:
For good measure:
Yes, I'm not sure I have any clues as to what is wrong here... Does the problem reproduce with the simplest test case, using the "exec" rcmd type to run something instead of "ssh"? (You might have to make some fake workload.)
So it's interesting. Reducing the remote command down to a simple
and
I've still got this running in case there is something we can dig out of it.
Ah, good reproducer, thanks. Since it happens only occasionally it must be a race of some sort, which means it might be a little difficult to get a solution quickly. In the output, did pdsh receive and print the "hello" from foo-55?
BTW, does something similar happen if you replace
I think the
Yeah. Killing that
I think ultimately this problem has to do with ssh multiplexing (i.e. using Which is quite a pity since multiplexing makes the per-connection handshake quite a bit quicker.
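One general way a multiplexing setup could produce this symptom, sketched in Python: if the process being waited on leaves behind a long-lived background process that inherited its stdout/stderr descriptors, the reader sees the child exit but gets no EOF until the background process also goes away. That ssh's shared-connection process behaves this way here is an assumption, not something confirmed in this thread; the names `eof_within` and `lingering_fd_demo` are made up for the illustration.

```python
import os
import select
import time


def eof_within(fd, seconds):
    """Return True if EOF is read from fd within `seconds`, else False."""
    deadline = time.monotonic() + seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        ready, _, _ = select.select([fd], [], [], remaining)
        if ready and os.read(fd, 4096) == b"":
            return True


def lingering_fd_demo(linger_seconds=1.0, poll_seconds=0.2):
    """Child exits at once; a grandchild keeps the pipe's write end open."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        if os.fork() == 0:
            # Grandchild: stand-in for a persistent background process
            # that inherited the child's output descriptor (assumption).
            time.sleep(linger_seconds)
            os._exit(0)
        os.write(write_fd, b"hello\n")
        os._exit(0)

    os.close(write_fd)
    os.waitpid(pid, 0)  # the direct child is reaped here

    eof_before = eof_within(read_fd, poll_seconds)         # grandchild alive
    eof_after = eof_within(read_fd, linger_seconds + 1.0)  # grandchild gone
    os.close(read_fd)
    return eof_before, eof_after
```

In this sketch the direct child is already reaped when the first check runs, yet EOF only arrives after the grandchild exits, which matches the "children are done but the reader is still waiting" pattern described in this issue.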
Interesting, and thanks for running that down! I can't think of how the ssh process stdout/err fds stay open even after the process exits. Even with connection sharing this scheme should work, otherwise something like Does the remote sshd go away if you kill pdsh and let init reap the defunct ssh processes? I wonder if a kludge could be written to work around this case which would allow sshcmd dsh threads to immediately reap commands when they exit (if calling
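A rough Python sketch of the kind of workaround floated above: rather than waiting only for EOF, the reader polls with a timeout and checks with a non-blocking `waitpid` whether the command has already exited, then drains whatever output is still buffered and stops. This is not pdsh code; `read_until_exit`, `spawn_with_lingerer`, and the timeouts are hypothetical.

```python
import os
import select
import time


def spawn_with_lingerer(linger_seconds):
    """Fork a child that prints and exits while a grandchild holds the pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        if os.fork() == 0:
            time.sleep(linger_seconds)  # keeps write_fd open, delaying EOF
            os._exit(0)
        os.write(write_fd, b"hello\n")
        os._exit(0)
    os.close(write_fd)
    return pid, read_fd


def read_until_exit(pid, fd, poll_timeout=0.1):
    """Read fd until EOF or until pid has exited, whichever comes first."""
    chunks = []
    exited = False
    status = None
    while True:
        # Once the child is known to be gone, only drain buffered data.
        timeout = 0 if exited else poll_timeout
        ready, _, _ = select.select([fd], [], [], timeout)
        if ready:
            data = os.read(fd, 4096)
            if data == b"":
                break  # ordinary EOF
            chunks.append(data)
        elif exited:
            break  # child reaped and nothing left to read
        else:
            reaped, status = os.waitpid(pid, os.WNOHANG)
            exited = reaped == pid
    if not exited:
        _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
    else:
        code = -os.WTERMSIG(status)
    return b"".join(chunks), code
```

The trade-off in this design is that any output written by a lingering descendant after the command itself has exited is dropped, which seems acceptable if the only thing left behind is a connection-sharing helper.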
I'm frequently seeing this with 2.31:
with pdsh doing this at the above time:
Clearly once the remote ssh's are done, pdsh is not noticing this and reaping the children and exiting correctly.
Known issue, or something more I can do to debug?