-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rclone hangs after completion #45
Comments
I'm having trouble figuring out how to reproduce this issue -- what hangs for @fenn-cs isn't hanging for me... I definitely experienced this though. I was thinking maybe it had to do with the way we return EOF on file reads if the offset is farther than the end of the file, but (1) this is documented as the expected behavior of the EOF status and (2) that same file doesn't result in a hang when I upload it to my own archive. |
Reproduced with my own "kfogel OTS test" (13539) archive using rclone built from latest source, FWIW. The rclone tree I built was at commit 37623732c6e (current tip of
Command was:
|
FWIW, I put some debugging statements into the latest upstream rclone source and built it, to see exactly where the hang is happening in rclone. Here's the diff, followed by a session transcript. Diff (against commit 37623732c6e, which was HEAD of
Here's the very verbose output of a run -- the interesting stuff is at the end of the transcript:
|
Also, I don't know if this matters, but on one of my attempts, after it hung for a while in the traditional way, it suddenly errored out on its own like this:
This was the only time I've ever seen the hang end on its own, and it happened right before the disk-space filling up and resulting in everything not working one-time event that we were all just talking about. So I don't know if the above error was an early symptom of that -- seems unlikely, but who knows? -- or represents something else. If it happens again I'll make a noise here. |
@kfogel would you be up for seeing how far it gets within https://github.com/rclone/rclone/blob/34333d9fa8cd0c8b9bd37fe853ccc3d319109e6b/backend/sftp/sftp.go#L501-L509 ^ understood if you have used up your debug effort credits for this issue -- next week I can spend cycles on trying to reproduce this thing. |
@slifty Hah! Before reading your message above, I made this branch of rclone, with one commit that puts debugging statements into the exact function you requested :-). (I mean, that was the only plausible debugging path, I suppose, so this isn't really so much of a coincidence.) Here is my usual run, with that debugging patch in place in rclone:
So I think this means that next we have to debug into the underlying Go SFTP library, roughly in the vicinity of here? I'll tentatively hand off to you for that, because ISTR that you had been debugging the Go SFTP library anyway earlier, and so you probably have it built, know how to get rclone to use your source-built library, etc. Let me know if my assumption that this next step is easier for you than for me is wrong, though. |
@slifty Oh, I forgot! You haven't been able to reproduce the bug, whereas it reproduces reliably for me. In that case, if you'd like me to debug into the Go SFTP library, let me know. (I might need some tips from you on how to build and link rclone against it, but I wouldn't bug you until I'd tried.) |
I am able to reproduce now! (I set up rclone inside of the permanent devenv / on a debian box so that's all) A few pieces of information:
In particular I think it gets here:
I haven't confirmed this yet -- it would involve compiling go with debug output somehow, which I'll do if I have to but for now with these clues I suspect something about the Lock / Unlock is causing an issue, which suggests to me something about a previous call hasn't resolved properly. Again this would make sense. What I'm going to do is go through all of the SFTP event handlers and ensure all of them are correctly handled. |
Here's the SFTP debug output of a hanging interaction -- the important thing to note is that it appears that every non-INIT request is properly balanced (init is somehting handled by our library and it LOOKS like it handles it properly, but I'll investigate that).
It isn't yet clear to me why there are two new INIT calls in the middle of the interaction. That said, the fact that each INBOUND appears to have a corresponding OUTBOUND is what I expect to see. This also helps us potentially narrow in on the fact that only these calls are being made, lowering the possible footprint of any bug. |
The extra INIT calls reflect rclone opening additional connections (I don't know if this in itself is an indication of an error or if this is just rclone doing concurrent processing somehow). This in itself should not be a problem, but it's worth spending a moment to think about. |
nod Nice debugging so far, @slifty! I'm just following along. |
I added in use of the SSH tooling's logging utilities (at In the mean time, here is the diff between the logs from a HANG vs NO HANG: https://gist.github.com/slifty/23f5158738d90d4a46ab9539e96a63df/revisions?diff=split There are some minor differences in sequencing of requests (and NO_HANG asks for a REALPATH up front), but the most prominent difference to me is that HANG sends a I don't know what CHANNEL_EOF is (this is an SSH level command, I believe) but I'm going to learn more and see if possibly the ssh2 library doesn't respond to it correctly. |
I don't know if this issue in Go SSH is relevant -- probably not, but linking anyway: golang/go#29733 |
SSH spec's instructions on how a CHANNEL_EOF message should be handled: https://www.rfc-editor.org/rfc/rfc4254#section-5.3
|
Actually, I think the In particular... note the Also apparently back in November I posted a bit more debug output in the rclone forums which have a few clues: https://forum.rclone.org/t/rclone-does-not-end-after-finishing-an-sftp-copy/34004 That output includes a reference to:
in the stacktrace, which aligns with the thought that it's the newly linked |
I'm currently trying to figure out where |
Update: editor was filtering out files (even when I told it not to) so, that's that. Here's the line: https://github.com/mscdex/ssh2/blob/master/lib/protocol/handlers.misc.js#L904 |
AH HA! here's the issue: mscdex/ssh2#1111 |
There is an open PR in the upstream / ssh2 library related to properly closing SFTP channels when a `CHANNEL_EOF` message is received[1]. Unfortunately that PR has not yet been merged. Until that happens we need to handle the signal ourselves. Unfortunately this requires us to access "private" attributes (JavaScript doesn't support the concept of private attributes, so we are able to do this...). This is, of course a terrible and horrible idea and all of our tooling gets very upset about it. As a result I had to disable the tooling for the relevant line. Once the PR is merged in upstream we should upgrade to it and delete the contents of this commit. Issue #45 rclone hangs after completion [1] mscdex/ssh2#1111
There is an open PR in the upstream / ssh2 library related to properly closing SFTP channels when a `CHANNEL_EOF` message is received[1]. Unfortunately that PR has not yet been merged. Until that happens we need to handle the signal ourselves. Unfortunately this requires us to access "private" attributes (JavaScript doesn't support the concept of private attributes, so we are able to do this...). This is, of course a terrible and horrible idea and all of our tooling gets very upset about it. As a result I had to disable the tooling for the relevant line. Once the PR is merged in upstream we should upgrade to it and delete the contents of this commit. Issue #45 rclone hangs after completion [1] mscdex/ssh2#1111
There is an open PR in the upstream / ssh2 library related to properly closing SFTP channels when a `CHANNEL_EOF` message is received[1]. Unfortunately that PR has not yet been merged. Until that happens we need to handle the signal ourselves. Unfortunately this requires us to access "private" attributes (JavaScript doesn't support the concept of private attributes, so we are able to do this...). This is, of course a terrible and horrible idea and all of our tooling gets very upset about it. As a result I had to disable the tooling for the relevant line. Once the PR is merged in upstream we should upgrade to it and delete the contents of this commit. Issue #45 rclone hangs after completion [1] mscdex/ssh2#1111
There is an open PR in the upstream / ssh2 library related to properly closing SFTP channels when a `CHANNEL_EOF` message is received[1]. Unfortunately that PR has not yet been merged. Until that happens we need to handle the signal ourselves. Unfortunately this requires us to access "private" attributes (JavaScript doesn't support the concept of private attributes, so we are able to do this...). This is, of course a terrible and horrible idea and all of our tooling gets very upset about it. As a result I had to disable the tooling for the relevant line. Once the PR is merged in upstream we should upgrade to it and delete the contents of this commit. Issue #45 rclone hangs after completion [1] mscdex/ssh2#1111
I've opened a patch above with plenty more detail about what's going on -- ultimately CHANNEL_EOF is not properly handled by the ssh2 library we use so we need to handle it ourselves. |
I've opened an issue on the rclone forums about this to get insight there, but after completing a
copy
command rclone seems to sit there with open processes. I'm betting there's something about our protocol implementation that is causing this issue. Maybe we don't terminate a given connection properly somehow after a file is completely copied.The text was updated successfully, but these errors were encountered: