Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rclone hangs after completion #45

Closed
slifty opened this issue Nov 10, 2022 · 19 comments · Fixed by #74
Closed

rclone hangs after completion #45

slifty opened this issue Nov 10, 2022 · 19 comments · Fixed by #74

Comments

@slifty
Copy link
Contributor

slifty commented Nov 10, 2022

I've opened an issue on the rclone forums about this to get insight there, but after completing a copy command rclone seems to sit there with open processes. I'm betting there's something about our protocol implementation that is causing this issue. Maybe we don't terminate a given connection properly somehow after a file is completely copied.

@slifty
Copy link
Contributor Author

slifty commented Jan 25, 2023

I'm having trouble figuring out how to reproduce this issue -- what hangs for @fenn-cs isn't hanging for me...

I definitely experienced this though.

I was thinking maybe it had to do with the way we return EOF on file reads if the offset is farther than the end of the file, but (1) this is documented as the expected behavior of the EOF status and (2) that same file doesn't result in a hang when I upload it to my own archive.

@kfogel
Copy link
Contributor

kfogel commented Feb 1, 2023

Reproduced with my own "kfogel OTS test" (13539) archive using rclone built from latest source, FWIW. The rclone tree I built was at commit 37623732c6e (current tip of master branch):

$ ~/private/work/ots/clients/permanent/rclone/r/rclone/rclone --version
rclone v1.62.0-DEV
- os/version: debian bookworm/sid (64 bit)
- os/kernel: 6.1.0-1-amd64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.19.5
- go/linking: dynamic
- go/tags: none

Command was:

$ ~/private/work/ots/clients/permanent/rclone/r/rclone/rclone copy permanent-prod:/ ./my-permanent-data

@kfogel
Copy link
Contributor

kfogel commented Feb 2, 2023

FWIW, I put some debugging statements into the latest upstream rclone source and built it, to see exactly where the hang is happening in rclone. Here's the diff, followed by a session transcript.

Diff (against commit 37623732c6e, which was HEAD of master as of yesterday):

$ pwd
/home/kfogel/private/work/ots/clients/permanent/rclone/r/rclone
$ git remote show origin | grep "Fetch URL"
  Fetch URL: [email protected]:rclone/rclone.git
$ git diff
diff --git backend/sftp/sftp.go backend/sftp/sftp.go
index c998180c7..586a74798 100644
--- backend/sftp/sftp.go
+++ backend/sftp/sftp.go
@@ -697,12 +697,20 @@ func (f *Fs) drainPool(ctx context.Context) (err error) {
 		fs.Debugf(f, "Closing %d unused connections", len(f.pool))
 	}
 	for i, c := range f.pool {
+		fs.Debugf(f, "KFF: First we check if connection %d is already closed, by trying to close it.", i)
 		if cErr := c.closed(); cErr == nil {
+                        fs.Debugf(f, "KFF: No error from c.closed() means connection %d was still open.", i)
+                        fs.Debugf(f, "KFF: So now we try c.close() on connection %d.", i)
 			cErr = c.close()
 			if cErr != nil {
+                                fs.Debugf(f, "KFF: Got error from c.close() on connection %d.  cErr: %v", i, cErr)
 				err = cErr
-			}
-		}
+			} else {
+                                fs.Debugf(f, "KFF: Got no error from c.close() on connection %d", i)
+                        }
+		} else {
+                        fs.Debugf(f, "KFF: Got error from c.closed(), meaning connection %d must have already been closed.  cErr: %v", i, cErr)
+                }
 		f.pool[i] = nil
 	}
 	f.pool = nil
$ 

Here's the very verbose output of a run -- the interesting stuff is at the end of the transcript:

$ pwd
/home/kfogel/private/work/ots/clients/permanent/rclone/clones
$ /home/kfogel/private/work/ots/clients/permanent/rclone/r/rclone/rclone --version
rclone v1.62.0-DEV
- os/version: debian bookworm/sid (64 bit)
- os/kernel: 6.1.0-1-amd64 (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.19.5
- go/linking: dynamic
- go/tags: none
$ rm -rf my-permanent-data/; /home/kfogel/private/work/ots/clients/permanent/rclone/r/rclone/rclone copy --create-empty-src-dirs -vv permanent-prod:/ my-permanent-data
<7>DEBUG : rclone: Version "v1.62.0-DEV" starting with parameters ["/home/kfogel/private/work/ots/clients/permanent/rclone/r/rclone/rclone" "copy" "--create-empty-src-dirs" "-vv" "permanent-prod:/" "my-permanent-data"]
<7>DEBUG : rclone: systemd logging support activated
<7>DEBUG : Creating backend with remote "permanent-prod:/"
<7>DEBUG : Using config file from "/home/kfogel/.config/rclone/rclone.conf"
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Keyboard interactive auth requested
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: New connection 192.168.1.79:48218->35.82.80.74:22 to "SSH-2.0-ssh2js1.11.0"
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Shell type "unix" from config
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Using root directory "/"
<7>DEBUG : Creating backend with remote "my-permanent-data"
<7>DEBUG : fs cache: renaming cache item "my-permanent-data" to be canonical "/home/kfogel/private/work/ots/clients/permanent/rclone/clones/my-permanent-data"
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Keyboard interactive auth requested
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Keyboard interactive auth requested
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: New connection 192.168.1.79:59644->35.82.80.74:22 to "SSH-2.0-ssh2js1.11.0"
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: New connection 192.168.1.79:59630->35.82.80.74:22 to "SSH-2.0-ssh2js1.11.0"
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Keyboard interactive auth requested
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: New connection 192.168.1.79:59660->35.82.80.74:22 to "SSH-2.0-ssh2js1.11.0"
<6>INFO  : archives/kfogel OTS test (13539)/My Files/Screenshot from 2022-07-12 14-17-51.png: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/dennys-fake-maple-syrup-2022-05-11.jpg: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/tex-kerning.png: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/worker-digging.png: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/file-ending-with-dot: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/GNOME Workspace Switcher bug/gnome-3-shell-42.0-workspace-switcher-bottom-indicator-ws1-to-ws2-2022-04-10.png: Copied (new)
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Keyboard interactive auth requested
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Keyboard interactive auth requested
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Keyboard interactive auth requested
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: New connection 192.168.1.79:43024->35.82.80.74:22 to "SSH-2.0-ssh2js1.11.0"
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: New connection 192.168.1.79:43006->35.82.80.74:22 to "SSH-2.0-ssh2js1.11.0"
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: New connection 192.168.1.79:43018->35.82.80.74:22 to "SSH-2.0-ssh2js1.11.0"
<6>INFO  : archives/kfogel OTS test (13539)/My Files/GNOME Workspace Switcher bug/This subfolder is just to, you know, test subfolders/weird-whiskey-endorsement.jpg: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/file-with-.-dot-in-middle-and-at-end: Copied (new)
<7>DEBUG : Local file system at /home/kfogel/private/work/ots/clients/permanent/rclone/clones/my-permanent-data: Waiting for checks to finish
<7>DEBUG : Local file system at /home/kfogel/private/work/ots/clients/permanent/rclone/clones/my-permanent-data: Waiting for transfers to finish
<6>INFO  : archives/kfogel OTS test (13539)/My Files/GNOME Workspace Switcher bug/gnome-3-shell-42.0-workspace-switcher-bottom-indicator-ws2-to-ws1-2022-04-10.png: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/Cables I have known/50-foot-VGA-cable.jpg: Copied (new)
<7>DEBUG : archives/kfogel OTS test (13539)/Public: Making directory
<7>DEBUG : archives/kfogel OTS test (13539)/My Files/GNOME Workspace Switcher bug/This is an empty subfolder: Making directory
<7>DEBUG : archives/kfogel OTS test (13539)/Apps: Making directory
<7>DEBUG : Local file system at /home/kfogel/private/work/ots/clients/permanent/rclone/clones/my-permanent-data: copied 3 directories
<6>INFO  : 
Transferred:   	   13.413 MiB / 13.413 MiB, 100%, 1.466 MiB/s, ETA 0s
Transferred:           10 / 10, 100%
Elapsed time:        21.8s

<7>DEBUG : 66 go routines active
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Closing 7 unused connections
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: KFF: First we check if connection 0 is already closed, by trying to close it.
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: KFF: No error from c.closed() means connection 0 was still open.
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: KFF: So now we try c.close() on connection 0.

 #####   It's been hanging here for a long time; eventually I'll interrupt & kill it.  #####

@kfogel
Copy link
Contributor

kfogel commented Feb 2, 2023

Also, I don't know if this matters, but on one of my attempts, after it hung for a while in the traditional way, it suddenly errored out on its own like this:

Transferred:       13.413 MiB / 13.413 MiB, 100%, 544.749 KiB/s, ETA 0s
Transferred:            9 / 9, 100%
Elapsed time:        41.8s

Failed to copy: close tcp 192.168.1.79:52466->35.82.80.74:22: use of closed network connection

This was the only time I've ever seen the hang end on its own, and it happened right before the disk-space filling up and resulting in everything not working one-time event that we were all just talking about. So I don't know if the above error was an early symptom of that -- seems unlikely, but who knows? -- or represents something else.

If it happens again I'll make a noise here.

@slifty
Copy link
Contributor Author

slifty commented Feb 3, 2023

@kfogel would you be up for seeing how far it gets within https://github.com/rclone/rclone/blob/34333d9fa8cd0c8b9bd37fe853ccc3d319109e6b/backend/sftp/sftp.go#L501-L509

^ understood if you have used up your debug effort credits for this issue -- next week I can spend cycles on trying to reproduce this thing.

@kfogel
Copy link
Contributor

kfogel commented Feb 5, 2023

@slifty Hah! Before reading your message above, I made this branch of rclone, with one commit that puts debugging statements into the exact function you requested :-). (I mean, that was the only plausible debugging path, I suppose, so this isn't really so much of a coincidence.)

Here is my usual run, with that debugging patch in place in rclone:

$ rm -rf my-permanent-data/; /home/kfogel/private/work/ots/clients/permanent/rclone/r/rclone/rclone copy --create-empty-src-dirs -vv permanent-prod:/ my-permanent-data
<7>DEBUG : rclone: Version "v1.62.0-DEV" starting with parameters ["/home/kfogel/private/work/ots/clients/permanent/rclone/r/rclone/rclone" "copy" "--create-empty-src-dirs" "-vv" "permanent-prod:/" "my-permanent-data"]
<7>DEBUG : rclone: systemd logging support activated
<7>DEBUG : Creating backend with remote "permanent-prod:/"
<7>DEBUG : Using config file from "/home/kfogel/.config/rclone/rclone.conf"
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Keyboard interactive auth requested
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: New connection 192.168.1.79:38814->35.82.80.74:22 to "SSH-2.0-ssh2js1.11.0"
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Shell type "unix" from config
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Using root directory "/"
<7>DEBUG : Creating backend with remote "my-permanent-data"
<7>DEBUG : fs cache: renaming cache item "my-permanent-data" to be canonical "/home/kfogel/private/work/ots/clients/permanent/rclone/clones/my-permanent-data"
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Keyboard interactive auth requested
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Keyboard interactive auth requested
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: New connection 192.168.1.79:38816->35.82.80.74:22 to "SSH-2.0-ssh2js1.11.0"
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: New connection 192.168.1.79:38832->35.82.80.74:22 to "SSH-2.0-ssh2js1.11.0"
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Keyboard interactive auth requested
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Keyboard interactive auth requested
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Keyboard interactive auth requested
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: New connection 192.168.1.79:50286->35.82.80.74:22 to "SSH-2.0-ssh2js1.11.0"
<6>INFO  : archives/kfogel OTS test (13539)/My Files/Screenshot from 2022-07-12 14-17-51.png: Copied (new)
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: New connection 192.168.1.79:50298->35.82.80.74:22 to "SSH-2.0-ssh2js1.11.0"
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: New connection 192.168.1.79:50314->35.82.80.74:22 to "SSH-2.0-ssh2js1.11.0"
<6>INFO  : archives/kfogel OTS test (13539)/My Files/tex-kerning.png: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/dennys-fake-maple-syrup-2022-05-11.jpg: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/worker-digging.png: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/GNOME Workspace Switcher bug/gnome-3-shell-42.0-workspace-switcher-bottom-indicator-ws1-to-ws2-2022-04-10.png: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/file-with-.-dot-in-middle-and-at-end: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/GNOME Workspace Switcher bug/gnome-3-shell-42.0-workspace-switcher-bottom-indicator-ws2-to-ws1-2022-04-10.png: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/file-ending-with-dot: Copied (new)
<7>DEBUG : Local file system at /home/kfogel/private/work/ots/clients/permanent/rclone/clones/my-permanent-data: Waiting for checks to finish
<7>DEBUG : Local file system at /home/kfogel/private/work/ots/clients/permanent/rclone/clones/my-permanent-data: Waiting for transfers to finish
<6>INFO  : archives/kfogel OTS test (13539)/My Files/GNOME Workspace Switcher bug/This subfolder is just to, you know, test subfolders/weird-whiskey-endorsement.jpg: Copied (new)
<6>INFO  : archives/kfogel OTS test (13539)/My Files/Cables I have known/50-foot-VGA-cable.jpg: Copied (new)
<7>DEBUG : archives/kfogel OTS test (13539)/Public: Making directory
<7>DEBUG : archives/kfogel OTS test (13539)/My Files/GNOME Workspace Switcher bug/This is an empty subfolder: Making directory
<7>DEBUG : archives/kfogel OTS test (13539)/Apps: Making directory
<7>DEBUG : Local file system at /home/kfogel/private/work/ots/clients/permanent/rclone/clones/my-permanent-data: copied 3 directories
<6>INFO  : 
Transferred:   	   13.413 MiB / 13.413 MiB, 100%, 1.699 MiB/s, ETA 0s
Transferred:           10 / 10, 100%
Elapsed time:        12.3s

<7>DEBUG : 57 go routines active
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: Closing 6 unused connections
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: KFF [in drainPool()]: First we check if connection 0 is already closed, by trying to close it.
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: KFF [in drainPool()]: No error from c.closed() means connection 0 was still open.
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: KFF [in drainPool()]: So now we try c.close() on connection 0.
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: KFF [in close()]: Entered close(c == &{0xc0007cc1e0 0xc0001d4240 0xc00055ccc0})
<7>DEBUG : sftp://[email protected]@sftp.permanent.org:22//: KFF [in close()]: About to call c.sftpClient.Close()

### and here it is hanging as usual ###

So I think this means that next we have to debug into the underlying Go SFTP library, roughly in the vicinity of here?

I'll tentatively hand off to you for that, because ISTR that you had been debugging the Go SFTP library anyway earlier, and so you probably have it built, know how to get rclone to use your source-built library, etc. Let me know if my assumption that this next step is easier for you than for me is wrong, though.

@kfogel
Copy link
Contributor

kfogel commented Feb 5, 2023

@slifty Oh, I forgot! You haven't been able to reproduce the bug, whereas it reproduces reliably for me.

In that case, if you'd like me to debug into the Go SFTP library, let me know. (I might need some tips from you on how to build and link rclone against it, but I wouldn't bug you until I'd tried.)

@slifty
Copy link
Contributor Author

slifty commented Feb 7, 2023

I am able to reproduce now! (I set up rclone inside of the permanent devenv / on a debian box so that's all)

A few pieces of information:

  1. rclone does NOT hang against a normal sftp service (which is to say that a normal service is doing SOMETHING that ours isn't doing. This is not surprising but still worth noting.
  2. I believe that the .close being called is actually this one: https://github.com/pkg/sftp/blob/cf523d99b5d2f41128caaaafc4c908a6a4d2fa0e/conn.go#L33-L37

In particular I think it gets here:

func (c *conn) Close() error {
	c.Lock()
	defer c.Unlock()
	return c.WriteCloser.Close()
}

I haven't confirmed this yet -- it would involve compiling go with debug output somehow, which I'll do if I have to but for now with these clues I suspect something about the Lock / Unlock is causing an issue, which suggests to me something about a previous call hasn't resolved properly. Again this would make sense.

What I'm going to do is go through all of the SFTP event handlers and ensure all of them are correctly handled.

@slifty
Copy link
Contributor Author

slifty commented Feb 7, 2023

Here's the SFTP debug output of a hanging interaction -- the important thing to note is that it appears that every non-INIT request is properly balanced (init is somehting handled by our library and it LOOKS like it handles it properly, but I'll investigate that).

SFTP: Inbound: Received INIT (v3)
SFTP: Inbound: Received STAT (id:1)
SFTP: Outbound: Buffered ATTRS
SFTP: Inbound: Received STAT (id:2)
SFTP: Outbound: Buffered ATTRS
SFTP: Inbound: Received OPENDIR (id:3)
SFTP: Outbound: Buffered HANDLE
SFTP: Inbound: Received READDIR (id:4)
SFTP: Outbound: Buffered NAME
SFTP: Inbound: Received READDIR (id:5)
SFTP: Outbound: Buffered STATUS
SFTP: Inbound: Received CLOSE (id:6)
SFTP: Outbound: Buffered STATUS
SFTP: Inbound: Received STAT (id:7)
SFTP: Outbound: Buffered ATTRS
SFTP: Inbound: Received OPENDIR (id:8)
SFTP: Outbound: Buffered HANDLE
SFTP: Inbound: Received READDIR (id:9)
SFTP: Outbound: Buffered NAME
SFTP: Inbound: Received READDIR (id:10)
SFTP: Outbound: Buffered STATUS
SFTP: Inbound: Received CLOSE (id:11)
SFTP: Outbound: Buffered STATUS
SFTP: Inbound: Received STAT (id:12)
SFTP: Outbound: Buffered ATTRS
SFTP: Inbound: Received OPENDIR (id:13)
SFTP: Outbound: Buffered HANDLE
SFTP: Inbound: Received READDIR (id:14)
SFTP: Outbound: Buffered NAME
SFTP: Inbound: Received READDIR (id:15)
SFTP: Outbound: Buffered STATUS
SFTP: Inbound: Received CLOSE (id:16)
SFTP: Outbound: Buffered STATUS
SFTP: Inbound: Received STAT (id:17)
SFTP: Inbound: Received INIT (v3)
SFTP: Inbound: Received STAT (id:1)
SFTP: Inbound: Received INIT (v3)
SFTP: Inbound: Received STAT (id:1)
SFTP: Outbound: Buffered ATTRS
SFTP: Inbound: Received OPENDIR (id:18)
SFTP: Outbound: Buffered HANDLE
SFTP: Inbound: Received READDIR (id:19)
SFTP: Outbound: Buffered STATUS
SFTP: Inbound: Received CLOSE (id:20)
SFTP: Outbound: Buffered STATUS
SFTP: Outbound: Buffered ATTRS
SFTP: Inbound: Received OPENDIR (id:21)
SFTP: Outbound: Buffered ATTRS
SFTP: Inbound: Received OPENDIR (id:2)
SFTP: Outbound: Buffered HANDLE
SFTP: Inbound: Received READDIR (id:22)
SFTP: Outbound: Buffered STATUS
SFTP: Inbound: Received CLOSE (id:23)
SFTP: Outbound: Buffered STATUS
SFTP: Outbound: Buffered HANDLE
SFTP: Inbound: Received READDIR (id:3)
SFTP: Outbound: Buffered NAME
SFTP: Inbound: Received READDIR (id:4)
SFTP: Outbound: Buffered STATUS
SFTP: Inbound: Received CLOSE (id:5)
SFTP: Outbound: Buffered STATUS
SFTP: Inbound: Received STAT (id:2)
SFTP: Outbound: Buffered ATTRS
SFTP: Inbound: Received OPENDIR (id:24)
SFTP: Outbound: Buffered HANDLE
SFTP: Inbound: Received READDIR (id:25)
SFTP: Outbound: Buffered NAME
SFTP: Inbound: Received READDIR (id:26)
SFTP: Outbound: Buffered STATUS
SFTP: Inbound: Received CLOSE (id:27)
SFTP: Outbound: Buffered STATUS

It isn't yet clear to me why there are two new INIT calls in the middle of the interaction.

That said, the fact that each INBOUND appears to have a corresponding OUTBOUND is what I expect to see.

This also helps us potentially narrow in on the fact that only these calls are being made, lowering the possible footprint of any bug.

@slifty
Copy link
Contributor Author

slifty commented Feb 7, 2023

The extra INIT calls reflect rclone opening additional connections (I don't know if this in itself is an indication of an error or if this is just rclone doing concurrent processing somehow). This in itself should not be a problem, but it's worth spending a moment to think about.

@kfogel
Copy link
Contributor

kfogel commented Feb 7, 2023

nod Nice debugging so far, @slifty! I'm just following along.

@slifty
Copy link
Contributor Author

slifty commented Feb 8, 2023

I added in use of the SSH tooling's logging utilities (at silly level). PR incoming.

In the mean time, here is the diff between the logs from a HANG vs NO HANG:

https://gist.github.com/slifty/23f5158738d90d4a46ab9539e96a63df/revisions?diff=split

There are some minor differences in sequencing of requests (and NO_HANG asks for a REALPATH up front), but the most prominent difference to me is that HANG sends a Inbound: CHANNEL_EOF (r:0) command at the end, while NO HANG doesn't.

I don't know what CHANNEL_EOF is (this is an SSH level command, I believe) but I'm going to learn more and see if possibly the ssh2 library doesn't respond to it correctly.

@slifty
Copy link
Contributor Author

slifty commented Feb 8, 2023

I don't know if this issue in Go SSH is relevant -- probably not, but linking anyway: golang/go#29733

@slifty
Copy link
Contributor Author

slifty commented Feb 8, 2023

SSH spec's instructions on how a CHANNEL_EOF message should be handled: https://www.rfc-editor.org/rfc/rfc4254#section-5.3

   No explicit response is sent to this message.  However, the
   application may send EOF to whatever is at the other end of the
   channel.  Note that the channel remains open after this message, and
   more data may still be sent in the other direction.  This message
   does not consume window space and can be sent even if no window space
   is available.

   When either party wishes to terminate the channel, it sends
   SSH_MSG_CHANNEL_CLOSE.  Upon receiving this message, a party MUST
   send back an SSH_MSG_CHANNEL_CLOSE unless it has already sent this
   message for the channel.  The channel is considered closed for a
   party when it has both sent and received SSH_MSG_CHANNEL_CLOSE, and
   the party may then reuse the channel number.  A party MAY send
   SSH_MSG_CHANNEL_CLOSE without having sent or received
   SSH_MSG_CHANNEL_EOF.

@slifty
Copy link
Contributor Author

slifty commented Feb 8, 2023

Actually, I think the Close is this one: https://github.com/pkg/sftp/blob/cf523d99b5d2f41128caaaafc4c908a6a4d2fa0e/conn.go#L58-L62

In particular... note the .Wait call there. Is it possible it's waiting for the SERVER to close the connection?

Also apparently back in November I posted a bit more debug output in the rclone forums which have a few clues: https://forum.rclone.org/t/rclone-does-not-end-after-finishing-an-sftp-copy/34004

That output includes a reference to:

	github.com/pkg/[email protected]/conn.go:60 +0x28 fp=0x14000dc1890 sp=0x14000dc1870 pc=0x103d53db8
github.com/pkg/sftp.(*clientConn).Close(0x14000dc1908?)

in the stacktrace, which aligns with the thought that it's the newly linked Close being called (that uses Wait)

@slifty
Copy link
Contributor Author

slifty commented Feb 8, 2023

I'm currently trying to figure out where Inbound: CHANNEL_EOF is emitted -- searching for Inbound: in the ssh2 code base finds plenty of static debug strings for various codes, but not that one. Sarching for CHANNEL_EOF in the ssh2 code finds some outbound / client logic... but nothing server side yet. The dig continues.

@slifty
Copy link
Contributor Author

slifty commented Feb 8, 2023

Update: editor was filtering out files (even when I told it not to) so, that's that.

Here's the line: https://github.com/mscdex/ssh2/blob/master/lib/protocol/handlers.misc.js#L904

@slifty
Copy link
Contributor Author

slifty commented Feb 8, 2023

AH HA!

here's the issue: mscdex/ssh2#1111

slifty added a commit that referenced this issue Feb 8, 2023
There is an open PR in the upstream / ssh2 library related to properly
closing SFTP channels when a `CHANNEL_EOF` message is received[1].
Unfortunately that PR has not yet been merged.

Until that happens we need to handle the signal ourselves.
Unfortunately this requires us to access "private" attributes
(JavaScript doesn't support the concept of private attributes, so we are
able to do this...). This is, of course a terrible and horrible idea and
all of our tooling gets very upset about it.  As a result I had to
disable the tooling for the relevant line.

Once the PR is merged in upstream we should upgrade to it and delete the
contents of this commit.

Issue #45 rclone hangs after completion

[1] mscdex/ssh2#1111
@slifty slifty mentioned this issue Feb 8, 2023
slifty added a commit that referenced this issue Feb 8, 2023
There is an open PR in the upstream / ssh2 library related to properly
closing SFTP channels when a `CHANNEL_EOF` message is received[1].
Unfortunately that PR has not yet been merged.

Until that happens we need to handle the signal ourselves.
Unfortunately this requires us to access "private" attributes
(JavaScript doesn't support the concept of private attributes, so we are
able to do this...). This is, of course a terrible and horrible idea and
all of our tooling gets very upset about it.  As a result I had to
disable the tooling for the relevant line.

Once the PR is merged in upstream we should upgrade to it and delete the
contents of this commit.

Issue #45 rclone hangs after completion

[1] mscdex/ssh2#1111
slifty added a commit that referenced this issue Feb 8, 2023
There is an open PR in the upstream / ssh2 library related to properly
closing SFTP channels when a `CHANNEL_EOF` message is received[1].
Unfortunately that PR has not yet been merged.

Until that happens we need to handle the signal ourselves.
Unfortunately this requires us to access "private" attributes
(JavaScript doesn't support the concept of private attributes, so we are
able to do this...). This is, of course a terrible and horrible idea and
all of our tooling gets very upset about it.  As a result I had to
disable the tooling for the relevant line.

Once the PR is merged in upstream we should upgrade to it and delete the
contents of this commit.

Issue #45 rclone hangs after completion

[1] mscdex/ssh2#1111
slifty added a commit that referenced this issue Feb 8, 2023
There is an open PR in the upstream / ssh2 library related to properly
closing SFTP channels when a `CHANNEL_EOF` message is received[1].
Unfortunately that PR has not yet been merged.

Until that happens we need to handle the signal ourselves.
Unfortunately this requires us to access "private" attributes
(JavaScript doesn't support the concept of private attributes, so we are
able to do this...). This is, of course a terrible and horrible idea and
all of our tooling gets very upset about it.  As a result I had to
disable the tooling for the relevant line.

Once the PR is merged in upstream we should upgrade to it and delete the
contents of this commit.

Issue #45 rclone hangs after completion

[1] mscdex/ssh2#1111
@slifty
Copy link
Contributor Author

slifty commented Feb 8, 2023

I've opened a patch above with plenty more detail about what's going on -- ultimately CHANNEL_EOF is not properly handled by the ssh2 library we use so we need to handle it ourselves.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants