-
Notifications
You must be signed in to change notification settings - Fork 152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add cancellable context in ssh CLI #2971
base: main
Are you sure you want to change the base?
Conversation
Pull Request Test Coverage Report for Build 9006523799Details
💛 - Coveralls |
Hi @dplore , could someone take a look at this PR |
Hi @ram-mac please review, thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, adding the timeout is good. I am not sure why the goroutine is needed though? If the DUT ssh session dies, the timeout should handle closing the connection with a timeout error. A test that triggers a reboot should expect this to occur and handle the timeout.
The use case of cliClient.RunCommand(context.Background(), "run killall -STOP sshd")
I think it not something that should be handled in the binding except to have a timeout error returned.
I suggest only adding the timeout and not changing the code for dialCLI.
err error | ||
}) | ||
|
||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why introduce a goroutine? could this introduce flakiness? Wouldn't it be more simple to just keep this in the main process/thread as it was before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I should clarify that the proposed change in DialCLI is unrelated to the main issue but seemed close enough in spirit to tack onto this PR. DialCLI as-is ignores the timeout value from the binding file (which is handled in the crypto package).
I can remove it from this PR but it seemed like a relevant enough custodial change that I ended up including it.
The issue seems to be the edge case where the ssh session connected to the DUT doesn't die, but hangs indefinitely. The killall -STOP sshd
is just pausing the DUT's sshd in order to repro the client's behaviour.
A test that triggers a reboot should expect this to occur and handle the timeout.
I agree that a test expecting an unresponsive sshd would need to handle the timeout. To me this means providing a context with a deadline to whatever request is being made, and deferring a cancellation in the thread making the call. Ideally this would mean passing down the ctx all the way to whatever dependency actually makes the request, however it is not forwarded further down the call stack beyond RunCommand. This is the reason I would like to handle it here. (i.e. sess.CombinedOutput doesn't accept context, and this is the function that hangs indefinitely when the DUT's sshd is unresponsive). The main thread will wait for the first of either: context timeout or CombinedOutput to return.
As for handling it without a goroutine, I'm not too sure how it would be implemented. Maybe I am overlooking something?
Thanks for taking the time to review!
Pull Request Test Coverage Report for Build 12002997967Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
When using the RunCommand function, there is a case where if an SSH connection is stopped from the server side without closing the connection, the program calling it will halt and be unable to resume. This may occur when a router is restarted, after which the severed ssh session will not be reestablished.
To illustrate, this line will cause deadlock
note: this will also block incoming ssh until
killall -CONT sshd
is executed. For testing, try insteadrun killall -STOP sshd; sleep 10s; killall -CONT sshd
with the proposed changes, it can be handled using a context with a timeout/deadline (or otherwise cancelled)
Have also added the sshConfig timeout to DialCLI's ClientConfig.