Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add cancellable context in ssh CLI #2971

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

LiamWalsh98
Copy link

When using the RunCommand function, there is a case where if an SSH connection is stopped from the server side without closing the connection, the program calling it will halt and be unable to resume. This may occur when a router is restarted, after which the severed ssh session will not be reestablished.

To illustrate, this line will cause deadlock

cliClient.RunCommand(context.Background(), "run killall -STOP sshd")

note: this will also block incoming ssh until killall -CONT sshd is executed. For testing, try instead run killall -STOP sshd; sleep 10s; killall -CONT sshd

with the proposed changes, it can be handled using a context with a timeout/deadline (or otherwise cancelled)

ctx, cancel := context.WithTimeout(context.Background(), time.Second*5)
defer cancel()
cliClient.RunCommand(ctx, "run killall -STOP sshd")

Have also added the sshConfig timeout to DialCLI's ClientConfig.

@LiamWalsh98 LiamWalsh98 requested review from a team as code owners May 8, 2024 18:11
@OpenConfigBot
Copy link

OpenConfigBot commented May 8, 2024

Pull Request Functional Test Report for #2971 / 0570dd7

No tests identified for validation.

Help

@kjahed kjahed added the cisco-pr label May 8, 2024
@coveralls
Copy link

Pull Request Test Coverage Report for Build 9006523799

Details

  • 18 of 25 (72.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.1%) to 55.973%

Changes Missing Coverage Covered Lines Changed/Added Lines %
topologies/binding/binding.go 0 1 0.0%
topologies/binding/cli.go 18 24 75.0%
Totals Coverage Status
Change from base Build 9004753373: 0.1%
Covered Lines: 1996
Relevant Lines: 3566

💛 - Coveralls

@singh-prem
Copy link
Contributor

Hi @dplore , could someone take a look at this PR

@dplore
Copy link
Member

dplore commented Jul 24, 2024

Hi @ram-mac please review, thank you.

Copy link
Member

@dplore dplore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, adding the timeout is good. I am not sure why the goroutine is needed though? If the DUT ssh session dies, the timeout should handle closing the connection with a timeout error. A test that triggers a reboot should expect this to occur and handle the timeout.

The use case of cliClient.RunCommand(context.Background(), "run killall -STOP sshd") I think it not something that should be handled in the binding except to have a timeout error returned.

I suggest only adding the timeout and not changing the code for dialCLI.

err error
})

go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why introduce a goroutine? could this introduce flakiness? Wouldn't it be more simple to just keep this in the main process/thread as it was before?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should clarify that the proposed change in DialCLI is unrelated to the main issue but seemed close enough in spirit to tack onto this PR. DialCLI as-is ignores the timeout value from the binding file (which is handled in the crypto package).

I can remove it from this PR but it seemed like a relevant enough custodial change that I ended up including it.


The issue seems to be the edge case where the ssh session connected to the DUT doesn't die, but hangs indefinitely. The killall -STOP sshd is just pausing the DUT's sshd in order to repro the client's behaviour.

A test that triggers a reboot should expect this to occur and handle the timeout.

I agree that a test expecting an unresponsive sshd would need to handle the timeout. To me this means providing a context with a deadline to whatever request is being made, and deferring a cancellation in the thread making the call. Ideally this would mean passing down the ctx all the way to whatever dependency actually makes the request, however it is not forwarded further down the call stack beyond RunCommand. This is the reason I would like to handle it here. (i.e. sess.CombinedOutput doesn't accept context, and this is the function that hangs indefinitely when the DUT's sshd is unresponsive). The main thread will wait for the first of either: context timeout or CombinedOutput to return.

As for handling it without a goroutine, I'm not too sure how it would be implemented. Maybe I am overlooking something?

Thanks for taking the time to review!

@coveralls
Copy link

Pull Request Test Coverage Report for Build 12002997967

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 18 of 25 (72.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.1%) to 55.383%

Changes Missing Coverage Covered Lines Changed/Added Lines %
topologies/binding/binding.go 0 1 0.0%
topologies/binding/cli.go 18 24 75.0%
Totals Coverage Status
Change from base Build 12000810336: 0.1%
Covered Lines: 1996
Relevant Lines: 3604

💛 - Coveralls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants