Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#40] Allow SSHKit.run executing on hosts in parallel #131

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

seungjin
Copy link

@seungjin seungjin commented Aug 30, 2018

Allow SSHKit.run executing on hosts in parallel

Description

SSHKit.run can do its job in sequential and parallel.
SSHKit.run(context, cmd, :sequential, 1000) runs cmd sequentially with its timeout 1000ms.
SSHKit.run(context, cmd, :parallel, 1000) runs cmd sequentially with its each cmd's timeout 1000ms.

Motivation and Context

Please refer to #40

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have updated the documentation accordingly.
  • I have added tests to cover my changes (with unit and/or functional tests).
  • I have added a note to CHANGELOG.md if necessary (in the ## master section).

@seungjin seungjin changed the title https://github.com/bitcrowd/sshkit.ex/issues/40 Allow SSHKit.run executing on hosts in parallel Allow SSHKit.run executing on hosts in parallel Aug 30, 2018
@seungjin seungjin changed the title Allow SSHKit.run executing on hosts in parallel [#40] Allow SSHKit.run executing on hosts in parallel Aug 30, 2018
Running 'sleep 2' cmd in parallel (to 2 connections) and let's call it a pass
when it returns its run time less than 4 sec. Sequential runs more than 4 sec.
@seungjin
Copy link
Author

seungjin commented Sep 3, 2018

  • I have added tests to cover my changes (with unit and/or functional tests).

functional test added:
0d239f6
7c84642

testing 2 ssh sessions with 'sleep 2' commands and checking the returns less than 4 seconds.

@seungjin seungjin closed this Sep 3, 2018
@seungjin seungjin reopened this Sep 3, 2018
Copy link
Contributor

@pmeinhardt pmeinhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seungjin thank you so much for your contribution 🙂 👍

I've had a little time to take a closer look and added a few comments.
If you could add the corresponding changelog entries and documentation that'd be amazing.

We're more than happy to assist. If you need anything, please don't hesitate to let us know.

Cheers

assert String.trim(stdout(output1)) == host.options[:user]
assert String.trim(stdout(output2)) == host.options[:user]

end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 For the test, I would rather start 2 hosts. I would expect this to be the more common use case for the :parallel feature.

You can start another host by adding another config entry to the :boot tag. Maybe with a different user, so the output is actually different.

A small note on formatting:

  • Please remove the blank line (line 33)
  • Throughout the code base we use spaces after the comma between list items (on line 23, seems like Credo does not catch that)

Thank you 🙂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, I would like to avoid relying on time in the tests since that may fail for a bunch of reasons and make the test suite flaky. I wonder if this could be tested in a different way? 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tessi do you have an idea?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@tessi tessi Sep 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need two tests :)

  1. to test parallel execution on two different hosts (as @pmeinhardt said) which is not required to test the actualy parallelism but instead focuses on correct execution (so we see when things break for real users in a real'ish scenario).
  2. test parallel execution on the same host, in this time also testing that things really run in parallel.

For 2. I agree that the time approach might* work -- but it makes our test suite at least 2 seconds slower. Instead I would suggest to test a side effect on that host which can only happen when two parallel ssh sessions happen. The question is which side effect to test. Possible options are:

  • monitor currently active ssh shells. The command could monitor active ssh sessions and wait until two sessions are seen (maybe with a timeout to prevent lifelock).
  • kinda crappy: a command could append numbers from 1 to 1.000.000 into a file. If both ssh sessions do that the file content would contain "unsorted" numbers
  • if the two commands could be different: one could open a socket/file and read/wait on it. the other could write to that socket/file.

I personally prefer the first option, as it's the least crappy one :)
It seems the who command lists currently logged in users

The who utility displays a list of all users currently logged on, showing for each user the login name, tty name, the date and time of login, and hostname if not local.

We could write (and then execute) a little bash script which runs who | wc -l until two users are logged in. That bash script could be a while-loop wrapped in the timeout command.

*) might because when tests are run on CI timing is always an issue. Maybe some tests needs longer than others and result in flaky tests. Things will always break in unexpected ways :D

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and again (because I forgot to write this before), thanks @seungjin for your contribution 💚 I would love this feature to go in.

@lessless
Copy link

Shouldn't such kind of a parallelization be a client consideration?

@pmeinhardt
Copy link
Contributor

Hi @lessless, thanks a lot for bringing some life to this PR again 🌱💧

I think this is a very valid question that we have also asked ourselves. On the one hand, it's of course rather simple to achieve parallel execution even if the library does not provide it. On the other hand, it looks like such a common use case that it seems nice if we can avoid everyone having to re-roll their own copy of parallel task execution.

We ourselves usually deploy to at least two machines for load-balancing and I assume there's really a lot of people out there who'll have similar needs. I personally think this (optimizing for the common use case) is a good reason for having it in the library.

That said, we're super happy to hear and consider different opinions anytime though so please don't hesitate to share your thoughts! 💚

@pmeinhardt
Copy link
Contributor

At the same time, sorry @seungjin for the long silence 🙇

I had another look at how we could verify that connections are established in parallel. Here's a thought:

  1. Start 2 different sshd Docker containers via the @boot tag
  2. Run sth. like sleep 5 on both of these hosts via SSH using :parallel mode (the command should just keep the connection open long enough. but we'll kill it as soon as we detect success)
  3. In parallel, run docker top [CONTAINER] for both hosts until we see a connection on both of them at the same time
  4. Once we do see a session on both hosts, kill the containers, (prematurely) terminating the sleep so we're not wasting any time.

Below is a rough outline visualizing what this looks like:

session

The left pane is running the SSH server. The top-right, just sets up a test user. Both of these steps are taken care of by the @boot tag. The pane in the middle was used to open an SSH connection. The bottom-right pane shows the docker top output with the session active first, then without the SSH connection. We'd basically be looking for something like the first output on two containers "at the same time" to have a successful test.

Are you by any chance still up for working on this?

Again, sorry for the long radio silence and thanks a lot for your contribution ❤️

@lessless
Copy link

lessless commented Apr 22, 2019 via email

@pmeinhardt
Copy link
Contributor

Hey @lessless, there's currently no plan to integrate retry into the package – it seems to me like this would be highly application/use-case dependent. Any SSHKit.run and SSHKit.upload/download calls return a list of statuses and stdout/stderr aggregates , one per host. So to retry/recover after an error, these are the place to look 🙂

If you have additional questions, maybe we should move these into a separate issue. I'd definitely be happy to hear more about your use case. We're still working on improving SSHKit's interface and design so this kind of feedback is incredibly helpful ✌️

@lessless
Copy link

lessless commented Apr 28, 2019

@pmeinhardt it will be amazing if we an bounce off few ideas but I'm not sure if it's a right thing to open a new issue for it.

My use case in a nutshell is parallel try/ rescue:

  • run commands in parallel on multiple nodes
  • catch exit statuses
  • revert changes if needed

Here is more detailed description of the idea https://elixirforum.com/t/call-for-feedback-multi-node-deployment-tool-edeliver-2-0-design/21747. It'll be great if you can take look at it and let us know how we can utilize SSHKit to implement that design or what are possible riffs and obstacles to avoid.

I'm also @lessless on Elixir Slack if anything.

Cheers!

@pmeinhardt pmeinhardt changed the base branch from master to main December 12, 2020 01:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants