[#40] Allow SSHKit.run executing on hosts in parallel #131

seungjin · 2018-08-30T14:21:16Z

Allow SSHKit.run executing on hosts in parallel

Description

SSHKit.run can do its job in sequential and parallel.
SSHKit.run(context, cmd, :sequential, 1000) runs cmd sequentially with its timeout 1000ms.
SSHKit.run(context, cmd, :parallel, 1000) runs cmd sequentially with its each cmd's timeout 1000ms.

Motivation and Context

Please refer to #40

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist

I have updated the documentation accordingly.
I have added tests to cover my changes (with unit and/or functional tests).
I have added a note to CHANGELOG.md if necessary (in the ## master section).

@pmeinhardt

Advice from @pmeinhardt, bitcrowd#40 (comment)

Running 'sleep 2' cmd in parallel (to 2 connections) and let's call it a pass when it returns its run time less than 4 sec. Sequential runs more than 4 sec.

seungjin · 2018-09-03T17:43:21Z

I have added tests to cover my changes (with unit and/or functional tests).

functional test added:
0d239f6
7c84642

testing 2 ssh sessions with 'sleep 2' commands and checking the returns less than 4 seconds.

pmeinhardt

@seungjin thank you so much for your contribution 🙂 👍

I've had a little time to take a closer look and added a few comments.
If you could add the corresponding changelog entries and documentation that'd be amazing.

We're more than happy to assist. If you need anything, please don't hesitate to let us know.

Cheers

pmeinhardt · 2018-09-10T15:53:24Z

test/sshkit_functional_test.exs

+      assert String.trim(stdout(output1)) == host.options[:user]
+      assert String.trim(stdout(output2)) == host.options[:user]
+
+    end


🤔 For the test, I would rather start 2 hosts. I would expect this to be the more common use case for the :parallel feature.

You can start another host by adding another config entry to the :boot tag. Maybe with a different user, so the output is actually different.

A small note on formatting:

Please remove the blank line (line 33)

Throughout the code base we use spaces after the comma between list items (on line 23, seems like Credo does not catch that)

Thank you 🙂

Ideally, I would like to avoid relying on time in the tests since that may fail for a bunch of reasons and make the test suite flaky. I wonder if this could be tested in a different way? 🤔

@tessi do you have an idea?

or @klappradla?

We might need two tests :)

to test parallel execution on two different hosts (as @pmeinhardt said) which is not required to test the actualy parallelism but instead focuses on correct execution (so we see when things break for real users in a real'ish scenario).

test parallel execution on the same host, in this time also testing that things really run in parallel.

For 2. I agree that the time approach might* work -- but it makes our test suite at least 2 seconds slower. Instead I would suggest to test a side effect on that host which can only happen when two parallel ssh sessions happen. The question is which side effect to test. Possible options are:

monitor currently active ssh shells. The command could monitor active ssh sessions and wait until two sessions are seen (maybe with a timeout to prevent lifelock).

kinda crappy: a command could append numbers from 1 to 1.000.000 into a file. If both ssh sessions do that the file content would contain "unsorted" numbers

if the two commands could be different: one could open a socket/file and read/wait on it. the other could write to that socket/file.

I personally prefer the first option, as it's the least crappy one :)
It seems the who command lists currently logged in users

The who utility displays a list of all users currently logged on, showing for each user the login name, tty name, the date and time of login, and hostname if not local.

We could write (and then execute) a little bash script which runs who | wc -l until two users are logged in. That bash script could be a while-loop wrapped in the timeout command.

*) might because when tests are run on CI timing is always an issue. Maybe some tests needs longer than others and result in flaky tests. Things will always break in unexpected ways :D

Oh and again (because I forgot to write this before), thanks @seungjin for your contribution 💚 I would love this feature to go in.

lessless · 2019-04-19T11:14:02Z

Shouldn't such kind of a parallelization be a client consideration?

pmeinhardt · 2019-04-21T19:10:46Z

Hi @lessless, thanks a lot for bringing some life to this PR again 🌱💧

I think this is a very valid question that we have also asked ourselves. On the one hand, it's of course rather simple to achieve parallel execution even if the library does not provide it. On the other hand, it looks like such a common use case that it seems nice if we can avoid everyone having to re-roll their own copy of parallel task execution.

We ourselves usually deploy to at least two machines for load-balancing and I assume there's really a lot of people out there who'll have similar needs. I personally think this (optimizing for the common use case) is a good reason for having it in the library.

That said, we're super happy to hear and consider different opinions anytime though so please don't hesitate to share your thoughts! 💚

pmeinhardt · 2019-04-21T19:27:14Z

At the same time, sorry @seungjin for the long silence 🙇

I had another look at how we could verify that connections are established in parallel. Here's a thought:

Start 2 different sshd Docker containers via the @boot tag
Run sth. like sleep 5 on both of these hosts via SSH using :parallel mode (the command should just keep the connection open long enough. but we'll kill it as soon as we detect success)
In parallel, run docker top [CONTAINER] for both hosts until we see a connection on both of them at the same time
Once we do see a session on both hosts, kill the containers, (prematurely) terminating the sleep so we're not wasting any time.

Below is a rough outline visualizing what this looks like:

The left pane is running the SSH server. The top-right, just sets up a test user. Both of these steps are taken care of by the @boot tag. The pane in the middle was used to open an SSH connection. The bottom-right pane shows the docker top output with the session active first, then without the SSH connection. We'd basically be looking for something like the first output on two containers "at the same time" to have a successful test.

Are you by any chance still up for working on this?

Again, sorry for the long radio silence and thanks a lot for your contribution ❤️

lessless · 2019-04-22T07:31:51Z

In case you guys will decide to move on with this one concern is to properly propagate is exit codes and associated outputs in case of error back to the caller. I’m not sure if I’m not being late here because I didn’t look at the code yet - we are just exploring bout options with SSHKit for now. One of the drawbacks of doing things in parallel is more involving retry and recovery logic. Can you please shed some lights on how are dealing with that or what available options will be there on a table after merging this PR? Many thanks!

…

On 21 Apr 2019, at 21:27, Paul Meinhardt ***@***.***> wrote: At the same time, sorry @seungjin for the long silence 🙇 I had another look at how we could verify that connections are established in parallel. Here's a thought: Start 2 different sshd Docker containers via the @boot tag Run sth. like sleep 5 on both of these hosts via SSH using :parallel mode (the command should just keep the connection open long enough. but we'll kill it as soon as we detect success) In parallel, run docker top [CONTAINER] for both hosts until we see a connection on both of them at the same time Once we do see a session on both hosts, kill the containers, (prematurely) terminating the sleep so we're not wasting any time. Below is a rough outline visualizing what this looks like: The left pane is running the SSH server. The top-right, just sets up a test user. Both of these steps are taken care of by the @boot tag. The pane in the middle was used to open an SSH connection. The bottom-right pane shows the docker top output with the session active first, then without the SSH connection. We'd basically be looking for something like the first output on two containers "at the same time" to have a successful test. Are you by any chance still up for working on this? Again, sorry for the long radio silence and thanks a lot for your contribution ❤️ — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

pmeinhardt · 2019-04-25T06:58:20Z

Hey @lessless, there's currently no plan to integrate retry into the package – it seems to me like this would be highly application/use-case dependent. Any SSHKit.run and SSHKit.upload/download calls return a list of statuses and stdout/stderr aggregates , one per host. So to retry/recover after an error, these are the place to look 🙂

If you have additional questions, maybe we should move these into a separate issue. I'd definitely be happy to hear more about your use case. We're still working on improving SSHKit's interface and design so this kind of feedback is incredibly helpful ✌️

lessless · 2019-04-28T10:09:25Z

@pmeinhardt it will be amazing if we an bounce off few ideas but I'm not sure if it's a right thing to open a new issue for it.

My use case in a nutshell is parallel try/ rescue:

run commands in parallel on multiple nodes
catch exit statuses
revert changes if needed

Here is more detailed description of the idea https://elixirforum.com/t/call-for-feedback-multi-node-deployment-tool-edeliver-2-0-design/21747. It'll be great if you can take look at it and let us know how we can utilize SSHKit to implement that design or what are possible riffs and obstacles to avoid.

I'm also @lessless on Elixir Slack if anything.

Cheers!

seungjin added 3 commits August 27, 2018 22:08

Allowing executing on hosts asynchronously.

3806c94

Internal run function is now shared with sequential and parallel runs.

9a72365

Advice from @pmeinhardt, bitcrowd#40 (comment)

timeout option for seq and parallel runs.

a5a9d21

seungjin changed the title ~~https://github.com/bitcrowd/sshkit.ex/issues/40 Allow SSHKit.run executing on hosts in parallel~~ Allow SSHKit.run executing on hosts in parallel Aug 30, 2018

seungjin changed the title ~~Allow SSHKit.run executing on hosts in parallel~~ [#40] Allow SSHKit.run executing on hosts in parallel Aug 30, 2018

seungjin added 3 commits August 31, 2018 10:18

No trailing white-space at the end of a line.

95ed0f7

test for parallel SSHKit.run execution with :parallel option

0d239f6

Parallel run test's threshold changed for some slower environment.

7c84642

Running 'sleep 2' cmd in parallel (to 2 connections) and let's call it a pass when it returns its run time less than 4 sec. Sequential runs more than 4 sec.

seungjin closed this Sep 3, 2018

seungjin reopened this Sep 3, 2018

pmeinhardt reviewed Sep 10, 2018

View reviewed changes

pmeinhardt changed the base branch from master to main December 12, 2020 01:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[#40] Allow SSHKit.run executing on hosts in parallel #131

[#40] Allow SSHKit.run executing on hosts in parallel #131

seungjin commented Aug 30, 2018 •

edited

Loading

seungjin commented Sep 3, 2018 •

edited

Loading

pmeinhardt left a comment

pmeinhardt Sep 10, 2018

pmeinhardt Sep 10, 2018

pmeinhardt Sep 10, 2018

pmeinhardt Sep 10, 2018

tessi Sep 18, 2018 •

edited

Loading

tessi Oct 2, 2018

lessless commented Apr 19, 2019

pmeinhardt commented Apr 21, 2019

pmeinhardt commented Apr 21, 2019

lessless commented Apr 22, 2019 via email

pmeinhardt commented Apr 25, 2019

lessless commented Apr 28, 2019 •

edited

Loading

[#40] Allow SSHKit.run executing on hosts in parallel #131

Are you sure you want to change the base?

[#40] Allow SSHKit.run executing on hosts in parallel #131

Conversation

seungjin commented Aug 30, 2018 • edited Loading

Description

Motivation and Context

Types of changes

Checklist

seungjin commented Sep 3, 2018 • edited Loading

pmeinhardt left a comment

Choose a reason for hiding this comment

pmeinhardt Sep 10, 2018

Choose a reason for hiding this comment

pmeinhardt Sep 10, 2018

Choose a reason for hiding this comment

pmeinhardt Sep 10, 2018

Choose a reason for hiding this comment

pmeinhardt Sep 10, 2018

Choose a reason for hiding this comment

tessi Sep 18, 2018 • edited Loading

Choose a reason for hiding this comment

tessi Oct 2, 2018

Choose a reason for hiding this comment

lessless commented Apr 19, 2019

pmeinhardt commented Apr 21, 2019

pmeinhardt commented Apr 21, 2019

lessless commented Apr 22, 2019 via email

pmeinhardt commented Apr 25, 2019

lessless commented Apr 28, 2019 • edited Loading

seungjin commented Aug 30, 2018 •

edited

Loading

seungjin commented Sep 3, 2018 •

edited

Loading

tessi Sep 18, 2018 •

edited

Loading

lessless commented Apr 28, 2019 •

edited

Loading