
Better orchestration of integration test runners #4410

Closed
rdner opened this issue Mar 13, 2024 · 8 comments
Labels: enhancement, Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane

Comments

@rdner
Member

rdner commented Mar 13, 2024

Describe the enhancement:

Based on my observations over the last 2 months, the root cause of most of the issues we're experiencing with OGC while running our integration tests is the communication via SSH.

Some of the reports can be found in #4356.

SSH is not very resilient to connection issues and interruptions, and we cannot simply add a retry, because most of the commands we run over SSH are meant to be executed only once.

The goal of this enhancement is to minimize the interactions that go through SSH.

We could implement the following improvements to make our integration tests more stable:

  1. It should be possible to prepare a single script (per OS) with all commands needed for initialization and execution of integration tests on a remote VM. We should run this script only once via SSH instead of sending separate commands (a sketch follows this list). Buildkite does something similar – sends a complex manifest to a remote machine and then runs it there.
  2. VMs that run our integration tests should have access to an S3 bucket or any other artifact storage where they would upload their results/test logs (should be a part of the script mentioned above).
  3. The main script/orchestrator should:
    3.1 watch the artifact storage for the results to appear, without any communication with the VMs.
    3.2 watch the state of the VMs via OGC and fail the tests if one of the machines ends up in a wrong state.
  4. All read-only communication via OGC should have retries.
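To make item 1 concrete, here is a minimal sketch of what "deliver one script and run it once" could look like on the orchestrator side, assuming the runner keeps using golang.org/x/crypto/ssh. The function name, the script handling, and the POSIX-only delivery command are illustrative, not the framework's actual API:

```go
package sshrun

import (
	"bytes"
	"fmt"
	"os"

	"golang.org/x/crypto/ssh"
)

// runInitScript copies a prepared per-OS script to the VM and executes it once,
// so the whole init + test run needs only two SSH round trips instead of one
// round trip per command. (Illustrative; Windows would need a different
// delivery command.)
func runInitScript(client *ssh.Client, localPath, remotePath string) error {
	script, err := os.ReadFile(localPath)
	if err != nil {
		return fmt.Errorf("reading %s: %w", localPath, err)
	}

	// First session: stream the script to the VM and make it executable.
	deliver, err := client.NewSession()
	if err != nil {
		return fmt.Errorf("opening delivery session: %w", err)
	}
	defer deliver.Close()
	deliver.Stdin = bytes.NewReader(script)
	if err := deliver.Run(fmt.Sprintf("cat > %s && chmod +x %s", remotePath, remotePath)); err != nil {
		return fmt.Errorf("delivering script: %w", err)
	}

	// Second session: run the script exactly once; everything else (uploading
	// results, cleanup) happens inside the script on the VM.
	run, err := client.NewSession()
	if err != nil {
		return fmt.Errorf("opening run session: %w", err)
	}
	defer run.Close()
	if out, err := run.CombinedOutput(remotePath); err != nil {
		return fmt.Errorf("running script: %w\n%s", err, out)
	}
	return nil
}
```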

Describe a specific use case for the enhancement or feature:

Our integration tests occasionally fail because of VM orchestration issues or while communicating with the VMs; see the reports collected in #4356.

What is the definition of done?

  • While running integration tests, we use SSH only once, to deliver and run a script
  • All read-only orchestration operations have retries
  • Runners deliver their artifacts directly to artifact storage (a sketch follows below)
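For the last point, the runner-side upload could be as small as the sketch below. It assumes the AWS SDK for Go v2 and a hypothetical runs/<run-id>/<file> key layout; the issue deliberately leaves the backend open (S3 "or any other artifact storage"):

```go
package results

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// uploadResults pushes a results/log file straight from the VM to the bucket,
// so the orchestrator only has to watch the bucket instead of pulling files
// back over SSH. Bucket name and key layout are illustrative.
func uploadResults(ctx context.Context, bucket, runID, path string) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return fmt.Errorf("loading AWS config: %w", err)
	}

	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("opening %s: %w", path, err)
	}
	defer f.Close()

	_, err = s3.NewFromConfig(cfg).PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(fmt.Sprintf("runs/%s/%s", runID, filepath.Base(path))),
		Body:   f,
	})
	if err != nil {
		return fmt.Errorf("uploading %s: %w", path, err)
	}
	return nil
}
```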
@rdner added the enhancement and Team:Elastic-Agent labels Mar 13, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@blakerouse
Contributor

  1. It should be possible to prepare a single script (per OS) with all commands needed for initialization and execution of integration tests on a remote VM. We should run this script only once via SSH instead of sending separate commands. Buildkite does something similar – sends a complex manifest to a remote machine and then runs it there.

Not a fan of using a script. Script programming sucks; we should never switch to using scripts. I would be fine with changing to Go code that is run instead.

  2. VMs that run our integration tests should have access to an S3 bucket or any other artifact storage where they would upload their results/test logs (should be a part of the script mentioned above).

I don't see why we need to add S3 as another dependency. We can add retries on pulling the content.

@leehinman
Contributor

So things like CFEngine, Chef, Puppet, Ansible, and SaltStack are all solutions to this kind of problem. I don't want any of those in our testing framework, but I think it is worth looking into how they solve it. There are lots of subtle edge cases, and I'd rather learn from others than discover all those sharp edges on our own.

@cachedout
Contributor

Drive-by comment: have you tried just tuning SSH to improve reliability? Stuff like turning on multiplexing or tuning ServerAliveInterval on the client side might improve things substantially.
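Since the framework drives SSH from Go rather than the ssh CLI, the rough equivalent of ServerAliveInterval with golang.org/x/crypto/ssh would be TCP keep-alives on the underlying connection plus a periodic application-level keepalive request. A sketch, with illustrative names rather than the framework's actual dialer:

```go
package sshkeepalive

import (
	"net"
	"time"

	"golang.org/x/crypto/ssh"
)

// dialWithKeepAlive opens an SSH connection with TCP keep-alives enabled and a
// background "keepalive@openssh.com" ping, so a dead connection fails fast
// instead of hanging a test run.
func dialWithKeepAlive(addr string, cfg *ssh.ClientConfig, interval time.Duration) (*ssh.Client, error) {
	dialer := net.Dialer{Timeout: cfg.Timeout, KeepAlive: interval}
	conn, err := dialer.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}

	c, chans, reqs, err := ssh.NewClientConn(conn, addr, cfg)
	if err != nil {
		conn.Close()
		return nil, err
	}
	client := ssh.NewClient(c, chans, reqs)

	// Application-level keepalive, roughly what ServerAliveInterval does for the
	// OpenSSH client: if a ping fails, close the client so callers see the error
	// immediately.
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for range ticker.C {
			if _, _, err := client.SendRequest("keepalive@openssh.com", true, nil); err != nil {
				client.Close()
				return
			}
		}
	}()

	return client, nil
}
```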

@blakerouse
Contributor

It is also safe to retry most of the commands that fail. Just being more defensive in the execution of SSH commands can improve stability on its own.
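A sketch of that kind of defensive execution, assuming the commands routed through it are known to be safe to re-run (status checks, log reads); the helper name and backoff are illustrative:

```go
package sshretry

import (
	"fmt"
	"time"

	"golang.org/x/crypto/ssh"
)

// runIdempotent re-runs a command that is safe to execute more than once,
// opening a fresh session per attempt so a broken session does not poison the
// retries. Commands with side effects (install, enroll) must not use this path.
func runIdempotent(client *ssh.Client, cmd string, attempts int) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		session, err := client.NewSession()
		if err != nil {
			lastErr = err
		} else {
			out, runErr := session.CombinedOutput(cmd)
			session.Close()
			if runErr == nil {
				return out, nil
			}
			lastErr = runErr
		}
		// Simple linear backoff between attempts.
		time.Sleep(time.Duration(i+1) * time.Second)
	}
	return nil, fmt.Errorf("%q failed after %d attempts: %w", cmd, attempts, lastErr)
}
```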

@rdner
Member Author

rdner commented Apr 4, 2024

Some improvements to the SSH connection management (reconnect and TCP keep alive) were made in #4498

@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@rdner
Member Author

rdner commented Sep 19, 2024

Closing since we achieved sufficient stability in the runners by adding retries.

@rdner closed this as completed Sep 19, 2024