
Better orchestration of integration test runners #4410

Closed
rdner opened this issue Mar 13, 2024 · 8 comments
Labels: enhancement, Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane

Comments

@rdner
Member

rdner commented Mar 13, 2024

Describe the enhancement:

Based on my observations over the last 2 months, the root cause of most of the issues we're experiencing with OGC while running our integration tests is the communication via SSH.

Some of the reports can be found in #4356.

SSH is not very resilient to connection issues and interruptions, and we cannot simply add a retry, because most of the commands we run over SSH are meant to be executed only once.

The goal of this enhancement is to minimize the interactions that go through SSH.

We could implement the following improvements to make our integration tests more stable:

  1. It should be possible to prepare a single script (per OS) with all commands needed for initialization and execution of integration tests on a remote VM. We should run this script only once via SSH instead of sending separate commands (a sketch follows this list). Buildkite does something similar – sends a complex manifest to a remote machine and then runs it there.
  2. VMs that run our integration tests should have access to an S3 bucket or any other artifact storage where they would upload their results/test logs (should be a part of the script mentioned above).
  3. The main script/orchestrator should:
    3.1 watch the artifact storage for the results to appear, without any communication with the VMs.
    3.2 watch the state of the VMs via OGC and fail the tests if one of the machines ends up in a wrong state.
  4. All read-only communication via OGC should have retries.
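To make item 1 concrete, here is a minimal sketch of what "deliver one script and run it once" could look like on the orchestrator side, assuming the runner keeps using golang.org/x/crypto/ssh. The function name, the script handling, and the POSIX-only delivery command are illustrative, not the framework's actual API:

```go
package sshrun

import (
	"bytes"
	"fmt"
	"os"

	"golang.org/x/crypto/ssh"
)

// runInitScript copies a prepared per-OS script to the VM and executes it once,
// so the whole init + test run needs only two SSH round trips instead of one
// round trip per command. (Illustrative; Windows would need a different
// delivery command.)
func runInitScript(client *ssh.Client, localPath, remotePath string) error {
	script, err := os.ReadFile(localPath)
	if err != nil {
		return fmt.Errorf("reading %s: %w", localPath, err)
	}

	// First session: stream the script to the VM and make it executable.
	deliver, err := client.NewSession()
	if err != nil {
		return fmt.Errorf("opening delivery session: %w", err)
	}
	defer deliver.Close()
	deliver.Stdin = bytes.NewReader(script)
	if err := deliver.Run(fmt.Sprintf("cat > %s && chmod +x %s", remotePath, remotePath)); err != nil {
		return fmt.Errorf("delivering script: %w", err)
	}

	// Second session: run the script exactly once; everything else (uploading
	// results, cleanup) happens inside the script on the VM.
	run, err := client.NewSession()
	if err != nil {
		return fmt.Errorf("opening run session: %w", err)
	}
	defer run.Close()
	if out, err := run.CombinedOutput(remotePath); err != nil {
		return fmt.Errorf("running script: %w\n%s", err, out)
	}
	return nil
}
```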

Describe a specific use case for the enhancement or feature:

Our integration tests occasionally fail because of VM orchestration issues or while communicating with the VMs; see the reports collected in #4356.

What is the definition of done?

  • While running integration tests, we use SSH only once, to deliver and run a script
  • All read-only orchestration operations have retries
  • Runners deliver their artifacts directly to artifact storage (a sketch follows below)
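For the last point, the runner-side upload could be as small as the sketch below. It assumes the AWS SDK for Go v2 and a hypothetical runs/<run-id>/<file> key layout; the issue deliberately leaves the backend open (S3 "or any other artifact storage"):

```go
package results

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// uploadResults pushes a results/log file straight from the VM to the bucket,
// so the orchestrator only has to watch the bucket instead of pulling files
// back over SSH. Bucket name and key layout are illustrative.
func uploadResults(ctx context.Context, bucket, runID, path string) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return fmt.Errorf("loading AWS config: %w", err)
	}

	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("opening %s: %w", path, err)
	}
	defer f.Close()

	_, err = s3.NewFromConfig(cfg).PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(fmt.Sprintf("runs/%s/%s", runID, filepath.Base(path))),
		Body:   f,
	})
	if err != nil {
		return fmt.Errorf("uploading %s: %w", path, err)
	}
	return nil
}
```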
@rdner added the enhancement and Team:Elastic-Agent labels Mar 13, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@blakerouse
Contributor

  1. It should be possible to prepare a single script (per OS) with all commands needed for initialization and execution of integration tests on a remote VM. We should run this script only once via SSH instead of sending separate commands. Buildkite does something similar – sends a complex manifest to a remote machine and then runs it there.

Not a fan of using a script. Script programming sucks; we should never switch to using scripts. I would be fine with changing to Go code that is run instead.

  2. VMs that run our integration tests should have access to an S3 bucket or any other artifact storage where they would upload their results/test logs (should be a part of the script mentioned above).

I don't see why we need to add S3 as another dependency. We can add retries on pulling the content.

@leehinman
Contributor

So things like CFEngine, Chef, Puppet, Ansible, and SaltStack are all solutions to this kind of problem. I don't want any of those in our testing framework, but I think it is worth looking into how they solve it. There are lots of subtle edge cases, and I'd rather learn from others than discover all those sharp edges on our own.

@cachedout
Contributor

Drive-by comment: have you tried just tuning SSH to improve reliability? Stuff like turning on multiplexing or tuning ServerAliveInterval on the client side might improve things substantially.
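Since the framework drives SSH from Go rather than the ssh CLI, the rough equivalent of ServerAliveInterval with golang.org/x/crypto/ssh would be TCP keep-alives on the underlying connection plus a periodic application-level keepalive request. A sketch, with illustrative names rather than the framework's actual dialer:

```go
package sshkeepalive

import (
	"net"
	"time"

	"golang.org/x/crypto/ssh"
)

// dialWithKeepAlive opens an SSH connection with TCP keep-alives enabled and a
// background "keepalive@openssh.com" ping, so a dead connection fails fast
// instead of hanging a test run.
func dialWithKeepAlive(addr string, cfg *ssh.ClientConfig, interval time.Duration) (*ssh.Client, error) {
	dialer := net.Dialer{Timeout: cfg.Timeout, KeepAlive: interval}
	conn, err := dialer.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}

	c, chans, reqs, err := ssh.NewClientConn(conn, addr, cfg)
	if err != nil {
		conn.Close()
		return nil, err
	}
	client := ssh.NewClient(c, chans, reqs)

	// Application-level keepalive, roughly what ServerAliveInterval does for the
	// OpenSSH client: if a ping fails, close the client so callers see the error
	// immediately.
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for range ticker.C {
			if _, _, err := client.SendRequest("keepalive@openssh.com", true, nil); err != nil {
				client.Close()
				return
			}
		}
	}()

	return client, nil
}
```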

@blakerouse
Contributor

It is also safe to retry most of the commands that fail. Just being more defensive in the execution of SSH commands can improve stability on its own.
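A sketch of that kind of defensive execution, assuming the commands routed through it are known to be safe to re-run (status checks, log reads); the helper name and backoff are illustrative:

```go
package sshretry

import (
	"fmt"
	"time"

	"golang.org/x/crypto/ssh"
)

// runIdempotent re-runs a command that is safe to execute more than once,
// opening a fresh session per attempt so a broken session does not poison the
// retries. Commands with side effects (install, enroll) must not use this path.
func runIdempotent(client *ssh.Client, cmd string, attempts int) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		session, err := client.NewSession()
		if err != nil {
			lastErr = err
		} else {
			out, runErr := session.CombinedOutput(cmd)
			session.Close()
			if runErr == nil {
				return out, nil
			}
			lastErr = runErr
		}
		// Simple linear backoff between attempts.
		time.Sleep(time.Duration(i+1) * time.Second)
	}
	return nil, fmt.Errorf("%q failed after %d attempts: %w", cmd, attempts, lastErr)
}
```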

@rdner
Member Author

rdner commented Apr 4, 2024

Some improvements to the SSH connection management (reconnect and TCP keep alive) were made in #4498

@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@rdner
Member Author

rdner commented Sep 19, 2024

Closing since we achieved sufficient stability in the runners by adding retries.

@rdner closed this as completed Sep 19, 2024