Better orchestration of integration test runners #4410
Comments
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Not a fan of using a script. Script programming sucks; we should never change to using scripts. I would be fine with changing to Go code that is run instead.
I don't see why we need to add S3 as another dependency. We can add retries when pulling the content.
So things like CFEngine, Chef, Puppet, Ansible, and SaltStack are all solutions to this kind of problem. I don't want any of those in our testing framework, but I think it is worth looking into how they solve it. There are lots of subtle edge cases, and I'd rather learn from others than discover all those sharp edges on our own.
Drive-by comment: have you tried just tuning SSH to improve reliability? Stuff like turning on multiplexing or tuning
It is also safe to retry most of the commands that fail. Just being more defensive in the execution of the SSH commands can also improve stability.
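For reference, SSH multiplexing and keepalives of the kind suggested above can be enabled with a client-side config roughly like the following. This is an illustrative sketch only: the `Host` pattern, socket path, and timeout values are placeholders that would need to match the actual CI VM naming and environment.

```
# Hypothetical ~/.ssh/config fragment for the test-runner VMs
Host ci-vm-*
  # Reuse one TCP connection for all sessions to the same host
  ControlMaster auto
  ControlPath ~/.ssh/cm-%r@%h:%p
  ControlPersist 10m
  # Detect dead connections instead of hanging indefinitely
  ServerAliveInterval 15
  ServerAliveCountMax 4
  TCPKeepAlive yes
```

Multiplexing avoids repeated TCP/auth handshakes per command, and the keepalive options make the client notice a broken connection within about a minute rather than blocking forever.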
Some improvements to the SSH connection management (reconnect and TCP keep alive) were made in #4498 |
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
Closing since we achieved sufficient stability in the runners by adding retries. |
Describe the enhancement:
Based on my observations over the last 2 months, the root cause of most of the issues we're experiencing with OGC while running our integration tests is the communication via SSH.
Some of the reports can be found here #4356
SSH is not very resilient to connection issues and interruptions, and we cannot simply add a retry, because the commands we run over SSH are, in most cases, meant to be executed only once.
The goal of this enhancement should be minimizing interactions via SSH.
We could implement the following improvements to make our integration tests more stable:
3.1. Watch the artifact storage for the results to appear, without any communication with the VMs.
3.2. Watch the states of the VMs via OGC and fail the tests if one of the machines is in a wrong state.
Describe a specific use case for the enhancement or feature:
Our integration tests occasionally fail because of VM orchestration problems or because of failures while communicating with the VMs.
What is the definition of done?