-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix install/enroll command not failing when the daemon restart fails #3815
Conversation
Pinging @elastic/elastic-agent (Team:Elastic-Agent) |
This pull request does not have a backport label. Could you fix it @belimawr? 🙏
NOTE: |
9dbfe0a
to
eb35ba8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of nitpicks here and there, overall the change looks ok
buildkite test this |
I tried this out by modifying agent so that it unconditionally retried restarting the daemon with the following patch applied: diff --git a/internal/pkg/agent/cmd/enroll_cmd.go b/internal/pkg/agent/cmd/enroll_cmd.go
index e704990426..a1facdf185 100644
--- a/internal/pkg/agent/cmd/enroll_cmd.go
+++ b/internal/pkg/agent/cmd/enroll_cmd.go
@@ -452,10 +452,10 @@ func (c *enrollCmd) prepareFleetTLS() error {
}
func (c *enrollCmd) daemonReloadWithBackoff(ctx context.Context) error {
- err := c.daemonReload(ctx)
- if err == nil {
- return nil
- }
+ err := errors.New("this is a test error")
+ // if err == nil {
+ // return nil
+ // }
if errors.Is(err, context.DeadlineExceeded) ||
errors.Is(err, context.Canceled) {
return fmt.Errorf("could not reload daemon: %w", err)
@@ -470,9 +470,7 @@ func (c *enrollCmd) daemonReloadWithBackoff(ctx context.Context) error {
c.log.Info("Retrying to restart...")
err = c.daemonReload(ctx)
- if err == nil {
- return nil
- }
+
if errors.Is(err, context.DeadlineExceeded) ||
errors.Is(err, context.Canceled) {
return fmt.Errorf("could not reload daemon after %d retries: %w",
(END) I got the following output. Note that the existing implementation of the progress bar seems to buffer all logs until the step completes, so I didn't see any output until it failed completely and then everything was dumped out.
There are several problems here:
|
7fbbbf8
to
ad50c63
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you missed a check on one of the tests, but apart from that all good
I see your point there, I reduced the time so iterations are faster now. How many times do you think we should retry? I don't know the details of the shutdown process from the Elastic-Agent so I'm really not sure what makes sense here. If there is any sort of communication with Fleet-Server or even waiting Beats finishing to flush some data, it might make sense to have longer wait times in case ES is overloaded.
That comes from a totally different bit of the codebase, the bit that logs it is waiting for an external process to finish, hence it does not know the reason of the failure. elastic-agent/internal/pkg/agent/cmd/install.go Lines 256 to 262 in e3c0695
|
e4cb918
to
1d12fb4
Compare
// Stage 3: Make sure there are no errors in logs | ||
t.Log("Making sure there are no error logs") | ||
docs = queryESDocs(t, func() (estools.Documents, error) { | ||
return estools.CheckForErrorsInLogs(info.ESClient, info.Namespace, []string{ | ||
// acceptable error messages (include reason) | ||
"Error dialing dial tcp 127.0.0.1:9200: connect: connection refused", // beat is running default config before its config gets updated | ||
"Global configuration artifact is not available", // Endpoint: failed to load user artifact due to connectivity issues | ||
"Failed to download artifact", | ||
"Failed to initialize artifact", | ||
"Failed to apply initial policy from on disk configuration", | ||
"elastic-agent-client error: rpc error: code = Canceled desc = context canceled", // can happen on restart | ||
"add_cloud_metadata: received error failed requesting openstack metadata: Get \\\"https://169.254.169.254/2009-04-04/meta-data/instance-id\\\": dial tcp 169.254.169.254:443: connect: connection refused", // okay for the openstack metadata to not work | ||
"add_cloud_metadata: received error failed requesting openstack metadata: Get \\\"https://169.254.169.254/2009-04-04/meta-data/hostname\\\": dial tcp 169.254.169.254:443: connect: connection refused", // okay for the cloud metadata to not work | ||
"add_cloud_metadata: received error failed requesting openstack metadata: Get \\\"https://169.254.169.254/2009-04-04/meta-data/placement/availability-zone\\\": dial tcp 169.254.169.254:443: connect: connection refused", // okay for the cloud metadata to not work | ||
"add_cloud_metadata: received error failed requesting openstack metadata: Get \\\"https://169.254.169.254/2009-04-04/meta-data/instance-type\\\": dial tcp 169.254.169.254:443: connect: connection refused", // okay for the cloud metadata to not work | ||
"add_cloud_metadata: received error failed with http status code 404", // okay for the cloud metadata to not work | ||
"add_cloud_metadata: received error failed fetching EC2 Identity Document: operation error ec2imds: GetInstanceIdentityDocument, http response error StatusCode: 404, request to EC2 IMDS failed", // okay for the cloud metadata to not work | ||
}) | ||
}) | ||
t.Logf("error logs: Got %d documents", len(docs.Hits.Hits)) | ||
for _, doc := range docs.Hits.Hits { | ||
t.Logf("%#v", doc.Source) | ||
} | ||
require.Empty(t, docs.Hits.Hits) | ||
|
||
// Stage 4: Make sure we have message confirming central management is running | ||
// Stage 3: Make sure we have message confirming central management is running |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove this change once #3917 is merged. I kept it here so the integration tests will, hopefully, pass.
Folks, I've resolved all comments, some I implemented what was asked, others I explained why it is implemented the way it is. Once all tests are passing I'll ask for a re-review. |
665cbe9
to
e587a8b
Compare
This pull request is now in conflicts. Could you fix it? 🙏
|
In a situation where the control socket isn't available yet, most likely it will become available fairly quickly once the agent creates it. The problem is we previously only waited once, when really we probably needed to wait a few seconds. I would say retrying every 1s is reasonable for 120 total retries over 2 minutes.
Does this retry logic actually impact the shutdown process? This wasn't something I was expecting to have to consider. |
Asserting there are no errors in the logs from Elastic-Agent and all Beats is flaky and does not ensure the Elastic-Agent is working correctly. The test already assert the healthy of all components, so there is no need to look in the logs. The number of exceptions this assertion for no log errors is already an example of how fragile this is. The Elastic-Agent life cycle is complex and some transient errors are expected, as the code evolves those errors or messages will change, making assertion on them flaky.
aa95353
to
13fd587
Compare
buildkite test this |
Quality Gate failedFailed conditions35.1% Coverage on New Code (required ≥ 40%) |
Sonarqube is failing due to code coverage below 40%, force merging it. |
…3815) * fix install/enroll cmd not failing when agent restart fails * surface errors that might occur during enroll * fail install command if agent cannot be restarted * do not print success message if there was an enroll error. Print an error message and the error instead * add logs to show the different enroll attempts * add more context t errors * refactor internal/pkg/agent/install/perms_unix.go and add more context to errors restore main version * ignore agent restart error on enroll tests as there is no agent to be restarted * daemonReloadWithBackoff does not retry on context deadline exceeded * Do not reload the Agent daemon if enrolling from a container The enroll command would always try to restart the daemon, however when enrolling as part of the container command, there is no running daemon to reload. This commit adds a CLI flag, --skip-daemon-reload, to the enroll command to skip the reloading step, the container command now makes use of this flag. * Apply suggestions from code review Co-authored-by: Paolo Chilà <[email protected]> * PR improvements * Add integration test * make lint happy * PR improvements * Fix after rebase * Fix some issues * more PR improvments * Fix enroll command * Fix TestContainterCMD * Fix implementation * Remove flaky integration test assertion Asserting there are no errors in the logs from Elastic-Agent and all Co-authored-by: Anderson Queriroz <[email protected]> Co-authored-by: Paolo Chilà <[email protected]> Co-authored-by: Pierre HILBERT <[email protected]> (cherry picked from commit 938f0b9) # Conflicts: # internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets_test.go
…e daemon restart fails (#3962) * Fix install/enroll command not failing when the daemon restart fails (#3815) * Do not reload the Agent daemon if enrolling from a container The enroll command would always try to restart the daemon, however when enrolling as part of the container command, there is no running daemon to reload. This commit adds a CLI flag, --skip-daemon-reload, to the enroll command to skip the reloading step, the container command now makes use of this flag. Co-authored-by: Anderson Queriroz <[email protected]> Co-authored-by: Paolo Chilà <[email protected]> Co-authored-by: Pierre HILBERT <[email protected]> (cherry picked from commit 938f0b9) --------- Co-authored-by: Tiago Queiroz <[email protected]> Co-authored-by: Anderson Queiroz <[email protected]>
…lastic#3815) * fix install/enroll cmd not failing when agent restart fails * surface errors that might occur during enroll * fail install command if agent cannot be restarted * do not print success message if there was an enroll error. Print an error message and the error instead * add logs to show the different enroll attempts * add more context t errors * refactor internal/pkg/agent/install/perms_unix.go and add more context to errors restore main version * ignore agent restart error on enroll tests as there is no agent to be restarted * daemonReloadWithBackoff does not retry on context deadline exceeded * Do not reload the Agent daemon if enrolling from a container The enroll command would always try to restart the daemon, however when enrolling as part of the container command, there is no running daemon to reload. This commit adds a CLI flag, --skip-daemon-reload, to the enroll command to skip the reloading step, the container command now makes use of this flag. * Apply suggestions from code review Co-authored-by: Paolo Chilà <[email protected]> * PR improvements * Add integration test * make lint happy * PR improvements * Fix after rebase * Fix some issues * more PR improvments * Fix enroll command * Fix TestContainterCMD * Fix implementation * Remove flaky integration test assertion Asserting there are no errors in the logs from Elastic-Agent and all Co-authored-by: Anderson Queriroz <[email protected]> Co-authored-by: Paolo Chilà <[email protected]> Co-authored-by: Pierre HILBERT <[email protected]>
…e daemon restart fails (#3962) * Fix install/enroll command not failing when the daemon restart fails (#3815) * Do not reload the Agent daemon if enrolling from a container The enroll command would always try to restart the daemon, however when enrolling as part of the container command, there is no running daemon to reload. This commit adds a CLI flag, --skip-daemon-reload, to the enroll command to skip the reloading step, the container command now makes use of this flag. Co-authored-by: Anderson Queriroz <[email protected]> Co-authored-by: Paolo Chilà <[email protected]> Co-authored-by: Pierre HILBERT <[email protected]> (cherry picked from commit 938f0b9) --------- Co-authored-by: Tiago Queiroz <[email protected]> Co-authored-by: Anderson Queiroz <[email protected]>
For the agent to be enrolled it needs to restart after the enrol process is completed, in order to pickup the new config and "connect" to fleet-server.
This change makes the enrol command to fail if it cannot restart the agent after enrolling on fleet. A
--skip-daemon-reload
CLI flag is also added to theenroll
command because thecontainer
command callsenroll
, however there is no running daemon to restart. Skipping the daemon reload on this case allows thecontainer
command to succeed.This PR brings back the changes introduced by #3554 that were reverted due to the
container
command not working and fixes the issues with thecontainer
command.665cbe9 does not belong to this PR but is required to get the integration tests to pass. I'll remove it once #3917 is merged
What does this PR do?
This change makes the enroll command to fail if it cannot restart the agent after enrolling on fleet
Why is it important?
For the agent to be enrolled it needs to restart after the enrol process is completed, in order to pickup the new config and "connect" to fleet-server.
Checklist
- [ ] I have made corresponding changes to the documentation- [ ] I have made corresponding change to the default configuration files./changelog/fragments
using the changelog toolAuthor's Checklist
How to test this PR locally
Export the follwing environment variables
Then execute the
container
command:The Elastic-Agent should enrol in Fleet and start working.
Related issues
## Use cases## Screenshots## LogsQuestions to ask yourself