Change upgrade watcher to use StateWatch #3622
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Should we be removing this line? Or do we need to leave it in there for older Agent versions (that don't have the watcher implementation in this PR)?
case err := <-failedCh:
	if err != nil {
		if failedErr == nil {
			failedCount++
It looks like we only increment failedCount when we received a non-nil error from failedCh but previously there was no error (i.e. failedErr == nil). So that means failedCount actually represents the number of times we flip-flopped from a non-error state to an error state, right? If my understanding is correct, could we rename this to something like flipFlopCount or something else indicating that it's counting flip-flops?
Done.
This has to remain for older versions.
LGTM.
buildkite test this
another unrelated failure... this time the endpoint integration
buildkite test this
What does this PR do?
Refactors how the upgrade watcher works, removing the need to parse the PID from any system service manager.
The watcher now tracks failures to connect to the daemon, lost connections to the daemon, agent failures, component failures, and flip-flopping between healthy and error states. Tracking lost connections to the daemon makes the previous PID tracking unnecessary, so it has been removed. Error tracking is also more robust: an error must persist for at least 30 seconds before it counts as a failure, which gives the Elastic Agent a chance to hit an issue and then recover. Previously, if the State call happened while an error was present, the watcher would immediately treat the upgrade as failed.
This also includes unit testing for the AgentWatcher service that previously had no unit tests.
Why is it important?
Completely removes the need to execute service managers and read their output, and with it the need for root access. Improves the overall logic and provides better testing.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
./changelog/fragments using the changelog tool