Change upgrade watcher to use StateWatch #3622

blakerouse · 2023-10-17T14:53:10Z

What does this PR do?

Refactors how the upgrade watcher works, removing the need to parse the PID from any system service manager.

The watcher now tracks failures to connect to the daemon, lost connections to the daemon, agent failures, component failures, and flip-flopping between healthy and error states. Tracking lost connections to the daemon allows the removal of the PID tracking that used to be present. Error tracking is also more robust: an error must persist for at least 30 seconds before it counts as a failure, which gives the Elastic Agent a chance to hit an issue and then recover. Previously, if the State call happened while an error was present, the watch was immediately marked as failed.

This also adds unit tests for the AgentWatcher service, which previously had none.

Why is it important?

Completely removes the requirement to execute service managers and read their output (removing the need for root access). Improves the overall logic and provides better testing.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool
I have added an integration test or an E2E test (covered by existing tests)

elasticmachine · 2023-10-17T14:55:55Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

elasticmachine · 2023-10-17T15:00:58Z

💚 Build Succeeded

Build stats

Start Time: 2023-10-18T20:41:10.213+0000
Duration: 25 min 25 sec

Test stats 🧪

Test	Results
Failed	0
Passed	6513
Skipped	59
Total	6572

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages.
run integration tests : Run the Elastic Agent Integration tests.
run end-to-end tests : Generate the packages and run the E2E Tests.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine · 2023-10-17T15:47:06Z

🌐 Coverage report

Name	Metrics % (`covered/total`)	Diff
Packages	98.81% (`83/84`)	👍
Files	66.997% (`203/303`)	👎 -0.109
Classes	65.946% (`366/555`)	👍 0.227
Methods	53.128% (`1155/2174`)	👍 0.131
Lines	39.542% (`13633/34477`)	👍 0.28
Conditionals	100.0% (`0/0`)	💚

internal/pkg/agent/application/upgrade/watcher.go

ycombinator · 2023-10-17T21:07:28Z

elastic-agent/testing/upgradetest/watcher.go

Line 22 in defde80

crash_check.interval: 15s

Should we be removing this line? Or do we need to leave it in there for older Agent versions (that don't have the watcher implementation in this PR)?

ycombinator · 2023-10-17T21:17:06Z

internal/pkg/agent/application/upgrade/watcher.go

+			case err := <-failedCh:
+				if err != nil {
+					if failedErr == nil {
+						failedCount++


It looks like we only increment failedCount when we received a non-nil error from failedCh but previously there was no error (i.e. failedErr == nil). So that means failedCount actually represents the number of times we flip-flopped from a non-error state to an error state, right? If my understanding is correct, could we rename this to something like flipFlopCount or something else indicating that it's counting flip-flops?

blakerouse · 2023-10-18T15:19:27Z

elastic-agent/testing/upgradetest/watcher.go

Line 22 in defde80

crash_check.interval: 15s

Should we be removing this line? Or do we need to leave it in there for older Agent versions (that don't have the watcher implementation in this PR)?

This has to remain for older versions.

ycombinator

LGTM.

blakerouse · 2023-10-19T01:05:16Z

buildkite test this

blakerouse · 2023-10-19T03:10:26Z

another unrelated failure... this time the endpoint integration

blakerouse · 2023-10-19T03:10:34Z

buildkite test this

elastic-sonarqube · 2023-10-19T05:40:43Z

SonarQube Quality Gate

0 Bugs
0 Vulnerabilities
0 Security Hotspots
2 Code Smells

67.4% Coverage
6.3% Duplication

blakerouse added 4 commits October 13, 2023 11:43

Add PID to status output.

f3965a0

Fix watcher interval.

cf8cdf1

Refactor watcher to use StateWatch removing the need for PID watching completely.

4a3834e

Merge branch 'main' into watcher-stream

427240c

blakerouse added the Team:Elastic-Agent and backport-skip labels Oct 17, 2023

blakerouse self-assigned this Oct 17, 2023

Add changelog.

07c6798

blakerouse marked this pull request as ready for review October 17, 2023 14:55

blakerouse requested a review from a team as a code owner October 17, 2023 14:55

blakerouse requested review from michalpristas and faec October 17, 2023 14:55

pierrehilbert requested a review from ycombinator October 17, 2023 14:59

blakerouse added 6 commits October 17, 2023 11:02

Fix lint.

08fe2d6

Run mage check.

a59dde0

Fix TestStandaloneUpgradeRollbackOnRestarts to use the build watcher.

d5fcbd7

Make lint happy.

ae38370

Annoying lint.

87a9b26

Make notice.

013bf93

blakerouse added 2 commits October 17, 2023 11:52

Fix windows.

11993fb

Fix unit test connection on Windows.

abde993

ycombinator reviewed Oct 17, 2023

internal/pkg/agent/application/upgrade/watcher.go

ycombinator reviewed Oct 17, 2023

Adjust grpc client.

9acd53a

Add back PID tracking.

d5a02e1

More watcher fixes.

3d14b0e

ycombinator approved these changes Oct 18, 2023

blakerouse enabled auto-merge (squash) October 19, 2023 01:05

blakerouse merged commit d64d704 into elastic:main Oct 19, 2023
8 checks passed

blakerouse deleted the watcher-stream branch October 19, 2023 13:02
