Track upgrade details #3527

ycombinator · 2023-10-05T00:22:04Z

What does this PR do?

This PR allows the Agent to internally track most of the important states of the upgrade process. The two states not tracked as part of this PR are UPG_SCHEDULING and UPG_ROLLBACK; tracking for these will be added in follow-up PRs (see #3119 (comment) for rationale).

Implementation Details

This PR introduces a new package, github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/details. This package introduces two new types:

State: to represent the various upgrade states, e.g. UPG_REQUESTED, UPG_DOWNLOADING, etc.
Details: to encapsulate the upgrade state and other relevant upgrade details, e.g. the target version, any error upon failure, etc.

The Details type uses an Observer design pattern to notify any interested consumers of changes.

A new field, UpgradeDetails, and an accompanying setter method, setUpgradeDetails, are introduced on the coordinator.state struct. In the coordinator's Upgrade method, a new Details object is constructed and the new setter method is registered as an observer on it. As the agent progresses through (most of) the upgrade steps, the details object is updated. The Observer pattern then causes these updates to be reflected in coordinator.state.UpgradeDetails.
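To make the flow above concrete, here is a minimal sketch of the observer wiring, not the PR's actual code. The names `Observer`, `NewDetails`, `RegisterObserver`, and `coordinatorState`, as well as the version string, are assumptions chosen for illustration; only `State`, `Details`, `SetState`, and `setUpgradeDetails` are named in the description.

```go
package main

import (
	"fmt"
	"sync"
)

// State represents an upgrade state such as UPG_REQUESTED.
type State string

const (
	StateRequested   State = "UPG_REQUESTED"
	StateDownloading State = "UPG_DOWNLOADING"
)

// Observer (hypothetical name) is called with the details whenever they change.
type Observer func(d *Details)

// Details holds the upgrade state and other relevant upgrade details.
type Details struct {
	TargetVersion string
	State         State

	mu        sync.Mutex
	observers []Observer
}

// NewDetails is a hypothetical constructor.
func NewDetails(targetVersion string, initial State) *Details {
	return &Details{TargetVersion: targetVersion, State: initial}
}

// RegisterObserver adds an observer and immediately hands it the current details.
func (d *Details) RegisterObserver(o Observer) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.observers = append(d.observers, o)
	o(d)
}

// SetState updates the state and notifies every registered observer.
func (d *Details) SetState(s State) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.State = s
	for _, o := range d.observers {
		o(d)
	}
}

// coordinatorState stands in for the coordinator.state struct.
type coordinatorState struct {
	UpgradeDetails *Details
}

func (s *coordinatorState) setUpgradeDetails(d *Details) { s.UpgradeDetails = d }

func main() {
	var cs coordinatorState
	det := NewDetails("8.12.0", StateRequested) // version is a made-up example
	det.RegisterObserver(cs.setUpgradeDetails)
	det.SetState(StateDownloading)
	fmt.Println(cs.UpgradeDetails.State) // prints UPG_DOWNLOADING
}
```

In this sketch the coordinator never polls; it only sees updates because its setter was registered as an observer.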

Why is it important?

To lay the foundations for increasing visibility into the Agent upgrade process.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~I have made corresponding changes to the documentation~~
~~I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
~~I have added an entry in ./changelog/fragments using the changelog tool~~
~~I have added an integration test or an E2E test~~

Related issues

Relates to Track and report upgrade details #3119 (comment)

elasticmachine · 2023-10-05T00:22:07Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

elasticmachine · 2023-10-05T00:35:07Z

💚 Build Succeeded

Build stats

Start Time: 2023-10-19T19:51:43.746+0000
Duration: 26 min 19 sec

Test stats 🧪

Test	Results
Failed	0
Passed	6577
Skipped	59
Total	6636

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages.
run integration tests : Run the Elastic Agent Integration tests.
run end-to-end tests : Generate the packages and run the E2E Tests.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

mergify · 2023-10-05T01:44:16Z

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b track-upgrade-details upstream/track-upgrade-details
git merge upstream/main
git push upstream track-upgrade-details

pchila · 2023-10-05T07:28:51Z

internal/pkg/agent/application/upgrade/details/details.go

+	d.mu.RLock()
+	defer d.mu.RUnlock()


Why do we use an RLock() here? Is it expected to have multiple concurrent NotifyObservers() calls (all the other functions are guarded by a normal Lock())?
If NotifyObservers() cannot be called concurrently, then we don't need an RWMutex.

There could be multiple observers registered. In such a case they would be called serially, in the loop inside the notifyObservers method.

However, even if only a single observer were registered, the RLock call is needed to ensure that the internal state of d, e.g. d.State or d.Metadata, hasn't changed between the caller calling NotifyObservers() and the observer function eventually being called inside the notifyObserver method. Also, I used an RLock instead of a Lock because I'm okay with multiple reads of d's internal state while NotifyObservers() is executing; I just don't want a write to sneak in there, as explained earlier.

The goal I'm trying to achieve here is that, for a given UpgradeDetails object, d, I want all its observers to see the same value of d when d.NotifyObservers() is called. Because an RLock() would prevent concurrent writes to the fields of d, I think my goal is achieved.

You're right that a regular Lock() would also achieve that goal but it would be a stronger lock than I need. I don't necessarily care if there are two concurrent invocations of d.NotifyObservers() for the same d object; in fact, I would want both to be executed without the first one blocking the second one, which would happen if we used a Lock() here.

I think I am missing something here as I don't see other functions holding an RLock for reading state. If those reads happen without any lock at all, again there's no difference between RLock and a normal mutex.

Are you referring to notifyObservers() and notifyObserver()? I did not use any mutexes in there because otherwise we end up with a deadlock when d.SetState() is called. These are private methods and the public methods that call them acquire/release the mutex.
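The locking convention described in this exchange can be sketched as follows. This is an illustration under assumptions, not the code under review: the field names and the `RegisterObserver` helper are hypothetical, and observers receive a plain string for brevity. Public methods take the lock; the private `notifyObservers` assumes the caller already holds it, because `sync.RWMutex` is not reentrant.

```go
package main

import (
	"fmt"
	"sync"
)

type Details struct {
	mu        sync.RWMutex
	state     string
	observers []func(state string)
}

// notifyObservers assumes the caller already holds d.mu (read or write).
// It must not lock d.mu itself: locking again here would deadlock when it
// is called from SetState, which already holds the write lock.
func (d *Details) notifyObservers() {
	for _, o := range d.observers {
		o(d.state)
	}
}

// NotifyObservers only reads d, so a read lock is enough to keep writers out.
func (d *Details) NotifyObservers() {
	d.mu.RLock()
	defer d.mu.RUnlock()
	d.notifyObservers()
}

// SetState writes d, so it takes the exclusive lock before notifying.
func (d *Details) SetState(s string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.state = s
	d.notifyObservers()
}

// RegisterObserver is a hypothetical helper for this sketch.
func (d *Details) RegisterObserver(o func(string)) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.observers = append(d.observers, o)
}

func main() {
	d := &Details{state: "UPG_REQUESTED"}
	var seen []string
	d.RegisterObserver(func(s string) { seen = append(seen, s) })
	d.SetState("UPG_DOWNLOADING")
	d.NotifyObservers()
	fmt.Println(seen) // prints [UPG_DOWNLOADING UPG_DOWNLOADING]
}
```

Note that with a read lock, two goroutines could run `NotifyObservers` at the same time, which is why the follow-up questions about observer thread safety matter.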

Multiple RLock() calls can be held simultaneously, so in the current code the notifications would run concurrently. Is this the desired result?

Yes, as explained in the previous comment.

If that's the case, did we write down somewhere that the Observers' code MUST be thread-safe?

No, but you bring up an excellent point. Observers should receive a (deep) copy of the details object, not a pointer to it. The idea is that the observer receives a "snapshot-in-time" of the upgrade details at the time of observation. I have fixed this now in 5bbb507.
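A rough sketch of the "snapshot-in-time" idea, assuming a simplified `Details` with a map-valued `Metadata` field (the real field types are not shown in this thread, and the `snapshot` helper and version string are hypothetical):

```go
package main

import "fmt"

// Details is a simplified stand-in; Metadata is assumed to be a map here
// so that a shallow copy would visibly share state.
type Details struct {
	TargetVersion string
	State         string
	Metadata      map[string]string
}

// snapshot returns a deep copy so later writes to d do not leak into
// values already handed to observers.
func (d *Details) snapshot() Details {
	c := Details{TargetVersion: d.TargetVersion, State: d.State}
	if d.Metadata != nil {
		c.Metadata = make(map[string]string, len(d.Metadata))
		for k, v := range d.Metadata {
			c.Metadata[k] = v
		}
	}
	return c
}

func main() {
	d := &Details{
		TargetVersion: "8.12.0", // made-up example version
		State:         "UPG_DOWNLOADING",
		Metadata:      map[string]string{"download_percent": "0.25"},
	}

	var observed Details
	observer := func(s Details) { observed = s }
	observer(d.snapshot())

	// Later mutations do not affect what the observer already received.
	d.State = "UPG_WATCHING"
	d.Metadata["download_percent"] = "1.0"

	fmt.Println(observed.State, observed.Metadata["download_percent"]) // prints UPG_DOWNLOADING 0.25
}
```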

It's not just in the manipulation of the UpgradeDetails, it's a warning for the implementation of the observers to protect their own state against concurrent calls...

It's not just in the manipulation of the UpgradeDetails, it's a warning for the implementation of the observers to protect their own state against concurrent calls...

Now I think I might be missing something 😄. Not a rhetorical question: why should we add such a warning? Isn't it generally true that any object that wants to support concurrent access may want to protect its own state?

@pchila and I chatted over Zoom about this. The decision was to provide a setter method on UpgradeDetails for setting the download percentage, as this was the only operation that was being externally synchronized with a call to the NotifyObservers() method. By providing this setter and implementing the synchronization internally within that setter, we no longer need or allow writers of UpgradeDetails to directly notify observers; that happens automatically under the hood when these writers update some state of UpgradeDetails via setters. This is a much cleaner and safer approach!

internal/pkg/agent/application/upgrade/details/state.go

internal/pkg/agent/application/upgrade/step_download.go

mergify · 2023-10-13T12:34:57Z

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b track-upgrade-details upstream/track-upgrade-details
git merge upstream/main
git push upstream track-upgrade-details

internal/pkg/agent/application/coordinator/coordinator.go

cmacknz · 2023-10-13T19:08:01Z

internal/pkg/agent/application/upgrade/upgrade.go

@@ -182,6 +189,8 @@ func (u *Upgrader) Upgrade(ctx context.Context, version string, sourceURI string
 		return nil, err
 	}

+	det.SetState(details.StateWatching)


Since the agent restarts essentially right after this, whether or not we ever see any of these transitions in Fleet depends on whether a checkin happens to occur before we restart.

We would be guaranteed to see this on the first checkin of the new agent version if it were persisted somewhere. I assume something like this will be done when we add support for the UPG_ROLLBACK state?

Although realistically if the upgrade completes quickly and without error then knowing anything about it isn't that interesting. It's the slow and failure cases that are interesting to see.

Yeah, I haven't yet thought through the implementation for UPG_ROLLBACK, but I expect we'll use the upgrade marker file to persist this state so we can communicate it from the Upgrade Watcher process to the main Agent process. We may want to expand on that idea and use the upgrade marker file to persist UPG_WATCHING as well; in fact, it will probably be the initial state in the upgrade marker file when it's created.

Longer term it would be nice to persist all upgrade states in the upgrade marker file and use it to drive the upgrade state machine; that way if an Agent restarts anywhere in the middle of an upgrade, it should be able to pick up from where it left off (assuming every or almost every step of the upgrade process is idempotent).

internal/pkg/agent/application/upgrade/details/state.go

internal/pkg/agent/application/coordinator/coordinator.go

ycombinator · 2023-10-13T19:50:22Z

CI is failing because the TestCoordinatorInitiatesUpgrade unit test is timing out. Investigating...

ycombinator · 2023-10-13T20:52:36Z

CI is now failing because of the following data race. Investigating...

[2023-10-13T20:45:58.475Z] === FAIL: internal/pkg/agent/application/upgrade/artifact/download/http TestDownloadLogProgressWithLength (2.27s)
[2023-10-13T20:45:58.475Z] ==================
[2023-10-13T20:45:58.475Z] WARNING: DATA RACE
[2023-10-13T20:45:58.475Z] Write at 0x00c0001402c0 by goroutine 15:
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*detailsProgressObserver).ReportCompleted()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/progress_observer.go:117 +0x44
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*downloadProgressReporter).ReportComplete()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/progress_reporter.go:108 +0x1cc
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*Downloader).downloadFile()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go:222 +0xdc8
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*Downloader).download()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go:150 +0x2cc
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*Downloader).Download()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go:111 +0x1c4
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.TestDownloadLogProgressWithLength()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader_test.go:126 +0x6c4
[2023-10-13T20:45:58.475Z]   testing.tRunner()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1576 +0x180
[2023-10-13T20:45:58.475Z]   testing.(*T).Run.func1()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1629 +0x40
[2023-10-13T20:45:58.475Z] 
[2023-10-13T20:45:58.475Z] Previous write at 0x00c0001402c0 by goroutine 22:
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*detailsProgressObserver).Report()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/progress_observer.go:112 +0x48
[2023-10-13T20:45:58.475Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*downloadProgressReporter).Report.func1()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/progress_reporter.go:84 +0x29c
[2023-10-13T20:45:58.475Z] 
[2023-10-13T20:45:58.475Z] Goroutine 15 (running) created at:
[2023-10-13T20:45:58.475Z]   testing.(*T).Run()
[2023-10-13T20:45:58.475Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1629 +0x5b4
[2023-10-13T20:45:58.476Z]   testing.runTests.func1()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:2036 +0x80
[2023-10-13T20:45:58.476Z]   testing.tRunner()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1576 +0x180
[2023-10-13T20:45:58.476Z]   testing.runTests()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:2034 +0x6a8
[2023-10-13T20:45:58.476Z]   testing.(*M).Run()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1906 +0x8e0
[2023-10-13T20:45:58.476Z]   main.main()
[2023-10-13T20:45:58.476Z]       _testmain.go:57 +0x2b8
[2023-10-13T20:45:58.476Z] 
[2023-10-13T20:45:58.476Z] Goroutine 22 (running) created at:
[2023-10-13T20:45:58.476Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*downloadProgressReporter).Report()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/progress_reporter.go:64 +0x238
[2023-10-13T20:45:58.476Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*Downloader).downloadFile()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go:215 +0xba0
[2023-10-13T20:45:58.476Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*Downloader).download()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go:150 +0x2cc
[2023-10-13T20:45:58.476Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.(*Downloader).Download()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader.go:111 +0x1c4
[2023-10-13T20:45:58.476Z]   github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http.TestDownloadLogProgressWithLength()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/src/github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/artifact/download/http/downloader_test.go:126 +0x6c4
[2023-10-13T20:45:58.476Z]   testing.tRunner()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1576 +0x180
[2023-10-13T20:45:58.476Z]   testing.(*T).Run.func1()
[2023-10-13T20:45:58.476Z]       /var/lib/jenkins/workspace/_agent_elastic-agent-mbp_PR-3527/.gvm/versions/go1.20.9.linux.arm64/src/testing/testing.go:1629 +0x40
[2023-10-13T20:45:58.476Z] ==================

elasticmachine · 2023-10-13T22:31:03Z

🌐 Coverage report

Name	Metrics % (`covered/total`)	Diff
Packages	98.824% (`84/85`)	👍 0.014
Files	67.105% (`204/304`)	👍 0.109
Classes	66.19% (`370/559`)	👍 0.244
Methods	53.583% (`1174/2191`)	👍 0.341
Lines	39.807% (`13771/34594`)	👍 0.209
Conditionals	100.0% (`0/0`)	💚

elastic-sonarqube · 2023-10-19T21:37:22Z

SonarQube Quality Gate

0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell

84.1% Coverage
0.0% Duplication

ycombinator added the Team:Elastic-Agent, backport-skip, and skip-changelog labels Oct 5, 2023

ycombinator requested a review from a team as a code owner October 5, 2023 00:22

ycombinator requested review from blakerouse and faec October 5, 2023 00:22

ycombinator mentioned this pull request Oct 5, 2023

Track and report upgrade details #3119

Closed

3 tasks

mergify bot assigned ycombinator Oct 5, 2023

ycombinator added the enhancement label Oct 5, 2023

ycombinator mentioned this pull request Oct 5, 2023

Send upgrade details to Fleet Server in check-in API requests #3528

Merged

7 tasks

pchila reviewed Oct 5, 2023

ycombinator force-pushed the track-upgrade-details branch from d16e997 to 838f19b Compare October 5, 2023 11:36

ycombinator mentioned this pull request Oct 5, 2023

Refactoring HTTP downloader progress reporter to accept multiple observers #3542

Merged

7 tasks

ycombinator force-pushed the track-upgrade-details branch 3 times, most recently from 1a8147d to a2d4a68 Compare October 13, 2023 00:47

ycombinator force-pushed the track-upgrade-details branch from a2d4a68 to fa15191 Compare October 13, 2023 12:45

ycombinator commented Oct 13, 2023

internal/pkg/agent/application/coordinator/coordinator.go (outdated)

faec reviewed Oct 13, 2023

internal/pkg/agent/application/coordinator/coordinator.go (outdated)

ycombinator force-pushed the track-upgrade-details branch from b02c38e to 7346a5e Compare October 13, 2023 19:00

cmacknz reviewed Oct 13, 2023

ycombinator requested review from pchila and faec October 13, 2023 22:45

ycombinator added 25 commits October 19, 2023 12:45

Fixing booboos introduced during conflict resolution

0673107

Add unit test

a5489d6

Add assertion on error

1d76472

Add comment on stateNeedsRefresh

898ba0e

Add comment linking to Fleet Server OpenAPI spec for UPG_* values

761e2fe

Use public accessor for setting upgrade details on coordinator to prevent data race

5df99f3

Use buffered channel for upgradeDetailsChan in test so test can run in single goroutine

a479063

Fixing unit test

8fe8499

Add mutex to prevent data race

94cc5ab

Clarify assertion's intent

e7d3401

Make copy of details before notifying observer with it.

413959c

Add setter for setting download percent

11dca0b

Remove unnecessary struct tags

52fa242

Change mutex type

3029e0f

Document FailedState and ErrorMsg fields

8eae498

Track download rate as well

3ac886c

Change data type of time field

db2eb60

Rename struct to avoid stutter in naming

7d20b9f

Log upgrade details when they change

abefefa

Add nil guard

492d195

Setting logger in test

6db2a99

Use sentinel value for encoding +Inf download rate in JSON

29afff0

Fix up comment

6b3719b

Set omitempty on failed_state and error_msg

26d2dd6

Add units to download rate

52987e7

ycombinator force-pushed the track-upgrade-details branch from dcfec79 to 52987e7 Compare October 19, 2023 19:46

Fixing test after conflicts

b8d81a0

ycombinator merged commit 92acf08 into elastic:main Oct 19, 2023
7 of 8 checks passed

ycombinator deleted the track-upgrade-details branch October 19, 2023 23:37
