Track upgrade details #3527
Conversation
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
This pull request is now in conflicts. Could you fix it? 🙏
d.mu.RLock()
defer d.mu.RUnlock()
Why do we use an RLock() here? Is it expected to have multiple concurrent NotifyObservers() calls (all the other functions' guards are a normal Lock())?
If NotifyObservers() cannot be called concurrently then we don't need an RWMutex.
There could be multiple observers registered. In such a case they would be called serially, in the loop inside the notifyObservers method.
However, even if there was only one single observer registered, the RLock call is needed to ensure that the internal state of d, e.g. d.State or d.Metadata, hasn't changed between the caller calling NotifyObservers() and the observer function eventually being called inside the notifyObserver method. Also, I used an RLock instead of a Lock because I'm okay with multiple reads of d's internal state while NotifyObservers() is executing; I just don't want a write to sneak in there, as explained earlier.
The goal I'm trying to achieve here is that for a given UpgradeDetails object, d, I want all its observers to see the same value of d when d.NotifyObservers() is called. Because an RLock() would prevent concurrent writes to the fields of d, I think my goal is achieved.
You're right that a regular Lock() would also achieve that goal, but it would be a stronger lock than I need. I don't necessarily care if there are two concurrent invocations of d.NotifyObservers() for the same d object; in fact, I would want both to be executed without the first one blocking the second one, which would happen if we used a Lock() here.
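For illustration, here is a minimal sketch of the locking pattern being discussed. The type and field names (Details, Metadata, Observer) are assumptions based on this thread, not the exact code in the PR:

```go
package details

import "sync"

// Observer is a callback invoked whenever the upgrade details change.
type Observer func(details *Details)

// Metadata holds auxiliary upgrade information (illustrative fields only).
type Metadata struct {
	DownloadPercent float64
}

// Details holds upgrade state shared between the upgrader and observers.
type Details struct {
	State    string
	Metadata Metadata

	mu        sync.RWMutex
	observers []Observer
}

// RegisterObserver adds an observer under the write lock.
func (d *Details) RegisterObserver(o Observer) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.observers = append(d.observers, o)
}

// NotifyObservers takes the read lock: several notifications may run
// concurrently, but writers (which take the write lock) cannot modify
// State or Metadata while the observers are being called.
func (d *Details) NotifyObservers() {
	d.mu.RLock()
	defer d.mu.RUnlock()
	d.notifyObservers()
}

// notifyObservers assumes the caller already holds d.mu.
func (d *Details) notifyObservers() {
	for _, o := range d.observers {
		o(d) // observers are called serially, in registration order
	}
}
```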
I think I am missing something here as I don't see other functions holding an RLock for reading state. If those reads happen without any lock at all, again there's no difference between RLock and a normal mutex.
Are you referring to notifyObservers() and notifyObserver()? I did not use any mutexes in there because otherwise we end up with a deadlock when d.SetState() is called. These are private methods, and the public methods that call them acquire/release the mutex.
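Continuing the sketch above, this is the deadlock being avoided: SetState takes the write lock and then calls the unlocked private helper. If notifyObservers also tried to acquire d.mu, the call would block forever, because Go mutexes are not reentrant. (The real SetState may look different.)

```go
// SetState updates the state under the write lock and then notifies
// observers via the lock-free private helper. If notifyObservers tried
// to take d.mu again here, this method would deadlock.
func (d *Details) SetState(s string) {
	d.mu.Lock()
	defer d.mu.Unlock()

	d.State = s
	d.notifyObservers()
}
```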
Multiple RLock() calls can be held simultaneously, so in the current code the notifications would run concurrently. Is this the desired result?
Yes, as explained in the previous comment.
If that's the case, did we write down somewhere that the observers' code MUST be thread-safe?
No, but you bring up an excellent point. Observers should receive a (deep) copy of the details object, not a pointer to it. The idea is that the observer receives a "snapshot-in-time" of the upgrade details at the time of observation. I have fixed this now in 5bbb507.
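One way to give each observer a snapshot rather than the live shared object is to hand it a copy inside the notification loop. This is only an illustration of the idea, not necessarily what commit 5bbb507 does:

```go
// notifyObservers hands each observer its own copy of the observable
// fields, so observers see a consistent snapshot and never share mutable
// state with the upgrader. The caller must hold d.mu.
func (d *Details) notifyObservers() {
	for _, o := range d.observers {
		snapshot := Details{State: d.State, Metadata: d.Metadata}
		o(&snapshot)
	}
}
```

If Metadata ever grows reference-typed fields (maps or slices), those would need to be cloned explicitly for the copy to be truly deep.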
It's not just about the manipulation of the UpgradeDetails; it's a warning for the implementers of the observers to protect their own state against concurrent calls...
Now I think I might be missing something 😄. Not a rhetorical question: why should we add such a warning? Isn't it generally true that any object that wants to support concurrent access may want to protect its own state?
@pchila and I chatted over Zoom about this. The decision was to provide a setter method on UpgradeDetails for setting the download percentage, as this was the only operation that was being externally synchronized with a call to the NotifyObservers() method. By providing this setter and implementing the synchronization internally within that setter, we no longer need or allow writers of UpgradeDetails to directly notify observers; that happens automatically under the hood when these writers update some state of UpgradeDetails via setters. This is a much cleaner and safer approach!
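A sketch of the approach that was settled on: a setter on the details object that both updates the download percentage under the lock and notifies observers itself, so callers never have to pair a manual write with a manual NotifyObservers() call. The method name SetDownloadProgress is an assumption:

```go
// SetDownloadProgress is the only way external code updates the download
// percentage; locking and observer notification happen inside, so callers
// cannot forget either step and NotifyObservers no longer needs to be
// exposed publicly.
func (d *Details) SetDownloadProgress(percent float64) {
	d.mu.Lock()
	defer d.mu.Unlock()

	d.Metadata.DownloadPercent = percent
	d.notifyObservers()
}
```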
@@ -182,6 +189,8 @@ func (u *Upgrader) Upgrade(ctx context.Context, version string, sourceURI string
	return nil, err
}

det.SetState(details.StateWatching)
Since the agent restarts essentially right after this, whether or not we ever see any of these transitions in Fleet depends on whether a checkin happens to occur before we restart.
We would be guaranteed to see this on the first checkin of the new agent version if it were persisted somewhere. I assume something like this will be done when we add support for the UPG_ROLLBACK state?
Although realistically, if the upgrade completes quickly and without error, then knowing anything about it isn't that interesting. It's the slow and failure cases that are interesting to see.
Yeah, I haven't yet thought through the implementation for UPG_ROLLBACK, but I expect we'll use the upgrade marker file to persist this state so we can communicate it from the Upgrade Watcher process to the main Agent process. We may want to expand on that idea and use the upgrade marker file to persist UPG_WATCHING as well; in fact, it will probably be the initial state in the upgrade marker file when it's created.
Longer term, it would be nice to persist all upgrade states in the upgrade marker file and use it to drive the upgrade state machine; that way, if an Agent restarts anywhere in the middle of an upgrade, it should be able to pick up from where it left off (assuming every or almost every step of the upgrade process is idempotent).
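As a rough illustration of that idea, the upgrade marker could carry the state alongside the version information it already records; these field names are illustrative, not the actual marker schema:

```go
// updateMarker is an illustrative shape for the on-disk upgrade marker
// with an upgrade state added, so the Upgrade Watcher and the main Agent
// process can both see where an upgrade left off after a restart.
type updateMarker struct {
	Version      string `yaml:"version"`
	PrevVersion  string `yaml:"prev_version"`
	UpgradeState string `yaml:"upgrade_state,omitempty"` // e.g. "UPG_WATCHING", "UPG_ROLLBACK"
}
```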
CI is failing because the
CI is now failing because of the following data race. Investigating...
What does this PR do?
This PR allows the Agent to internally track most of the important states of the upgrade process. The two states not being tracked as part of this PR are UPG_SCHEDULING and UPG_ROLLBACK; tracking for these will be added in follow-up PRs (see #3119 (comment) for rationale).

Implementation Details
This PR introduces a new package, github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/details. This package introduces two new types:
- State: to represent the various upgrade states, e.g. UPG_REQUESTED, UPG_DOWNLOADING, etc.
- Details: to encapsulate the upgrade state and other relevant upgrade details, e.g. the target version, any error upon failure, etc.

The Details type uses an Observer design pattern to notify any interested consumers of changes.

A new field, UpgradeDetails, and an accompanying setter method, setUpgradeDetails, are introduced on the coordinator.state struct. In the coordinator's Upgrade method, a new Details object is constructed and the new setter method is registered as an observer on this details object. As the agent progresses through (most of) the upgrade steps, the details object is updated. The Observer pattern then causes these updates to be reflected in coordinator.state.UpgradeDetails.
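To illustrate the wiring just described, here is a rough sketch of the coordinator side. The exact signatures of NewDetails, RegisterObserver, and the coordinator's Upgrade method are assumptions, as are the pared-down Coordinator fields:

```go
package coordinator

import (
	"context"
	"sync"

	"github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade/details"
)

// Coordinator is pared down to just the fields this sketch needs.
type Coordinator struct {
	upgradeMgr upgradeManager

	stateMu sync.Mutex
	state   State
}

// State mirrors only the coordinator state field relevant here.
type State struct {
	UpgradeDetails *details.Details
}

// upgradeManager stands in for the component that performs the upgrade.
type upgradeManager interface {
	Upgrade(ctx context.Context, version, sourceURI string, det *details.Details) error
}

// Upgrade constructs a Details object for this upgrade and registers the
// coordinator's setter as an observer, so every change to the details is
// mirrored into coordinator state.
func (c *Coordinator) Upgrade(ctx context.Context, version, sourceURI, actionID string) error {
	det := details.NewDetails(version, details.StateRequested, actionID)
	det.RegisterObserver(c.setUpgradeDetails)

	// The upgrader updates det as it moves through the upgrade steps; each
	// update triggers the observer below.
	return c.upgradeMgr.Upgrade(ctx, version, sourceURI, det)
}

// setUpgradeDetails records the observed details in coordinator state so
// they can be reported externally (e.g. to Fleet on checkin).
func (c *Coordinator) setUpgradeDetails(det *details.Details) {
	c.stateMu.Lock()
	defer c.stateMu.Unlock()
	c.state.UpgradeDetails = det
}
```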
Why is it important?
To lay the foundations for increasing visibility into the Agent upgrade process.
Checklist

- I have made corresponding changes to the documentation
- I have made corresponding changes to the default configuration files
- I have added an entry in ./changelog/fragments using the changelog tool
- I have added an integration test or an E2E test

Related issues