-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix double stop components #3482
Conversation
This pull request does not have a backport label. Could you fix it @pchila? 🙏
NOTE: |
f24d064
to
cad774f
Compare
🌐 Coverage report
|
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
f86ce29
to
8c25616
Compare
s.log.Debugf("service %s is already stopping: skipping...") | ||
return | ||
} | ||
stopping = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That this is never set to false doesn't seem right. What would happen if we stopped the service and then tried to start it again? This would be uninstalling the endpoint integration and then reinstalling it quickly, or perhaps reassigning the agent to a different policy.
The ignoreCheckins
on this path is set to false again in the case actionStart:
block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When discussing the fix with @blakerouse , I asked about recycling a service runtime object and we came to the conclusion that a stopped component is ultimately discarded (removed from the maps of running components in the runtime manager) after it eventually stops.
If this assumption does not hold, then we need to keep track of the installation status and if the object gets recycled we need to wait for the ongoing uninstall to finish, then reinstall it (it is done as part of the start()) and set up the component back properly.
@blakerouse any thoughts ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is code in this file that would only be executed if the object is not discarded, so at minimum the behavior needs to be confirmed and then documented. It may be that your and Blake's analysis is correct, but it isn't obvious to myself and others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is correct. Once a manager is stopped it is never restarted again. This follows the same pattern - https://github.com/elastic/elastic-agent/blob/main/pkg/component/runtime/runtime.go#L190; once set it is not unset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the case we should add a comment explaining this.
buildkite test this |
pkg/component/runtime/manager.go
Outdated
@@ -705,6 +705,12 @@ func (m *Manager) update(model component.Model, teardown bool) error { | |||
var stoppedWg sync.WaitGroup | |||
stoppedWg.Add(len(stop)) | |||
for _, existing := range stop { | |||
if existing.getLatest().State == client.UnitStateStopped { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of making this check here this could be changed to use the atomic that the componentRuntimeState
is already setting.
s.shuttingDown.Store(true) |
I think change line 190 in runtime.go
to check if already set and do nothing would result in the same behavior. Removing the need to hold the lock that is being held when getLatest()
is being called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in b233070
stoppedWg.Done() | ||
continue | ||
} | ||
m.logger.Debugf("Stopping component %q", existing.id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do it here? Why not do it inside of existing.stop(
? Seems safer, that way if the code is updated some where else to call existing.stop(
the same logic would apply. Seems weird that it would be checked outside of the function that sets it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are referring to the log, it's there because that's where I landed when investigating the bug initially.
If you are referring to the stop() itself, the initial fix was only here in the manager without any modification to the service runtime object.
Would you prefer to have only for a service runtime object ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pchila No I am referring to this line function https://github.com/elastic/elastic-agent/blob/main/pkg/component/runtime/runtime.go#L189 that is being called from https://github.com/elastic/elastic-agent/blob/main/pkg/component/runtime/manager.go#L708 (which is what your changing here).
This runtime wraps the actual managers. So this change will apply to all managers.
I thinking the function could easily be changed to:
func (s *componentRuntimeState) stop(teardown bool, signed *component.Signed) error {
if s.shuttingDown.Load() {
// already stopping
return nil
}
s.shuttingDown.Store(true)
if teardown {
return s.runtime.Teardown(signed)
}
return s.runtime.Stop()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 2606c30
This pull request is now in conflicts. Could you fix it? 🙏
|
b233070
to
4d4ae1e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any chance you could add a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks for all the changes.
if s.shuttingDown.Load() { | ||
// already stopping | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can shuttingDown
be set to false between this Load
and the following Store
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from a component point of view it should be guarded, but from agent architecture standpoint it should not be concurrent and should be safe to have it as it is
@AndersonQ I thought of that but for this precise timing I need a unit or a "narrow" integration test. |
buildkite test this |
SonarQube Quality Gate |
* Skip stopping already stopped components (cherry picked from commit 97d9c80)
* Skip stopping already stopped components (cherry picked from commit 97d9c80) Co-authored-by: Paolo Chilà <[email protected]>
What does this PR do?
This PR prevents agent from stopping already stopped components.
Why is it important?
Calling stop on an already stopped component may have unforeseen effects especially if uninstall is involved
Checklist
[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files[ ] I have added tests that prove my fix is effective or that my feature works./changelog/fragments
using the changelog tool[ ] I have added an integration test or an E2E testAuthor's Checklist
How to test this PR locally
Related issues
Use cases
Screenshots
Logs
Questions to ask yourself