Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix double stop components #3482

Merged
merged 6 commits into from
Oct 23, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# Kind can be one of:
# - breaking-change: a change to previously-documented behavior
# - deprecation: functionality that is being removed in a later release
# - bug-fix: fixes a problem in a previous version
# - enhancement: extends functionality but does not break or fix existing behavior
# - feature: new functionality
# - known-issue: problems that we are aware of in a given version
# - security: impacts on the security of a product or a user’s deployment.
# - upgrade: important information for someone upgrading from a prior version
# - other: does not fit into any of the other categories
kind: bug-fix

# Change summary; a 80ish characters long description of the change.
summary: Prevent multiple attempts to stop an already stopped service
AndersonQ marked this conversation as resolved.
Show resolved Hide resolved

# Long description; in case the summary is not enough to describe the change
# this field accommodate a description without length limits.
# NOTE: This field will be rendered only for breaking-change and known-issue kinds at the moment.
#description:

# Affected component; a word indicating the component this changeset affects.
component: runtime

# PR URL; optional; the PR number that added the changeset.
# If not present is automatically filled by the tooling finding the PR where this changelog fragment has been added.
# NOTE: the tooling supports backports, so it's able to fill the original PR number instead of the backport PR number.
# Please provide it if you are adding a fragment for a different PR.
#pr: https://github.com/owner/repo/1234

# Issue URL; optional; the GitHub issue related to this changeset (either closes or is part of).
# If not present is automatically filled by the tooling with the issue linked to the PR number.
#issue: https://github.com/owner/repo/1234
3 changes: 3 additions & 0 deletions pkg/component/runtime/manager.go
Original file line number Diff line number Diff line change
Expand Up @@ -705,6 +705,7 @@ func (m *Manager) update(model component.Model, teardown bool) error {
var stoppedWg sync.WaitGroup
stoppedWg.Add(len(stop))
for _, existing := range stop {
m.logger.Debugf("Stopping component %q", existing.id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do it here? Why not do it inside of existing.stop(? Seems safer, that way if the code is updated some where else to call existing.stop( the same logic would apply. Seems weird that it would be checked outside of the function that sets it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are referring to the log, it's there because that's where I landed when investigating the bug initially.
If you are referring to the stop() itself, the initial fix was only here in the manager without any modification to the service runtime object.
Would you prefer to have only for a service runtime object ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pchila No I am referring to this line function https://github.com/elastic/elastic-agent/blob/main/pkg/component/runtime/runtime.go#L189 that is being called from https://github.com/elastic/elastic-agent/blob/main/pkg/component/runtime/manager.go#L708 (which is what your changing here).

This runtime wraps the actual managers. So this change will apply to all managers.

I thinking the function could easily be changed to:

func (s *componentRuntimeState) stop(teardown bool, signed *component.Signed) error {
	if s.shuttingDown.Load() {
                // already stopping
		return nil
        }
	s.shuttingDown.Store(true)
	if teardown {
		return s.runtime.Teardown(signed)
	}
	return s.runtime.Stop()
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 2606c30

_ = existing.stop(teardown, model.Signed)
// stop is async, wait for operation to finish,
// otherwise new instance may be started and components
Expand Down Expand Up @@ -755,6 +756,7 @@ func (m *Manager) waitForStopped(comp *componentRuntimeState) {
for {
latestState := comp.getLatest()
if latestState.State == client.UnitStateStopped {
m.logger.Debugf("component %q stopped.", compID)
return
}

Expand All @@ -767,6 +769,7 @@ func (m *Manager) waitForStopped(comp *componentRuntimeState) {

select {
case <-timeoutCh:
m.logger.Errorf("timeout exceeded waiting for component %q to stop", compID)
return
case <-time.After(stopCheckRetryPeriod):
}
Expand Down
4 changes: 4 additions & 0 deletions pkg/component/runtime/runtime.go
Original file line number Diff line number Diff line change
Expand Up @@ -187,6 +187,10 @@ func (s *componentRuntimeState) start() error {
}

func (s *componentRuntimeState) stop(teardown bool, signed *component.Signed) error {
if s.shuttingDown.Load() {
// already stopping
return nil
}
Comment on lines +190 to +193
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can shuttingDown be set to false between this Load and the following Store?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from a component point of view it should be guarded, but from agent architecture standpoint it should not be concurrent and should be safe to have it as it is

s.shuttingDown.Store(true)
if teardown {
return s.runtime.Teardown(signed)
Expand Down
10 changes: 10 additions & 0 deletions pkg/component/runtime/service.go
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ var (
type executeServiceCommandFunc func(ctx context.Context, log *logger.Logger, binaryPath string, spec *component.ServiceOperationsCommandSpec) error

// serviceRuntime provides the command runtime for running a component as a service.
// an instance of serviceRuntime is not reused: after being stopped, it cannot be started again.
type serviceRuntime struct {
comp component.Component
log *logger.Logger
Expand Down Expand Up @@ -124,6 +125,8 @@ func (s *serviceRuntime) Run(ctx context.Context, comm Communicator) (err error)
lastCheckin time.Time
missedCheckins int
tearingDown bool
// flag that signals if we are already stopping
stopping bool
ignoreCheckins bool
)

Expand All @@ -136,6 +139,13 @@ func (s *serviceRuntime) Run(ctx context.Context, comm Communicator) (err error)
defer cisStop()

onStop := func(am actionMode) {
if stopping {
s.log.Debugf("service %s is already stopping: skipping...", s.name())
return
}
// the flag is set once and never reset since the serviceRuntime object
// is not supposed to be reused once it's stopping
stopping = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That this is never set to false doesn't seem right. What would happen if we stopped the service and then tried to start it again? This would be uninstalling the endpoint integration and then reinstalling it quickly, or perhaps reassigning the agent to a different policy.

The ignoreCheckins on this path is set to false again in the case actionStart: block.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When discussing the fix with @blakerouse , I asked about recycling a service runtime object and we came to the conclusion that a stopped component is ultimately discarded (removed from the maps of running components in the runtime manager) after it eventually stops.

If this assumption does not hold, then we need to keep track of the installation status and if the object gets recycled we need to wait for the ongoing uninstall to finish, then reinstall it (it is done as part of the start()) and set up the component back properly.

@blakerouse any thoughts ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is code in this file that would only be executed if the object is not discarded, so at minimum the behavior needs to be confirmed and then documented. It may be that your and Blake's analysis is correct, but it isn't obvious to myself and others.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is correct. Once a manager is stopped it is never restarted again. This follows the same pattern - https://github.com/elastic/elastic-agent/blob/main/pkg/component/runtime/runtime.go#L190; once set it is not unset.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the case we should add a comment explaining this.

// Stop check-in timer
s.log.Debugf("stop check-in timer for %s service", s.name())
checkinTimer.Stop()
Expand Down
Loading