Fix double stop components #3482

pchila · 2023-09-28T13:00:16Z

What does this PR do?

This PR prevents the agent from stopping components that are already stopped.

Why is it important?

Calling stop on an already stopped component may have unforeseen effects, especially if an uninstall is involved.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files~~
~~[ ] I have added tests that prove my fix is effective or that my feature works~~
I have added an entry in ./changelog/fragments using the changelog tool
~~[ ] I have added an integration test or an E2E test~~

Author's Checklist

[ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

Questions to ask yourself

How are we going to support this in production?
How are we going to measure its adoption?
How are we going to debug this?
What are the metrics I should take care of?
...

mergify · 2023-09-28T13:00:50Z

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

elasticmachine · 2023-09-28T13:06:40Z

💚 Build Succeeded

Build stats

Start Time: 2023-10-23T08:07:57.716+0000
Duration: 27 min 18 sec

Test stats 🧪

Test	Results
Failed	0
Passed	6577
Skipped	59
Total	6636

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages.
run integration tests : Run the Elastic Agent Integration tests.
run end-to-end tests : Generate the packages and run the E2E Tests.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine · 2023-09-28T15:49:06Z

🌐 Coverage report

Name	Metrics % (`covered/total`)	Diff
Packages	98.824% (`84/85`)	👍
Files	67.105% (`204/304`)	👍
Classes	66.19% (`370/559`)	👍
Methods	53.583% (`1174/2191`)	👍
Lines	39.804% (`13775/34607`)	👍 0.037
Conditionals	100.0% (`0/0`)	💚

elasticmachine · 2023-09-28T16:02:29Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

cmacknz · 2023-09-28T18:13:07Z

pkg/component/runtime/service.go

+			s.log.Debugf("service %s is already stopping: skipping...")
+			return
+		}
+		stopping = true


It doesn't seem right that this is never set back to false. What would happen if we stopped the service and then tried to start it again? That could happen when uninstalling the endpoint integration and then quickly reinstalling it, or perhaps when reassigning the agent to a different policy.

The ignoreCheckins flag on this path is set to false again in the case actionStart: block.

pchila · 2023-09-28

When discussing the fix with @blakerouse, I asked about recycling a service runtime object, and we came to the conclusion that a stopped component is ultimately discarded (removed from the maps of running components in the runtime manager) after it eventually stops.

If this assumption does not hold, then we need to keep track of the installation status. If the object gets recycled, we need to wait for the ongoing uninstall to finish, then reinstall it (this is done as part of start()) and set the component back up properly.

@blakerouse any thoughts?

cmacknz · 2023-09-29

There is code in this file that would only be executed if the object is not discarded, so at minimum the behavior needs to be confirmed and then documented. It may be that your and Blake's analysis is correct, but it isn't obvious to me and others.

blakerouse · 2023-10-02

It is correct. Once a manager is stopped it is never restarted. This follows the same pattern as https://github.com/elastic/elastic-agent/blob/main/pkg/component/runtime/runtime.go#L190: once set, it is not unset.

cmacknz · 2023-10-16

If this is the case we should add a comment explaining this.
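The lifecycle described in this thread can be illustrated with a minimal sketch. It is a simplified model, not the agent's code: `manager`, `runtimeState`, `start`, and `stopAndDiscard` are hypothetical names, and the flag uses the standard library's `sync/atomic.Bool` as an assumption. It shows why a one-way flag is safe as long as a stopped object is discarded and a restart always creates a fresh one.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// runtimeState is a hypothetical stand-in for the per-component runtime state.
type runtimeState struct {
	id           string
	shuttingDown atomic.Bool // one-way flag: set on stop, never cleared
}

// stop marks the state as shutting down. The flag is intentionally never reset.
func (s *runtimeState) stop() {
	s.shuttingDown.Store(true)
}

// manager is a hypothetical, simplified model of a runtime manager that
// tracks running components by ID.
type manager struct {
	current map[string]*runtimeState
}

func newManager() *manager {
	return &manager{current: make(map[string]*runtimeState)}
}

// start always creates a brand-new state for the component and registers it.
func (m *manager) start(id string) *runtimeState {
	s := &runtimeState{id: id}
	m.current[id] = s
	return s
}

// stopAndDiscard stops the component and removes it from the map, so the
// stopped object is never reused.
func (m *manager) stopAndDiscard(id string) {
	if s, ok := m.current[id]; ok {
		s.stop()
		delete(m.current, id)
	}
}

func main() {
	m := newManager()
	first := m.start("endpoint")
	m.stopAndDiscard("endpoint")
	second := m.start("endpoint") // e.g. a quick reinstall of the same integration

	// The old object stays flagged; the new one starts with a clear flag.
	fmt.Println(first.shuttingDown.Load(), second.shuttingDown.Load(), first != second)
	// prints: true false true
}
```

If a stopped object were ever put back into the map and reused, the stale flag would make every later stop a no-op, which is the scenario the review asks to rule out and document.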

pchila · 2023-09-29T07:11:37Z

buildkite test this

blakerouse · 2023-10-02T13:35:09Z

pkg/component/runtime/manager.go

@@ -705,6 +705,12 @@ func (m *Manager) update(model component.Model, teardown bool) error {
 		var stoppedWg sync.WaitGroup
 		stoppedWg.Add(len(stop))
 		for _, existing := range stop {
+			if existing.getLatest().State == client.UnitStateStopped {


I think instead of making this check here this could be changed to use the atomic that the componentRuntimeState is already setting.

elastic-agent/pkg/component/runtime/runtime.go

Line 190 in d316ee0

s.shuttingDown.Store(true)

I think changing line 190 in runtime.go to check whether the flag is already set, and do nothing if so, would result in the same behavior, while removing the need to hold the lock that is held when getLatest() is called.

pchila · 2023-10-02

done in b233070
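For context on the loop in the diff above: it calls `stoppedWg.Add(len(stop))` before iterating, so any iteration that skips a component has to balance that count itself, otherwise the final wait never returns. The following is a rough sketch of that bookkeeping only. The `component` type and `stopAll` function are hypothetical and do not mirror the manager's real types.

```go
package main

import (
	"fmt"
	"sync"
)

// component is a hypothetical stand-in for a managed component.
type component struct {
	id      string
	stopped bool
}

// stopAll stops every component that is not already stopped, waits for all
// of them, and returns the IDs that were actually stopped by this call.
func stopAll(components []*component) []string {
	var (
		mu      sync.Mutex
		stopped []string
		wg      sync.WaitGroup
	)
	wg.Add(len(components)) // counted up front, as in the diff
	for _, c := range components {
		if c.stopped {
			wg.Done() // balance the Add above, or Wait would block forever
			continue
		}
		go func(c *component) {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			c.stopped = true
			stopped = append(stopped, c.id)
		}(c)
	}
	wg.Wait()
	return stopped
}

func main() {
	comps := []*component{
		{id: "a"},
		{id: "b", stopped: true}, // already stopped: skipped
		{id: "c"},
	}
	fmt.Println(len(stopAll(comps))) // prints: 2
}
```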

blakerouse · 2023-10-02T15:45:36Z

pkg/component/runtime/manager.go

+				stoppedWg.Done()
+				continue
+			}
+			m.logger.Debugf("Stopping component %q", existing.id)


Why do it here? Why not do it inside existing.stop()? That seems safer: if the code is updated somewhere else to call existing.stop(), the same logic would apply. It seems odd to check the flag outside of the function that sets it.

pchila · 2023-10-03

If you are referring to the log, it's there because that's where I landed when initially investigating the bug.
If you are referring to the stop() itself, the initial fix was only here in the manager, without any modification to the service runtime object.
Would you prefer to have it only for a service runtime object?

blakerouse · 2023-10-03

@pchila No, I am referring to the function at https://github.com/elastic/elastic-agent/blob/main/pkg/component/runtime/runtime.go#L189 that is being called from https://github.com/elastic/elastic-agent/blob/main/pkg/component/runtime/manager.go#L708 (which is what you're changing here).

This runtime wraps the actual managers, so this change will apply to all managers.

I think the function could easily be changed to:

func (s *componentRuntimeState) stop(teardown bool, signed *component.Signed) error {
	if s.shuttingDown.Load() {
		// already stopping
		return nil
	}
	s.shuttingDown.Store(true)
	if teardown {
		return s.runtime.Teardown(signed)
	}
	return s.runtime.Stop()
}

pchila · 2023-10-16

done in 2606c30
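To see the effect of guarding inside stop() rather than at the call site, here is a self-contained sketch built around the function suggested above. It is an approximation under stated assumptions: the `signed *component.Signed` parameter is dropped, `stoppable` and `countingRuntime` are hypothetical test doubles, and the flag is the standard library's `sync/atomic.Bool`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// stoppable is a hypothetical, trimmed-down version of what the wrapped
// runtime needs to offer for this sketch.
type stoppable interface {
	Stop() error
	Teardown() error
}

// countingRuntime is a hypothetical fake that records how often it is called.
type countingRuntime struct {
	stops     int
	teardowns int
}

func (r *countingRuntime) Stop() error {
	r.stops++
	return nil
}

func (r *countingRuntime) Teardown() error {
	r.teardowns++
	return nil
}

// componentRuntimeState wraps a runtime and guards it with a one-way flag.
type componentRuntimeState struct {
	shuttingDown atomic.Bool
	runtime      stoppable
}

// stop forwards to the wrapped runtime only on the first call; every later
// call, from any call site, returns nil without touching the runtime.
func (s *componentRuntimeState) stop(teardown bool) error {
	if s.shuttingDown.Load() {
		// already stopping
		return nil
	}
	s.shuttingDown.Store(true)
	if teardown {
		return s.runtime.Teardown()
	}
	return s.runtime.Stop()
}

func main() {
	rt := &countingRuntime{}
	s := &componentRuntimeState{runtime: rt}

	_ = s.stop(false) // reaches the runtime
	_ = s.stop(false) // skipped
	_ = s.stop(true)  // also skipped: the flag is already set

	fmt.Println(rt.stops, rt.teardowns) // prints: 1 0
}
```

One consequence worth noting in this sketch is that a teardown request arriving after a plain stop is also ignored, because the guard does not distinguish between the two.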

mergify · 2023-10-05T13:41:12Z

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-double-stop-components upstream/fix-double-stop-components
git merge upstream/main
git push upstream fix-double-stop-components

changelog/fragments/1695920792-Prevent-multiple-stops-of-services.yaml

testing/integration/endpoint_security_test.go

AndersonQ

Any chance you could add a test?

blakerouse

Looks good. Thanks for all the changes.

cmacknz · 2023-10-16T20:59:12Z

pkg/component/runtime/runtime.go

+	if s.shuttingDown.Load() {
+		// already stopping
+		return nil
+	}


Can shuttingDown be set to false between this Load and the following Store?

michalpristas · 2023-10-17

From a component point of view it should be guarded, but from an agent architecture standpoint the stop calls should not be concurrent, so it should be safe to leave it as it is.
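If concurrent callers ever did become possible, the window between Load and Store could be closed by doing both in one atomic step. The sketch below is an alternative for illustration, not what the PR does. It assumes the standard library's `sync/atomic.Bool` and its `CompareAndSwap` method; the `state` type and `concurrentStops` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// state is a hypothetical component state guarded by a one-way flag.
type state struct {
	shuttingDown atomic.Bool
	stopCalls    atomic.Int32 // counts how often the real stop work ran
}

// stop performs the check and the set as a single atomic operation, so only
// the caller that flips the flag from false to true does the actual work.
func (s *state) stop() {
	if !s.shuttingDown.CompareAndSwap(false, true) {
		return // another caller already started stopping
	}
	s.stopCalls.Add(1) // stands in for the real Stop or Teardown call
}

// concurrentStops calls stop from n goroutines at once and reports how many
// of them actually performed the stop work.
func concurrentStops(n int) int32 {
	s := &state{}
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			s.stop()
		}()
	}
	wg.Wait()
	return s.stopCalls.Load()
}

func main() {
	fmt.Println(concurrentStops(100)) // prints: 1
}
```

With a separate Load followed by Store, two goroutines could both observe false and both proceed; the compare-and-swap lets exactly one of them win.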

pchila · 2023-10-17T08:21:17Z

Any chance you could add a test?

@AndersonQ I thought of that, but to reproduce this precise timing I need a unit test or a "narrow" integration test.
If the fix were in the runtime manager, as it was initially, that might have been easier. For the service runtime I will have to check whether I can figure something out without a big refactor just for testing.

pchila · 2023-10-17T08:21:26Z

buildkite test this

elastic-sonarqube · 2023-10-23T08:18:59Z

SonarQube Quality Gate

0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell

41.7% Coverage
0.0% Duplication

pchila added the bug and Team:Elastic-Agent-Control-Plane labels Sep 28, 2023

pchila self-assigned this Sep 28, 2023

mergify bot added the backport-skip label Sep 28, 2023

pchila force-pushed the fix-double-stop-components branch from f24d064 to cad774f Compare September 28, 2023 13:03

pchila marked this pull request as ready for review September 28, 2023 16:02

pchila requested a review from a team as a code owner September 28, 2023 16:02

pchila requested review from AndersonQ and michalpristas September 28, 2023 16:02

pchila requested a review from blakerouse September 28, 2023 16:02

pchila force-pushed the fix-double-stop-components branch from f86ce29 to 8c25616 Compare September 28, 2023 17:16

cmacknz reviewed Sep 28, 2023

blakerouse reviewed Oct 2, 2023

cmacknz added the backport-v8.11.0 label Oct 4, 2023

mergify bot removed the backport-skip label Oct 4, 2023

AndersonQ reviewed Oct 11, 2023

pchila added 4 commits October 16, 2023 15:02

Skip stopping already stopped components

e5eeb1f

add a stopping flag in service runtime to prevent concurrent stop calls

9d76c08

add changelog

a6d5921

use shuttingDown atomic bool instead of component state

4d4ae1e

pchila force-pushed the fix-double-stop-components branch from b233070 to 4d4ae1e Compare October 16, 2023 13:02

integrate PR review comments

2606c30

AndersonQ reviewed Oct 16, 2023

blakerouse approved these changes Oct 16, 2023

cmacknz reviewed Oct 16, 2023

michalpristas approved these changes Oct 17, 2023

AndersonQ approved these changes Oct 17, 2023

Add comments

21564f6

pchila merged commit 97d9c80 into elastic:main Oct 23, 2023
8 checks passed

mergify bot pushed a commit that referenced this pull request Oct 23, 2023

Fix double stop components (#3482)

db44424

* Skip stopping already stopped components (cherry picked from commit 97d9c80)

mergify bot mentioned this pull request Oct 23, 2023

[8.11](backport #3482) Fix double stop components #3648

Merged

pchila added a commit that referenced this pull request Oct 24, 2023

Fix double stop components (#3482) (#3648)

32a1bfc

* Skip stopping already stopped components (cherry picked from commit 97d9c80) Co-authored-by: Paolo Chilà <[email protected]>

kilfoyle mentioned this pull request Nov 7, 2023

Add Fleet & Agent 8.11.0 Release Notes elastic/ingest-docs#580

Merged
