[Meta] Audit concurrency handling / increase unit test coverage in Agent #3040

faec · 2023-07-07T20:55:16Z

This is a meta-issue for the work that started in #2736, #2849 and https://github.com/elastic/ingest-dev/issues/1936. The overall goal is to identify and address risky concurrency patterns in the Agent codebase, and to increase reliability and test coverage by migrating to more testable internal API patterns.

Initial work focused on Coordinator and its state handling, and as a result we have greatly reduced the amount of locking needed within Coordinator, and simplified the testing model while still covering many more scenarios (#2868). This has both increased short-term reliability and clarified issues with Agent's current state handling that require followup (#2789, #2852, #2887).

The intention now is to similarly audit other components, starting with the most urgent/sensitive and proceeding as time and triage allows. We particularly want to focus on:

Blocking calls from important components / execution threads. These are risky and can escalate the impact of more minor bugs. When possible, control routines should be non-blocking.
Excess / nested mutex use, or golang channels with unclear owners. These often arise from spreading ownership of control logic across several execution threads, which makes race conditions more likely and prevents effective unit testing.
Goroutines without a clearly defined owner / lifecycle. Most goroutines longer than a few lines should be named functions with a clear mechanism for startup/shutdown. (Currently Agent has a fair number of relatively complex anonymous goroutines whose lifecycle is implicit/undocumented.) When creating an object it should always be clear what execution threads they create and how to safely communicate with them.

Components that still need to be examined:

The runtime package (issue)
UpgradeManager (issue)
The Fleet configuration manager (managedConfigManager)
ActionDispatcher, ActionQueue and the Fleet actions framework
The variables manager (composable.controller)

The text was updated successfully, but these errors were encountered:

elasticmachine · 2023-07-07T20:55:18Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

faec added enhancement New feature or request Team:Elastic-Agent Label for the Agent team labels Jul 7, 2023

This was referenced Jul 7, 2023

Clarify state reporting / communication in the runtime package #3041

Open

Revisit concurrency / lifecycle handling in UpgradeManager #3042

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Meta] Audit concurrency handling / increase unit test coverage in Agent #3040

[Meta] Audit concurrency handling / increase unit test coverage in Agent #3040

faec commented Jul 7, 2023 •

edited

Loading

elasticmachine commented Jul 7, 2023

[Meta] Audit concurrency handling / increase unit test coverage in Agent #3040

[Meta] Audit concurrency handling / increase unit test coverage in Agent #3040

Comments

faec commented Jul 7, 2023 • edited Loading

elasticmachine commented Jul 7, 2023

faec commented Jul 7, 2023 •

edited

Loading