You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is a meta-issue for the work that started in #2736, #2849 and https://github.com/elastic/ingest-dev/issues/1936. The overall goal is to identify and address risky concurrency patterns in the Agent codebase, and to increase reliability and test coverage by migrating to more testable internal API patterns.
Initial work focused on Coordinator and its state handling, and as a result we have greatly reduced the amount of locking needed within Coordinator, and simplified the testing model while still covering many more scenarios (#2868). This has both increased short-term reliability and clarified issues with Agent's current state handling that require followup (#2789, #2852, #2887).
The intention now is to similarly audit other components, starting with the most urgent/sensitive and proceeding as time and triage allows. We particularly want to focus on:
Blocking calls from important components / execution threads. These are risky and can escalate the impact of more minor bugs. When possible, control routines should be non-blocking.
Excess / nested mutex use, or golang channels with unclear owners. These often arise from spreading ownership of control logic across several execution threads, which makes race conditions more likely and prevents effective unit testing.
Goroutines without a clearly defined owner / lifecycle. Most goroutines longer than a few lines should be named functions with a clear mechanism for startup/shutdown. (Currently Agent has a fair number of relatively complex anonymous goroutines whose lifecycle is implicit/undocumented.) When creating an object it should always be clear what execution threads they create and how to safely communicate with them.
This is a meta-issue for the work that started in #2736, #2849 and https://github.com/elastic/ingest-dev/issues/1936. The overall goal is to identify and address risky concurrency patterns in the Agent codebase, and to increase reliability and test coverage by migrating to more testable internal API patterns.
Initial work focused on
Coordinator
and its state handling, and as a result we have greatly reduced the amount of locking needed withinCoordinator
, and simplified the testing model while still covering many more scenarios (#2868). This has both increased short-term reliability and clarified issues with Agent's current state handling that require followup (#2789, #2852, #2887).The intention now is to similarly audit other components, starting with the most urgent/sensitive and proceeding as time and triage allows. We particularly want to focus on:
Components that still need to be examined:
runtime
package (issue)UpgradeManager
(issue)managedConfigManager
)ActionDispatcher
,ActionQueue
and the Fleet actions frameworkcomposable.controller
)The text was updated successfully, but these errors were encountered: