Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Meta] Audit concurrency handling / increase unit test coverage in Agent #3040

Open
faec opened this issue Jul 7, 2023 · 1 comment
Open
Labels
enhancement New feature or request Team:Elastic-Agent Label for the Agent team

Comments

@faec
Copy link
Contributor

faec commented Jul 7, 2023

This is a meta-issue for the work that started in #2736, #2849 and https://github.com/elastic/ingest-dev/issues/1936. The overall goal is to identify and address risky concurrency patterns in the Agent codebase, and to increase reliability and test coverage by migrating to more testable internal API patterns.

Initial work focused on Coordinator and its state handling, and as a result we have greatly reduced the amount of locking needed within Coordinator, and simplified the testing model while still covering many more scenarios (#2868). This has both increased short-term reliability and clarified issues with Agent's current state handling that require followup (#2789, #2852, #2887).

The intention now is to similarly audit other components, starting with the most urgent/sensitive and proceeding as time and triage allows. We particularly want to focus on:

  • Blocking calls from important components / execution threads. These are risky and can escalate the impact of more minor bugs. When possible, control routines should be non-blocking.
  • Excess / nested mutex use, or golang channels with unclear owners. These often arise from spreading ownership of control logic across several execution threads, which makes race conditions more likely and prevents effective unit testing.
  • Goroutines without a clearly defined owner / lifecycle. Most goroutines longer than a few lines should be named functions with a clear mechanism for startup/shutdown. (Currently Agent has a fair number of relatively complex anonymous goroutines whose lifecycle is implicit/undocumented.) When creating an object it should always be clear what execution threads they create and how to safely communicate with them.

Components that still need to be examined:

  • The runtime package (issue)
  • UpgradeManager (issue)
  • The Fleet configuration manager (managedConfigManager)
  • ActionDispatcher, ActionQueue and the Fleet actions framework
  • The variables manager (composable.controller)
@faec faec added enhancement New feature or request Team:Elastic-Agent Label for the Agent team labels Jul 7, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

No branches or pull requests

2 participants