Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework runtime manager updates to block the coordinator less #3747

Merged
merged 10 commits into from
Nov 16, 2023

Conversation

faec
Copy link
Contributor

@faec faec commented Nov 10, 2023

What does this PR do?

Fix #3738, which happens when Coordinator and runtime.Manager block on a policy update and state reporting gets backed up.

Summary of changes:

  • Give the runtime manager an explicit run loop in its own function instead of managing all changes via mutex
  • Manager.Update now sends the component model update to the runtime manager loop via a channel
  • Component model updates are triggered asynchronously from the runtime manager loop, so neither Coordinator nor the runtime manager block their communication while an update is in progress
  • Results from the update process are sent back to Coordinator on the runtime manager's (previously unused) error channel instead of returning directly from Update
    • on the Coordinator side, the internal variables for the runtime manager error and the update error have been merged
  • The runtime manager now uses a done channel to signal that it is shutting down and will no longer accept incoming requests
  • (adjacent cleanup) some unneeded / no longer needed synchronization handling has been removed from the runtime manager (updateMx which is no longer needed since (*Manager).runLoop ensures update calls do not conflict, netMx which was only used to get the tcp listener port, and some startup / shutdown atomics)

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Related issues

@faec faec added bug Something isn't working Team:Elastic-Agent Label for the Agent team labels Nov 10, 2023
@faec faec self-assigned this Nov 10, 2023
@faec faec marked this pull request as ready for review November 10, 2023 23:21
@faec faec requested a review from a team as a code owner November 10, 2023 23:21
@faec faec requested review from blakerouse and pchila November 10, 2023 23:21
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Copy link
Contributor

mergify bot commented Nov 10, 2023

This pull request does not have a backport label. Could you fix it @faec? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@faec faec added skip-changelog backport-v8.11.0 Automated backport with mergify and removed backport-skip labels Nov 10, 2023
@elasticmachine
Copy link
Contributor

elasticmachine commented Nov 11, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-11-16T17:31:34.988+0000

  • Duration: 14 min 17 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good, I think that the suggestion I left should take carel of the deadlock we have in unit test ;)

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good.

I like the removal of the locks and the fact that the error now comes through the errCh. Unit testing has good coverage with the change as well.

@faec
Copy link
Contributor Author

faec commented Nov 13, 2023

There's still an issue with this PR -- in manual testing it gets stuck forever during the install command, enrollment reaches Kibana but is never properly reported Agent-side. That might also be why the integration tests seem stalled without a result on CI. Investigating.

@faec
Copy link
Contributor Author

faec commented Nov 15, 2023

Found the problem: Coordinator blocks to tell the runtime manager about the new update, and the runtime manager blocks to tell Coordinator about the result of the update. Usually these two aren't in conflict, but during an install two Update calls happen right after each other, so both objects end up waiting on each other.

I still ran the update logic synchronously on the runtime manager goroutine with the idea that it was a "less drastic change" that could be done in a smaller first-pass PR, but I think I just need to fix the blocking entirely.

@faec
Copy link
Contributor Author

faec commented Nov 15, 2023

New version is ready that fixes the full issue instead of just mitigating it. I had to add some more state handling variables and code but I've tried to comment thoroughly, and this should prevent any number of policy changes from ever blocking state update reporting.

Copy link
Contributor

@ycombinator ycombinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Very easy to follow the new implementation and thanks for the adjacent cleanup as well.

Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

@faec faec merged commit 112f618 into elastic:main Nov 16, 2023
7 checks passed
@faec faec deleted the blocked-runtime-updates branch November 16, 2023 22:39
mergify bot pushed a commit that referenced this pull request Nov 16, 2023
(cherry picked from commit 112f618)

# Conflicts:
#	internal/pkg/agent/application/coordinator/coordinator_state.go
AndersonQ pushed a commit that referenced this pull request Nov 28, 2023
…ordinator less (#3785)

* Rework runtime manager updates to block the coordinator less (#3747)

(cherry picked from commit 112f618)

# Conflicts:
#	internal/pkg/agent/application/coordinator/coordinator_state.go

* fix broken merge

---------

Co-authored-by: Fae Charlton <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment