-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework runtime manager updates to block the coordinator less #3747
Conversation
Pinging @elastic/elastic-agent (Team:Elastic-Agent) |
This pull request does not have a backport label. Could you fix it @faec? 🙏
NOTE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good, I think that the suggestion I left should take carel of the deadlock we have in unit test ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good.
I like the removal of the locks and the fact that the error now comes through the errCh. Unit testing has good coverage with the change as well.
There's still an issue with this PR -- in manual testing it gets stuck forever during the |
Found the problem: Coordinator blocks to tell the runtime manager about the new update, and the runtime manager blocks to tell Coordinator about the result of the update. Usually these two aren't in conflict, but during an install two Update calls happen right after each other, so both objects end up waiting on each other. I still ran the update logic synchronously on the runtime manager goroutine with the idea that it was a "less drastic change" that could be done in a smaller first-pass PR, but I think I just need to fix the blocking entirely. |
New version is ready that fixes the full issue instead of just mitigating it. I had to add some more state handling variables and code but I've tried to comment thoroughly, and this should prevent any number of policy changes from ever blocking state update reporting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Very easy to follow the new implementation and thanks for the adjacent cleanup as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Co-authored-by: Shaunak Kashyap <[email protected]>
Co-authored-by: Shaunak Kashyap <[email protected]>
…nt into blocked-runtime-updates
SonarQube Quality Gate |
(cherry picked from commit 112f618) # Conflicts: # internal/pkg/agent/application/coordinator/coordinator_state.go
…ordinator less (#3785) * Rework runtime manager updates to block the coordinator less (#3747) (cherry picked from commit 112f618) # Conflicts: # internal/pkg/agent/application/coordinator/coordinator_state.go * fix broken merge --------- Co-authored-by: Fae Charlton <[email protected]> Co-authored-by: Pierre HILBERT <[email protected]>
What does this PR do?
Fix #3738, which happens when
Coordinator
andruntime.Manager
block on a policy update and state reporting gets backed up.Summary of changes:
Manager.Update
now sends the component model update to the runtime manager loop via a channelCoordinator
nor the runtime manager block their communication while an update is in progressCoordinator
on the runtime manager's (previously unused) error channel instead of returning directly fromUpdate
Coordinator
side, the internal variables for the runtime manager error and the update error have been mergedupdateMx
which is no longer needed since(*Manager).runLoop
ensuresupdate
calls do not conflict,netMx
which was only used to get the tcp listener port, and some startup / shutdown atomics)Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration filesI have added an entry in./changelog/fragments
using the changelog toolI have added an integration test or an E2E testRelated issues