-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[cmd/opampsupervisor]: Supervisor waits for configurable healthchecks to report remote config status #34907
Conversation
… to report remote config status
…llector-contrib into issue_21079
…llector-contrib into issue_21079
Looking into unit-test failures |
@srikanthccv With #35488 we're going to have the ability for the Collector to report its own health through the OpAMP extension. Do you think we could use this to get rid of the |
I have a question. Let's consider a scenario involving asynchronous errors resulting from a configuration update. The supervisor has applied the new effective configuration and restarted the agent. The collector initially starts without issues, but shortly after, some component reports an asynchronous error, which causes the collector to shut down. In this case, should we categorize this as a successful or failed configuration update? would extension not report any status in this situation? |
I think this depends on the timeout duration. If it's within the timeout, it would be a failed configuration update, otherwise we would have already that the config was successfully updated and report the issue separately since we don't directly know whether it was related to the config update or not.
The healthcheck extension and OpAMP extension use the same health reporting mechanisms. Currently both only report errors that cause the Collector to be shutdown. |
Went through the service and collector initialization code for a better understanding. IIUC, In the case of an asynchronous error, the extension would initially report a healthy status, but then quickly switch to an unhealthy status during the shutdown process. Can we fully rely on the initial healthy status report? If we do, we might incorrectly flag an update as successful when it's actually failing. What do you think? |
We should be able to get rid of the |
That was my thought as well. I think we can mitigate the case you outlined if we do this. |
Great, I'll proceed with implementing the change. |
Please resolve conflicts and we'll get this merged |
… to report remote config status (open-telemetry#34907) **Description:** This pull request addresses the remote config status reporting issue discussed in open-telemetry#21079 by introducing the following options to the Agent config: 1. `config_apply_timeout`: config update is successful if we receive a healthy status and then observe no failure updates for the entire duration of the timeout period; otherwise, failure is reported. **Link to tracking Issue:** open-telemetry#21079 **Testing:** Added e2e test **Documentation:** <Describe the documentation added.>
… to report remote config status (open-telemetry#34907) **Description:** This pull request addresses the remote config status reporting issue discussed in open-telemetry#21079 by introducing the following options to the Agent config: 1. `config_apply_timeout`: config update is successful if we receive a healthy status and then observe no failure updates for the entire duration of the timeout period; otherwise, failure is reported. **Link to tracking Issue:** open-telemetry#21079 **Testing:** Added e2e test **Documentation:** <Describe the documentation added.>
Description:
This pull request addresses the remote config status reporting issue discussed in #21079 by introducing the following options to the Agent config:
config_apply_timeout
: config update is successful if we receive a healthy status and then observe no failure updates for the entire duration of the timeout period; otherwise, failure is reported.Link to tracking Issue: #21079
Testing: Added e2e test
Documentation: