As a consequence of #1913, any change in the output configuration causes the Beats to restart. Since the agent is not expecting this, it marks the component as failed, and there is a chance this failed state will be reported to Fleet Server and persist there until the next check-in, up to 5 minutes later.
To mitigate the negative user experience of possibly seeing the agent go unhealthy as part of normal use, let's add a specification file option instructing the agent not to mark the component as failed if it exits unexpectedly. We will then enable this option for each of the Beats.
The simplest implementation is a boolean flag that ignores all unexpected process restarts. The ideal implementation may be two separate specification file options, one for the time period to watch and another for the number of restarts to ignore in that time period. For example:
This configuration means the agent will begin marking the component as failed on the 4th restart observed in a 5-second period. The intent of this configuration is to ensure we continue to report failed states for components that are restarting continuously. If either of the two fields is 0, the feature is disabled.
This implementation is expected to be simpler and to involve fewer changes to the internals of the agent than the first alternative considered: #1946.
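The counting behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not the agent's actual implementation: the class name `RestartMonitor` and the parameters `period_seconds` and `max_restarts` are hypothetical stand-ins for the two specification file options, and the sliding-window approach is one assumed way of interpreting "restarts observed in a period".

```python
from collections import deque


class RestartMonitor:
    """Hypothetical sketch: tolerate a limited number of unexpected restarts per period."""

    def __init__(self, period_seconds, max_restarts):
        # Both names are illustrative; they stand in for the two proposed options.
        self.period_seconds = period_seconds
        self.max_restarts = max_restarts
        self._restart_times = deque()

    def enabled(self):
        # Per the proposal, a value of 0 in either field disables the feature.
        return self.period_seconds > 0 and self.max_restarts > 0

    def on_restart(self, now):
        """Record an unexpected restart at time `now` (in seconds).

        Returns True if the component should be marked as failed.
        """
        if not self.enabled():
            # Feature disabled: every unexpected restart is reported as a failure.
            return True
        self._restart_times.append(now)
        # Forget restarts that have fallen out of the watched period.
        while now - self._restart_times[0] >= self.period_seconds:
            self._restart_times.popleft()
        # Fail only once the number of recent restarts exceeds the allowance.
        return len(self._restart_times) > self.max_restarts


monitor = RestartMonitor(period_seconds=5, max_restarts=3)
results = [monitor.on_restart(t) for t in (0.0, 1.0, 2.0, 3.0)]
print(results)
```

With an allowance of 3 restarts in a 5-second period, the first three restarts are ignored and the fourth is reported as a failure, while restarts spaced further apart than the period are never reported.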
Another alternative considered was to add a new state to the control protocol indicating that the agent should expect a restart, or perform a restart of the process. The implementation complexity of this approach was also considered to be too high.
Ideally this implementation can be made in time for the final 8.6.0 build candidate, but it can slip into the 8.6.1 release if it is deemed too risky. We may also choose to abandon this approach and handle this situation in a different way entirely if it does not solve the problem or the implementation becomes unexpectedly complex.
I don't have any strong preferences for what the numbers are and I haven't tested them.
Our goal here is to set this configuration so that it ignores restarts triggered by output changes but still catches unintentional or recurring restarts. In the Fleet UI I can't see us getting policy changes more often than every few seconds in the most optimistic case, since the rate is constrained by making a new long-polling connection and then breaking it when the policy change action is received.
In standalone mode this could perhaps change a bit faster, since there it is constrained by the configuration reload interval.
Maybe we could set this even more conservatively to 1 restart every 5 seconds. It is likely worth some experimentation with the output changes in the policy to see how fast you can get it to happen.