Add specification file option to ignore unexpected process exits #1993

cmacknz · 2022-12-22T22:27:46Z

As a consequence of #1913, any change in the output configuration causes the Beats to restart. Since the agent is not expecting this, it marks the component as failed, and there is a chance this failed state will be reported to Fleet Server and persist there until the next check-in, up to 5 minutes later.

To mitigate the negative user experience of possibly seeing the agent go unhealthy as part of normal use, let's add a specification file option instructing the agent not to mark the component as failed if it exits unexpectedly. We will then enable this option for each of the Beats.

The simplest implementation is a boolean flag that ignores all unexpected process restarts. The ideal implementation may be two separate specification file options, one for the time period to watch and another for the number of restarts to ignore in that time period. For example:

restart_monitoring_period=5s
maximum_restarts_per_period=3

This configuration means the agent will begin marking the component as failed on the 4th restart observed within a 5 second period. The intent of this configuration is to ensure we continue to report failed states for components that are restarting continuously. If either of the two fields is 0, the feature is disabled.

This approach is expected to be simpler to implement and to involve fewer changes to the internals of the agent than the first alternative considered: #1946.

Another alternative considered was to add a new state to the control protocol indicating that the agent should expect a restart, or perform a restart of the process. The implementation complexity of this approach was also considered to be too high.

Ideally this implementation can be made in time for the final 8.6.0 build candidate, but it can slip into the 8.6.1 release if it is deemed too risky. We may also choose to abandon this approach and handle this situation in a different way entirely if it does not solve the problem or the implementation becomes unexpectedly complex.

michalpristas · 2022-12-23T16:36:00Z

@cmacknz are these values, 5s and 3, the ones you want to use as defaults?
I have the implementation ready; I just need to prepare the spec files to onboard the Beats.

cmacknz · 2022-12-23T16:50:00Z

I don't have any strong preferences for what the numbers are and I haven't tested them.

Our goal here is to set this configuration so that it ignores restarts triggered by output changes but still catches unintentional or recurring restarts. In the Fleet UI I can't see us getting policy changes more often than every few seconds in the most optimistic case, since it is constrained by making a new long polling connection and then breaking it when the policy change action is received.

In standalone mode this could maybe change a bit faster; there it is constrained by the configuration reload interval.

Maybe we could set this even more conservatively to 1 restart every 5 seconds. It is likely worth some experimentation with the output changes in the policy to see how fast you can get it to happen.

cmacknz added the Team:Elastic-Agent and v8.6.0 labels Dec 22, 2022

cmacknz assigned michalpristas Dec 22, 2022

cmacknz mentioned this issue Dec 23, 2022

Add configurable numbness for component restarts #2003

Merged

6 tasks

michalpristas closed this as completed in #2003 Dec 29, 2022
