Investigate allowing the agent to check in more frequently when the agent status changes #1946

Open
cmacknz opened this issue Dec 14, 2022 · 16 comments

@cmacknz (Member) commented Dec 14, 2022

Currently the agent checks in with fleet-server every 5 minutes, and for scalability reasons the check-in interval is likely to get significantly longer (we are currently targeting 30m).

Investigate allowing the agent to check in more frequently when the agent status changes, to ensure we are not presenting a minutes-old state to users in Fleet. In the absence of a status change the agent would check in at the configured rate (currently 5m, eventually 30m).

Ideally we would update the agent status in real time; however, there is a significant performance overhead to checking in due to the work required on fleet-server to authenticate the agent. This means we need to rate limit the check-ins, but we have never explored the lower bound of that rate limiting.

Experiment with different rates of status changes in the Fleet scaling tests using Horde or the agent itself (pending #2169) to assess the impact and determine a path forward.
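To make the idea concrete, here is a minimal Go sketch of the proposed behavior, assuming a hypothetical `checkinLoop` helper and toy durations; it is not the agent's actual Fleet gateway code. The agent checks in on the normal long-poll cycle, checks in early when the status changes, and enforces a configurable lower bound on the spacing between check-ins (the value this issue asks us to find).

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// checkinLoop checks in at the configured interval, but also checks in early
// whenever a status change is observed, subject to a minimum spacing between
// check-ins so rapid status flapping cannot overwhelm fleet-server.
func checkinLoop(ctx context.Context, statusChanged <-chan struct{},
	interval, minInterval time.Duration, checkin func()) {

	checkin()
	lastCheckin := time.Now()

	timer := time.NewTimer(interval)
	defer timer.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			// Regular long-poll cycle expired: check in as usual.
		case <-statusChanged:
			// Status changed: check in early, but only if the minimum
			// spacing since the last check-in has elapsed.
			if time.Since(lastCheckin) < minInterval {
				continue
			}
			// Stop (and drain) the periodic timer so the Reset below
			// starts a fresh cycle.
			if !timer.Stop() {
				<-timer.C
			}
		}
		checkin()
		lastCheckin = time.Now()
		timer.Reset(interval)
	}
}

func main() {
	statusChanged := make(chan struct{}, 1)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Toy values for the demo: a 2s "long poll" and a 500ms lower bound.
	go checkinLoop(ctx, statusChanged, 2*time.Second, 500*time.Millisecond, func() {
		fmt.Println("check-in at", time.Now().Format("15:04:05.000"))
	})

	time.Sleep(1 * time.Second)
	statusChanged <- struct{}{} // a status change triggers an early check-in
	<-ctx.Done()
}
```

The scaling experiments described above would then be about choosing `minInterval`: small enough that Fleet reflects status changes quickly, large enough that the fleet-server authentication load stays acceptable.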

@blakerouse (Contributor) commented:

We also need to update the Fleet gateway code to disconnect and reconnect to the long poll with the latest status information once it changes.

Elastic Agent has never done any of this before, and it was always possible for the agent to report an unhealthy status for the duration of the 5-minute long poll. That is no different from what is in 8.5. For that reason I believe this is a feature, and we should target it to 8.7.

@cmacknz (Member, Author) commented Dec 14, 2022

Agreed; while writing this I did notice I was describing a feature. I'll retarget it to 8.7, and we can pull it into 8.6.1 if we have the time and it seems necessary.

@cmacknz (Member, Author) commented Dec 22, 2022

> We also need to update the Fleet gateway code to disconnect and reconnect to the long poll with the latest status information once it changes.

We discussed this further, and there is a concern that more frequent disconnects and reconnects will have a negative effect in large scale deployments. There is actually an effort underway to increase the long polling timeout significantly to reduce the work fleet-server needs to do when authenticating a new agent connection.

Our conclusion is that we need another way to address the agent health rapidly changing from healthy to failed and back to healthy on an intentional restart, ideally one that does not filter out status changes caused by unintentional restarts.

One way to get the best of both worlds (infrequent reconnections, immediate status updates to Fleet) would be to switch from long polling to a WebSocket. We may also need to revisit how agent health is presented in the UI, as status changes for individual units do not necessarily need to be reflected in the global agent health we currently present.

@cmacknz (Member, Author) commented Jan 9, 2023

Closing this; the ideal fix is to use a streaming protocol to talk to fleet-server.

cmacknz closed this as completed Jan 9, 2023
@blakerouse (Contributor) commented:

I agree that a streaming protocol is the ideal fix, but I still believe that a debounce included with that would be better for scale and load. If we switch to a streaming protocol like gRPC or WebSocket for communication with Fleet Server, sending a message on every change is still probably too much at a large scale. I believe debouncing combined with the streaming protocol is a better idea.

cmacknz reopened this Jan 9, 2023
cmacknz removed the v8.7.0 label Jan 9, 2023
@cmacknz (Member, Author) commented Jan 9, 2023

Makes sense; I'll leave this open to address this case. Ideally we come back to it after we migrate to a streaming protocol.

@joshdover (Contributor) commented:

In the meantime, could we debounce, but with a really long debounce window? This would allow us to increase the long polling to 30m while still adding the re-check-in behavior now, to avoid the problem where the agent may appear unhealthy for 30m unnecessarily. If we made the debounce something as long as 5m, we could at least use 30m timeouts most of the time without making the current staleness any worse than it already is.
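As a rough illustration of how such a debounce could behave (a sketch only, with a hypothetical `debounceStatus` helper and toy durations, not the agent's actual implementation): a status change is only reported once it has been stable for the debounce window, so a flap that resolves within the window never reaches Fleet.

```go
package main

import (
	"fmt"
	"time"
)

// debounceStatus forwards status changes from in to out, but only after the
// status has remained stable for the debounce window. Rapid flapping (e.g.
// HEALTHY -> FAILED -> HEALTHY during an intentional restart) collapses into
// a single report, or no report at all if the status returns to the last
// reported value before the window expires.
func debounceStatus(in <-chan string, out chan<- string, window time.Duration) {
	var (
		lastReported string
		pending      string
		timer        *time.Timer
		timerC       <-chan time.Time
	)
	for {
		select {
		case status, ok := <-in:
			if !ok {
				return
			}
			pending = status
			// (Re)start the debounce window on every observed change.
			if timer != nil {
				timer.Stop()
			}
			timer = time.NewTimer(window)
			timerC = timer.C
		case <-timerC:
			timerC = nil
			// Only report if the now-stable value differs from what Fleet
			// last saw; a transient blip that resolved itself stays local.
			if pending != lastReported {
				out <- pending
				lastReported = pending
			}
		}
	}
}

func main() {
	in := make(chan string)
	out := make(chan string, 4)
	go debounceStatus(in, out, 200*time.Millisecond) // toy window for the demo

	in <- "HEALTHY"
	time.Sleep(300 * time.Millisecond) // stable past the window: reported
	in <- "FAILED"
	in <- "HEALTHY" // flaps back before the window expires: never reported
	time.Sleep(300 * time.Millisecond)
	close(in)
	close(out)
	for s := range out {
		fmt.Println("report to Fleet:", s)
	}
}
```

The window length is the main tuning knob; whether it can safely be much shorter than the current 5-minute check-in interval is what the scale tests should tell us.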

@cmacknz (Member, Author) commented Jan 13, 2023

We can definitely do that; we just need to land that change in the same release that increases the long-polling timeout.

@blakerouse (Contributor) commented:

Do we really want to leave something in an Unhealthy state for 5 minutes, or leave something marked Healthy for 5 minutes when it actually has an issue? Don't we want to show this information clearly within a reasonable amount of time?

I would think a window of 1 minute would be a better amount of time. In the case that something is down temporarily, 1 minute seems like enough time for it to come back, after which we should update the health status in Fleet. If something is wrong because of configuration, that issue is likely sticky, so once the agent reports the failure it doesn't need to report that same failure again for the 30-minute long poll, until a new policy is sent to the Elastic Agent, which will result in a reconnect of the long poll anyway.

@joshdover (Contributor) commented:

To be clear, I'm all for updating the health info in Fleet as fast as we possibly can. I also want to make sure that our scale targets and resource requirements don't regress when we make this change.

I think what we should do is add a setting to Horde to emulate this kind of noise in our scale tests. This would allow us to run a test with some reasonable guesses for typical and worst-case scenarios and see how the system holds up.

I wouldn't be comfortable with making the window shorter than 5 minutes until then, since 5-minute check-ins are what all of our tests have been based on thus far.

@blakerouse (Contributor) commented:

+1 for validating in Horde. I believe we would want Horde to use the same code as Elastic Agent when it comes to checking in to Fleet Server, including the same debounce code. That way the validation would ensure that, once released, a production deployment would show the same results.

@cmacknz (Member, Author) commented Jan 24, 2023

I wrote up #2169 for unifying the agent and Horde. I agree this is a good idea; the drift between the two has caused, and will probably continue to cause, issues for us since we aren't testing the actual agent code as deployed by real users.

@joshdover (Contributor) commented:

Sounds like we're aligned. Should we update the issue description with this plan outlined: solve #2169 and validate this change in Horde first?

cmacknz changed the title from "The agent should debounce observed unit status changes before reporting them to Fleet server" to "Investigate allowing the agent to check in more frequently when the agent status changes" on Jan 26, 2023
@cmacknz (Member, Author) commented Jan 26, 2023

I reworded this issue to be about finding the lower bound for how fast we can check in when the status changes. I don't think we need #2169 for this but it would help.

Do we want a separate issue specifically for implementing and assessing the impact of a 5m minimum bound on how often the agent can check in, as described in #1946 (comment)? That might be the first thing to do since we have specific numbers in mind, and we may want to track it separately since I believe it will be a UI regression if we move to 30m check-in intervals without it.

@jlind23 (Contributor) commented Feb 13, 2023

@pierrehilbert updating it to a P1 as this is a blocker for our scalability effort. @joshdover and @cmacknz will discuss the path forward here.

@joshdover (Contributor) commented:

@jlind23 as Craig suggested, I've opened a separate issue for this first change: #2257

I will move this issue out for now.
