Investigate allowing the agent to check in more frequently when the agent status changes #1946

Open
cmacknz opened this issue Dec 14, 2022 · 16 comments

@cmacknz (Member) commented Dec 14, 2022

Currently the agent checks in with fleet-server every 5 minutes, and for scalability reasons the check-in interval is likely to get significantly longer (we are currently targeting 30m).

Investigate allowing the agent to check in more frequently when the agent status changes, to ensure we are not presenting a minutes-old state to users in Fleet. In the absence of a status change the agent would check in at the configured rate (currently 5m, eventually 30m).

Ideally we would update the agent status in real time; however, there is a significant performance overhead to checking in due to the work required on fleet-server to authenticate the agent. This means we need to rate limit the check-ins, but we have never explored the lower bound of that rate limiting.

Experiment with different rates of status changes in the Fleet scaling tests using Horde or the agent itself (pending #2169) to assess the impact and determine a path forward.
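To make the idea concrete, here is a minimal Go sketch of the proposed behavior, assuming a hypothetical `checkinLoop` helper and toy durations; it is not the agent's actual Fleet gateway code. The agent checks in on the normal long-poll cycle, checks in early when the status changes, and enforces a configurable lower bound on the spacing between check-ins (the value this issue asks us to find).

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// checkinLoop checks in at the configured interval, but also checks in early
// whenever a status change is observed, subject to a minimum spacing between
// check-ins so rapid status flapping cannot overwhelm fleet-server.
func checkinLoop(ctx context.Context, statusChanged <-chan struct{},
	interval, minInterval time.Duration, checkin func()) {

	checkin()
	lastCheckin := time.Now()

	timer := time.NewTimer(interval)
	defer timer.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			// Regular long-poll cycle expired: check in as usual.
		case <-statusChanged:
			// Status changed: check in early, but only if the minimum
			// spacing since the last check-in has elapsed.
			if time.Since(lastCheckin) < minInterval {
				continue
			}
			// Stop (and drain) the periodic timer so the Reset below
			// starts a fresh cycle.
			if !timer.Stop() {
				<-timer.C
			}
		}
		checkin()
		lastCheckin = time.Now()
		timer.Reset(interval)
	}
}

func main() {
	statusChanged := make(chan struct{}, 1)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Toy values for the demo: a 2s "long poll" and a 500ms lower bound.
	go checkinLoop(ctx, statusChanged, 2*time.Second, 500*time.Millisecond, func() {
		fmt.Println("check-in at", time.Now().Format("15:04:05.000"))
	})

	time.Sleep(1 * time.Second)
	statusChanged <- struct{}{} // a status change triggers an early check-in
	<-ctx.Done()
}
```

The scaling experiments described above would then be about choosing `minInterval`: small enough that Fleet reflects status changes quickly, large enough that the fleet-server authentication load stays acceptable.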

@blakerouse (Contributor) commented:

We also need to update the Fleet gateway code to disconnect and reconnect to the long poll with the latest status information once it changes.

Elastic Agent has never done any of this before, and it was always possible for the agent to report an unhealthy status for the duration of the 5-minute long poll. That is no different from what is in 8.5. For that reason I believe this is a feature, and we should target it to 8.7.

@cmacknz (Member, Author) commented Dec 14, 2022

Agreed; while writing this I did notice I was describing a feature. I'll retarget it to 8.7, and we can pull it into 8.6.1 if we have the time and it seems necessary.

@cmacknz (Member, Author) commented Dec 22, 2022

> We also need to update the Fleet gateway code to disconnect and reconnect to the long poll with the latest status information once it changes.

We discussed this further, and there is a concern that more frequent disconnects and reconnects will have a negative effect in large scale deployments. There is actually an effort underway to increase the long polling timeout significantly to reduce the work fleet-server needs to do when authenticating a new agent connection.

Our conclusion is that we need another way to address the agent health rapidly changing from healthy to failed and back to healthy on an intentional restart, ideally one that does not filter out status changes caused by unintentional restarts.

One way to get the best of both worlds (infrequent reconnections, immediate status updates to Fleet) would be to switch from long polling to a WebSocket. We may also need to revisit how agent health is presented in the UI, as status changes for individual units do not necessarily need to be reflected in the global agent health we currently present.

@cmacknz (Member, Author) commented Jan 9, 2023

Closing this; the ideal fix is to use a streaming protocol to talk to fleet-server.

cmacknz closed this as completed Jan 9, 2023
@blakerouse (Contributor) commented:

I agree that a streaming protocol is the ideal fix, but I still believe that a debounce included with that would be better for scale and load. If we switch to a streaming protocol like gRPC or WebSocket for communication with Fleet Server, sending a message on every change is still probably too much at a large scale. I believe debouncing combined with the streaming protocol is a better idea.

cmacknz reopened this Jan 9, 2023
cmacknz removed the v8.7.0 label Jan 9, 2023
@cmacknz (Member, Author) commented Jan 9, 2023

Makes sense; I'll leave this open to address this case. Ideally we come back to it after we migrate to a streaming protocol.

@joshdover (Contributor) commented:

In the meantime, could we debounce, but with a really long debounce window? This would allow us to increase the long polling to 30m while still adding the re-check-in behavior now, to avoid the problem where the agent may appear unhealthy for 30m unnecessarily. If we made the debounce something as long as 5m, we could at least use 30m timeouts most of the time without making the current staleness any worse than it already is.
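As a rough illustration of how such a debounce could behave (a sketch only, with a hypothetical `debounceStatus` helper and toy durations, not the agent's actual implementation): a status change is only reported once it has been stable for the debounce window, so a flap that resolves within the window never reaches Fleet.

```go
package main

import (
	"fmt"
	"time"
)

// debounceStatus forwards status changes from in to out, but only after the
// status has remained stable for the debounce window. Rapid flapping (e.g.
// HEALTHY -> FAILED -> HEALTHY during an intentional restart) collapses into
// a single report, or no report at all if the status returns to the last
// reported value before the window expires.
func debounceStatus(in <-chan string, out chan<- string, window time.Duration) {
	var (
		lastReported string
		pending      string
		timer        *time.Timer
		timerC       <-chan time.Time
	)
	for {
		select {
		case status, ok := <-in:
			if !ok {
				return
			}
			pending = status
			// (Re)start the debounce window on every observed change.
			if timer != nil {
				timer.Stop()
			}
			timer = time.NewTimer(window)
			timerC = timer.C
		case <-timerC:
			timerC = nil
			// Only report if the now-stable value differs from what Fleet
			// last saw; a transient blip that resolved itself stays local.
			if pending != lastReported {
				out <- pending
				lastReported = pending
			}
		}
	}
}

func main() {
	in := make(chan string)
	out := make(chan string, 4)
	go debounceStatus(in, out, 200*time.Millisecond) // toy window for the demo

	in <- "HEALTHY"
	time.Sleep(300 * time.Millisecond) // stable past the window: reported
	in <- "FAILED"
	in <- "HEALTHY" // flaps back before the window expires: never reported
	time.Sleep(300 * time.Millisecond)
	close(in)
	close(out)
	for s := range out {
		fmt.Println("report to Fleet:", s)
	}
}
```

The window length is the main tuning knob; whether it can safely be much shorter than the current 5-minute check-in interval is what the scale tests should tell us.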

@cmacknz (Member, Author) commented Jan 13, 2023

We can definitely do that; we just need to land that change in the same release that increases the long-polling timeout.

@blakerouse (Contributor) commented:

Do we really want to leave something in an Unhealthy state for 5 minutes, or leave something marked Healthy for 5 minutes when it actually has an issue? Don't we want to show this information clearly within a reasonable amount of time?

I would think a window of 1 minute would be a better amount of time. In the case that something is down temporarily, 1 minute seems like enough time for it to come back, after which we should update the health status in Fleet. If something is wrong because of configuration, that issue is likely sticky, so once the agent reports the failure it doesn't need to report that same failure again for the 30-minute long poll, until a new policy is sent to the Elastic Agent, which will result in a reconnect of the long poll anyway.

@joshdover (Contributor) commented:

To be clear, I'm all for updating the health info in Fleet as fast as we possibly can. I also want to make sure that our scale targets and resource requirements don't regress when we make this change.

I think what we should do is add a setting to Horde to emulate this kind of noise in our scale tests. This would allow us to run a test with some reasonable guesses for typical and worst-case scenarios and see how the system holds up.

I wouldn't be comfortable with making the window shorter than 5 minutes until then, since 5-minute check-ins are what all of our tests have been based on thus far.

@blakerouse (Contributor) commented:

+1 for validating in Horde. I believe we would want Horde to use the same code as Elastic Agent when it comes to checking in to Fleet Server, including the same debounce code. That way the validation would ensure that, once released, a production deployment would show the same results.

@cmacknz (Member, Author) commented Jan 24, 2023

I wrote up #2169 for unifying the agent and Horde. I agree this is a good idea; the drift between the two has caused, and will probably continue to cause, issues for us since we aren't testing the actual agent code as deployed by real users.

@joshdover (Contributor) commented:

Sounds like we're aligned. Should we update the issue description with this plan outlined: solve #2169 and validate this change in Horde first?

cmacknz changed the title from "The agent should debounce observed unit status changes before reporting them to Fleet server" to "Investigate allowing the agent to check in more frequently when the agent status changes" on Jan 26, 2023
@cmacknz (Member, Author) commented Jan 26, 2023

I reworded this issue to be about finding the lower bound for how fast we can check in when the status changes. I don't think we need #2169 for this but it would help.

Do we want a separate issue specifically for implementing and assessing the impact of a 5m minimum bound on how often the agent can check in, as described in #1946 (comment)? That might be the first thing to do since we have specific numbers in mind, and we may want to track it separately since I believe it will be a UI regression if we move to 30m check-in intervals without it.

@jlind23 (Contributor) commented Feb 13, 2023

@pierrehilbert updating it to a P1 as this is a blocker for our scalability effort. @joshdover and @cmacknz will discuss the path forward here.

@joshdover (Contributor) commented:

@jlind23 as Craig suggested, I've opened a separate issue for this first change: #2257

I will move this issue out for now.
