When 75k Elastic Agents were enrolled, 34 stayed in Updating. Updating in this case means that the agent has enrolled but has never checked in. These Elastic Agents never checked in because Fleet Server hit an error and then failed to roll back both the creation of the API key (i.e. invalidate it) and the creation of the Elastic Agent (i.e. delete the document).
Log for a single 503 request with rollback error
503 HTTP Request
rollback error on enrollment failure (Apikey index failed to refresh)
Abort query attempt on apikey (Apikey index failed to refresh)
rollback function "invalidate API key" failed (Apikey index failed to refresh)
What I believe is happening
It is not exactly clear from the log why the 503 was returned, but the event duration shows 63 seconds, which is more than a minute. I believe the HTTP client decided not to wait for a response because the request timed out (Fleet Server was not responding as quickly as the client would like). When this happens, Fleet Server needs to ensure that the rollback completes successfully and that it cleans up properly.
The rollback process is not working, and I also believe that it runs out of order. It should perform the rollback steps in the reverse of the order in which they were registered, but instead it runs them in registration order.
Fixes to perform
Add fleet.agent.id to all log messages as soon as the agent ID is generated, even if the enrollment is going to be rolled back. At the moment none of the events carry that information, which makes it hard to tell whether this truly is the reason an extra Elastic Agent document is created but never used.
Ensure that rollback runs in reverse order.
Fleet Server creates the Agent ID and then tries to save it into Elasticsearch. It should always try to remove the document once the ID is generated, even if the create might not have been called.