Enrollment at load causes 503 HTTP response and rollback fails #1884

Open
blakerouse opened this issue Sep 20, 2022 · 0 comments
Labels
Project:FleetScaling Team:Fleet Label for the Fleet team

Comments

@blakerouse
Contributor

Overview

When 75k Elastic Agents were enrolled, 34 stayed in Updating. Updating in this case means that the agent has enrolled but has never checked in. These Elastic Agents never checked in because Fleet Server hit an error and then failed to roll back the creation of the API key (i.e. invalidate it) and the creation of the Elastic Agent (i.e. delete the document).

Log for a single 503 request with rollback error

  • 503 HTTP Request
  • rollback error on enrollment failure (Apikey index failed to refresh)
  • Abort query attempt on apikey (Apikey index failed to refresh)
  • rollback function "invalidate API key" failed (Apikey index failed to refresh)

What I believe is happening

It is not exactly clear from the log why the 503 was returned, but the event duration shows 63 seconds, which is more than a minute. I believe the HTTP client timed out and stopped waiting for a response because Fleet Server was not responding quickly enough. When this happens, Fleet Server needs to ensure that rollback completes successfully and that it cleans up properly.

The rollback process is not working, and I also believe that it runs out of order. It should perform the rollback steps in the reverse of the order in which they were registered, but instead it runs them in registration order.
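To illustrate the expected ordering, here is a minimal sketch of a rollback helper that runs its steps last-registered-first and keeps going when one step fails. The `Rollback` type, its method names, and the registration order shown are assumptions made for illustration, not the actual Fleet Server code. `errors.Join` requires Go 1.20 or later.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// rollbackFn undoes one step of an enrollment (hypothetical type).
type rollbackFn func(ctx context.Context) error

type step struct {
	name string
	fn   rollbackFn
}

// Rollback collects undo steps in the order they are registered.
type Rollback struct {
	steps []step
}

// Register appends an undo step.
func (r *Rollback) Register(name string, fn rollbackFn) {
	r.steps = append(r.steps, step{name: name, fn: fn})
}

// Run executes the undo steps in reverse registration order. A failing
// step does not stop the remaining steps; all failures are joined.
func (r *Rollback) Run(ctx context.Context) error {
	var errs []error
	for i := len(r.steps) - 1; i >= 0; i-- {
		s := r.steps[i]
		if err := s.fn(ctx); err != nil {
			errs = append(errs, fmt.Errorf("rollback function %q failed: %w", s.name, err))
		}
	}
	return errors.Join(errs...)
}

// rollbackOrder registers two undo steps (assumed order: API key first,
// agent document second) and reports the order in which they actually ran.
func rollbackOrder() ([]string, error) {
	var order []string
	var rb Rollback
	rb.Register("invalidate API key", func(context.Context) error {
		order = append(order, "invalidate API key")
		return errors.New("apikey index failed to refresh") // simulated failure
	})
	rb.Register("delete agent document", func(context.Context) error {
		order = append(order, "delete agent document")
		return nil
	})
	err := rb.Run(context.Background())
	return order, err
}

func main() {
	order, err := rollbackOrder()
	fmt.Println("ran:", order)
	fmt.Println("err:", err)
}
```

In this sketch the agent document is deleted before the API key is invalidated, and the simulated invalidation failure is still reported to the caller instead of being lost.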

Fixes to perform

  1. Add fleet.agent.id to all log messages as soon as the agent ID is generated, even if it's going to be rolled back. At the moment none of the events have that information in them, making it hard to tell if this truly is the reason an extra Elastic Agent document is created but never used.
  2. Ensure that rollback goes in reverse order.
  3. Fleet Server creates the Agent ID and then tries to save it into Elasticsearch (it should always try to remove it once the ID is generated, even if the create might not have been called).
  4. Fix "invalidate API key" so that it reliably invalidates the API key even when the load is high and it's hitting the timeout here: https://github.com/elastic/fleet-server/blob/main/internal/pkg/api/handleEnroll.go#L309
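One possible shape for fix 4, sketched under assumptions: run the invalidation on a context that is detached from the (possibly already canceled) request context, give each attempt its own timeout, and retry with a growing delay. The function `invalidateWithRetry`, the `invalidateFn` type, and all parameter values are hypothetical and are not Fleet Server APIs. `context.WithoutCancel` requires Go 1.21 or later.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// invalidateFn stands in for the call that invalidates an API key
// (hypothetical; the real call lives in Fleet Server).
type invalidateFn func(ctx context.Context, keyID string) error

// invalidateWithRetry runs invalidate with a per-attempt timeout on a
// context detached from reqCtx, so a canceled request cannot abort cleanup.
func invalidateWithRetry(reqCtx context.Context, keyID string, invalidate invalidateFn,
	attempts int, perTry, delay time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	// Keep request-scoped values but drop cancellation and deadline.
	base := context.WithoutCancel(reqCtx)
	var lastErr error
	for i := 0; i < attempts; i++ {
		ctx, cancel := context.WithTimeout(base, perTry)
		lastErr = invalidate(ctx, keyID)
		cancel()
		if lastErr == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2 // simple exponential backoff
		}
	}
	return fmt.Errorf("invalidate API key %s: %w", keyID, lastErr)
}

// simulateCanceledRequest models a request whose context is already
// canceled and an invalidate call that fails once before succeeding.
// It returns how many times invalidate was called and the final error.
func simulateCanceledRequest() (int, error) {
	reqCtx, cancel := context.WithCancel(context.Background())
	cancel() // the HTTP client has given up

	calls := 0
	fake := func(ctx context.Context, keyID string) error {
		calls++
		if err := ctx.Err(); err != nil {
			return err // what would happen if reqCtx were reused directly
		}
		if calls == 1 {
			return errors.New("apikey index failed to refresh") // simulated
		}
		return nil
	}
	err := invalidateWithRetry(reqCtx, "key-1", fake, 3, time.Second, time.Millisecond)
	return calls, err
}

func main() {
	calls, err := simulateCanceledRequest()
	fmt.Println("calls:", calls, "err:", err)
}
```

The same detached context could also be passed to the other rollback steps, so that deleting the agent document does not depend on the request still being alive.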
@blakerouse added the Team:Elastic-Agent-Control-Plane label Sep 20, 2022
@blakerouse self-assigned this Sep 20, 2022
@jen-huang added the Team:Fleet label and removed the Team:Elastic-Agent-Control-Plane label Nov 10, 2023