Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Shutdown to gracefully halt HTTP servers #3165

Merged
merged 9 commits into from
Dec 19, 2023

Conversation

michel-laterman
Copy link
Contributor

@michel-laterman michel-laterman commented Dec 12, 2023

What is the problem this PR solves?

Checkin API long-poll connections terminate with an error when the server needs to shut down (providing a 5xx to the client).

How does this PR solve the problem?

Use the Shutdown method with a timeout to try to gracefully shutdown
HTTP connections. If the server times out while attempting to drain the
connections, connections are forced closed. If there is a long-polling
checkin request, the server should send a 200 response with the ack
token used in the request (same behaviour as the poll completing
normally).

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@michel-laterman michel-laterman added enhancement New feature or request Team:Fleet Label for the Fleet team labels Dec 12, 2023
Use the Shutdown method with a timeout to try to gracefully shutdown
HTTP connections. If the server times out while attempting to drain the
connections, connections are forced closed. If there is a long-polling
checkin request, the server should send a 200 response with the ack
token used in the request (same behaviour as the poll completing
normally).
@michel-laterman michel-laterman marked this pull request as ready for review December 14, 2023 17:44
@michel-laterman michel-laterman requested a review from a team as a code owner December 14, 2023 17:44
@michel-laterman
Copy link
Contributor Author

Does anyone know why the linter is all of a sudden having issues with our apm imports?

@jlind23
Copy link
Contributor

jlind23 commented Dec 15, 2023

@michel-laterman Looks like the linter is ok and only Sonar analysis complains about the insufficient code coverage.
Here is the coverage report - Profile and handlecheckin doensn't have any test. I let you decide whether or not this is a real problem.
Screenshot 2023-12-15 at 07 46 04

@nchaulet nchaulet self-requested a review December 18, 2023 15:01
@michel-laterman
Copy link
Contributor Author

The changes to handleCheckin should be covered by the new integration test (but coverage for those tests don't work as expected). I've also added a test case for the profiler

Copy link
Member

@jsoriano jsoriano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-wise LGTM, not tested.

internal/pkg/api/server.go Show resolved Hide resolved
Comment on lines +38 to +39
resp, err = http.DefaultClient.Do(req) //nolint:bodyclose // closed outside the loop
if err == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. Could it be closed here?

Suggested change
resp, err = http.DefaultClient.Do(req) //nolint:bodyclose // closed outside the loop
if err == nil {
resp, err = http.DefaultClient.Do(req)
if err == nil {
resp.Body.Close()

internal/pkg/api/handleCheckin.go Show resolved Hide resolved
@jlind23
Copy link
Contributor

jlind23 commented Dec 19, 2023

buildkite test it

@jlind23
Copy link
Contributor

jlind23 commented Dec 19, 2023

@nchaulet @michel-laterman all tests passed, shall we merge this?

Copy link
Member

@nchaulet nchaulet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally and it seems to work as expected 🚀

@michel-laterman michel-laterman enabled auto-merge (squash) December 19, 2023 15:03
Copy link

Quality Gate passed Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
57.1% 57.1% Coverage on New Code
0.0% 0.0% Duplication on New Code

See analysis details on SonarQube

@michel-laterman michel-laterman merged commit f099b91 into elastic:main Dec 19, 2023
8 checks passed
@michel-laterman michel-laterman deleted the shutdown branch March 13, 2024 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Team:Fleet Label for the Fleet team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gracefully drain checkin requests on shutdown
4 participants