Fix - Under load and during topology changes, thread saturation can occur, causing a lockup #2139
Description
Background:
This is an issue that has caused us some pain in our production environments, rendering our proto.actor cluster inoperable until we took action to reduce load on the system. In our environment we have pods coming online and going offline all the time, sometimes under heavy load, and this exposed an issue in the `EndpointManager` where requests for new endpoints wait behind a lock while another thread disposes an endpoint.
The change:
This PR modifies `EndpointManager` so that it disposes endpoints outside of the lock, while any concurrent requests to that endpoint return a blocked endpoint instead of waiting behind the lock. This way, while an endpoint cleans up, new endpoints can still be added and multiple endpoints can be disposed at the same time.

This change makes `EndpointManager` more robust by minimizing the time spent inside the lock, preventing potential thread saturation and lockup during topology changes. Since we don't want to send requests to a disposing endpoint, those requests get a blocking endpoint instead. After the dispose is complete, the same conditions apply as before for blocking and unblocking the endpoint, namely the `ShouldBlock` flag on the event and the `WaitAfterEndpointTerminationTimeSpan` config.
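To illustrate the shape of the change, here is a minimal sketch (not the actual PR code; `EndpointRegistrySketch`, `IEndpoint`, `BlockedEndpoint`, `GetOrAddEndpoint`, and `RemoveEndpoint` are illustrative names, and the real `EndpointManager` tracks more state and handles blocking/unblocking via `ShouldBlock` and `WaitAfterEndpointTerminationTimeSpan`): the lock is held only long enough to swap the live endpoint for a blocked placeholder, and the potentially slow `Dispose` runs outside it.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only -- not the actual EndpointManager code from this PR.
public interface IEndpoint : IDisposable { }

public sealed class BlockedEndpoint : IEndpoint
{
    public BlockedEndpoint(string address) => Address = address;
    public string Address { get; }
    public void Dispose() { }
}

public sealed class EndpointRegistrySketch
{
    private readonly object _lock = new();
    private readonly Dictionary<string, IEndpoint> _endpoints = new();

    public IEndpoint GetOrAddEndpoint(string address, Func<string, IEndpoint> factory)
    {
        lock (_lock)
        {
            // A disposing endpoint has already been replaced by a BlockedEndpoint,
            // so callers get that placeholder back immediately instead of queuing
            // behind a long-running Dispose.
            if (!_endpoints.TryGetValue(address, out var endpoint))
            {
                endpoint = factory(address);
                _endpoints[address] = endpoint;
            }
            return endpoint;
        }
    }

    public void RemoveEndpoint(string address)
    {
        IEndpoint? toDispose = null;
        lock (_lock)
        {
            if (_endpoints.TryGetValue(address, out var existing) && existing is not BlockedEndpoint)
            {
                toDispose = existing;
                // Swap in a blocked placeholder while cleanup runs.
                _endpoints[address] = new BlockedEndpoint(address);
            }
        }
        // The slow cleanup happens outside the lock, so new endpoints can still
        // be added and other endpoints disposed concurrently.
        toDispose?.Dispose();
    }
}
```

Because the placeholder is swapped in before the cleanup starts, concurrent senders are handed the blocked endpoint immediately instead of queuing behind the dispose, and multiple endpoints can be torn down in parallel.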
Testing:
I've added a project called `EndpointManagerTest` which reproduces the issue intermittently. After applying the change, the issue no longer occurs. Due to the nature of this issue, results can vary significantly between environments and CPUs, so a Dockerfile has been provided as well so the test can be run more consistently when limited to 1 CPU; results will of course still vary between machines. Without the fix, the issue only occurs intermittently (in my case roughly half the time), because it depends on race conditions and on which work ends up getting threads from the thread pool.
Details:
In our production environment, we had a scenario where many proto.actor client pods would start up and begin sending messages to actors on a couple of proto.actor member pods. The actors would typically handle the messages easily, so the load itself was not an issue. But all of these clients had to be added as endpoints when sending responses to partition identity or placement messages. If this happened while some endpoint was being terminated, it could result in thread saturation and lockup.

A thread would enter the `EndpointManager` lock to dispose an endpoint while many messages were being sent to new endpoints, all of which wait behind that lock. These are blocking waits, so each one holds the thread that was allocated to it, and other work can't use those threads. The thread pool does allocate new threads according to its algorithm, but if they keep getting handed tasks that end up waiting behind the lock, no work gets done, potentially for quite some time. Once the endpoint dispose work is finally given a thread and can complete, everything starts flowing again, but that doesn't always happen when so many threads are stuck in blocking waits and there is so much competition for the new threads as they are added.

In our production environment, another connected resource would disconnect because its health checks could not complete during the lockup, which ultimately caused the pod to restart. The pod would then enter an endless cycle of locking up and restarting.
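The starvation mechanism can be reproduced in isolation with a small console sketch like the one below (hypothetical demo code under assumed names, not the `EndpointManagerTest` project): one task holds a lock while doing slow "dispose" work, a burst of tasks blocks behind it and pins the pool threads they were given, and an unrelated task queued afterwards, standing in for a health check, is delayed until the pool injects enough new threads or the lock is released.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal, hypothetical repro of the starvation pattern described above.
internal static class LockStarvationDemo
{
    private static readonly object Gate = new();

    private static async Task Main()
    {
        ThreadPool.GetMinThreads(out var workers, out _);
        Console.WriteLine($"Min worker threads: {workers}");

        // The "dispose" task: holds the lock for a while.
        var dispose = Task.Run(() =>
        {
            lock (Gate) { Thread.Sleep(10_000); }
        });

        // Many "send to new endpoint" tasks: each blocks behind the lock,
        // holding its thread-pool thread for the duration.
        for (var i = 0; i < 200; i++)
        {
            _ = Task.Run(() => { lock (Gate) { } });
        }

        // Unrelated work, e.g. a health check, now has to wait for the pool
        // to grow (or for the lock to be released) before it even starts.
        var started = DateTime.UtcNow;
        await Task.Run(() => Console.WriteLine(
            $"Health check ran after {DateTime.UtcNow - started}"));

        await dispose;
    }
}
```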
Purpose
This pull request is a:
Checklist