Fix - Under load and during topology changes, thread saturation can occur, causing a lockup #2139
Description
Background:
This is an issue that has caused us some pain in our production environments, rendering our proto.actor cluster inoperable until we took action to reduce load on the system. In our environment we have pods coming online and going offline all the time, sometimes under heavy load, and this exposed an issue in the `EndpointManager` where requests for new endpoints wait behind a lock while another thread disposes an endpoint.
The change:
This PR modifies `EndpointManager` so that it disposes endpoints outside of the lock, while any concurrent requests to that endpoint return a blocked endpoint instead of waiting behind the lock. This way, while an endpoint cleans up, new endpoints can still be added and multiple endpoints can be disposed at the same time.

This change makes `EndpointManager` more robust by minimizing the time spent inside the lock, preventing potential thread saturation and lockup during topology changes. Since we don't want to send requests to a disposing endpoint, those requests get a blocking endpoint instead. After the dispose is complete, the same conditions apply as before for blocking and unblocking the endpoint, namely the `ShouldBlock` flag on the event and the `WaitAfterEndpointTerminationTimeSpan` config.
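To illustrate the shape of the change, here is a minimal sketch (not the actual PR code; `EndpointRegistrySketch`, `IEndpoint`, `BlockedEndpoint`, `GetOrAddEndpoint`, and `RemoveEndpoint` are illustrative names, and the real `EndpointManager` tracks more state and handles blocking/unblocking via `ShouldBlock` and `WaitAfterEndpointTerminationTimeSpan`): the lock is held only long enough to swap the live endpoint for a blocked placeholder, and the potentially slow `Dispose` runs outside it.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only -- not the actual EndpointManager code from this PR.
public interface IEndpoint : IDisposable { }

public sealed class BlockedEndpoint : IEndpoint
{
    public BlockedEndpoint(string address) => Address = address;
    public string Address { get; }
    public void Dispose() { }
}

public sealed class EndpointRegistrySketch
{
    private readonly object _lock = new();
    private readonly Dictionary<string, IEndpoint> _endpoints = new();

    public IEndpoint GetOrAddEndpoint(string address, Func<string, IEndpoint> factory)
    {
        lock (_lock)
        {
            // A disposing endpoint has already been replaced by a BlockedEndpoint,
            // so callers get that placeholder back immediately instead of queuing
            // behind a long-running Dispose.
            if (!_endpoints.TryGetValue(address, out var endpoint))
            {
                endpoint = factory(address);
                _endpoints[address] = endpoint;
            }
            return endpoint;
        }
    }

    public void RemoveEndpoint(string address)
    {
        IEndpoint? toDispose = null;
        lock (_lock)
        {
            if (_endpoints.TryGetValue(address, out var existing) && existing is not BlockedEndpoint)
            {
                toDispose = existing;
                // Swap in a blocked placeholder while cleanup runs.
                _endpoints[address] = new BlockedEndpoint(address);
            }
        }
        // The slow cleanup happens outside the lock, so new endpoints can still
        // be added and other endpoints disposed concurrently.
        toDispose?.Dispose();
    }
}
```

Because the placeholder is swapped in before the cleanup starts, concurrent senders are handed the blocked endpoint immediately instead of queuing behind the dispose, and multiple endpoints can be torn down in parallel.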
Testing:
I've added a project called `EndpointManagerTest` which reproduces the issue intermittently. After applying the change, the issue no longer occurs. Due to the nature of this issue, results can vary significantly between environments and CPUs, so a Dockerfile has been provided as well so the test can be run more consistently when limited to 1 CPU; results will of course still vary between machines. Without the fix, the issue only occurs intermittently (in my case roughly half the time), because it depends on race conditions and on which work ends up getting threads from the thread pool.
Details:
In our production environment, we had a scenario where many proto.actor client pods would start up and begin sending messages to actors on a couple of proto.actor member pods. The actors would typically handle the messages easily, so the load itself was not an issue. But all of these clients had to be added as endpoints when sending responses to partition identity or placement messages. If this happened while some endpoint was being terminated, it could result in thread saturation and lockup.

A thread would enter the `EndpointManager` lock to dispose an endpoint while many messages were being sent to new endpoints, all of which wait behind that lock. These are blocking waits, so each one holds the thread that was allocated to it, and other work can't use those threads. The thread pool does allocate new threads according to its algorithm, but if they keep getting handed tasks that end up waiting behind the lock, no work gets done, potentially for quite some time. Once the endpoint dispose work is finally given a thread and can complete, everything starts flowing again, but that doesn't always happen when so many threads are stuck in blocking waits and there is so much competition for the new threads as they are added.

In our production environment, another connected resource would disconnect because its health checks could not complete during the lockup, which ultimately caused the pod to restart. The pod would then enter an endless cycle of locking up and restarting.
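The starvation mechanism can be reproduced in isolation with a small console sketch like the one below (hypothetical demo code under assumed names, not the `EndpointManagerTest` project): one task holds a lock while doing slow "dispose" work, a burst of tasks blocks behind it and pins the pool threads they were given, and an unrelated task queued afterwards, standing in for a health check, is delayed until the pool injects enough new threads or the lock is released.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal, hypothetical repro of the starvation pattern described above.
internal static class LockStarvationDemo
{
    private static readonly object Gate = new();

    private static async Task Main()
    {
        ThreadPool.GetMinThreads(out var workers, out _);
        Console.WriteLine($"Min worker threads: {workers}");

        // The "dispose" task: holds the lock for a while.
        var dispose = Task.Run(() =>
        {
            lock (Gate) { Thread.Sleep(10_000); }
        });

        // Many "send to new endpoint" tasks: each blocks behind the lock,
        // holding its thread-pool thread for the duration.
        for (var i = 0; i < 200; i++)
        {
            _ = Task.Run(() => { lock (Gate) { } });
        }

        // Unrelated work, e.g. a health check, now has to wait for the pool
        // to grow (or for the lock to be released) before it even starts.
        var started = DateTime.UtcNow;
        await Task.Run(() => Console.WriteLine(
            $"Health check ran after {DateTime.UtcNow - started}"));

        await dispose;
    }
}
```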
Purpose
This pull request is a:
Checklist