-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
peer: eliminate circular waiting by calling maybeSendNodeAnn async #7938
peer: eliminate circular waiting by calling maybeSendNodeAnn async #7938
Conversation
In this commit, we attempt to fix circular waiting scenario introduced inadvertently when [fixing a race condition scenario](lightningnetwork#7856). With that PR, we added a new channel that would block `Disconnect`, and `WaitForDisconnect` to ensure that only until the `Start` method has finished would those calls be allowed to succeed. The issue is that if the server is trying to disconnect a peer due to a concurrent connection, but `Start` is blocked on `maybeSendNodeAnn`, which then wants to grab the main server mutex, then `Start` can never exit, which causes `startReady` to never be closed, which then causes the server to be blocked. This PR attempts to fix the issue by calling `maybeSendNodeAnn` in a goroutine, so it can grab the server mutex and not block the `Start` method. Fixes lightningnetwork#7924 Fixes lightningnetwork#7928 Fixes lightningnetwork#7866
IMO this isn't an ideal solution, but pretty sure this is the interaction at hand. Here're some supporting goroutine stack traces: Server blocked on grabbing the mutex to execute the
Peer goroutines unable to be torn down (concurrent connection case), as waiting on
Peer goroutine can't finish starting up as it needs to call
So now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this works. There's also an old PR #7525 which removes maybeSendNodeAnn
as it seems it's not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🎉
Either this approach or the one in #7525 seem fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch 🎣
This PR is a follow up, to a [follow up](lightningnetwork#7938) of an [initial concurrency issue](lightningnetwork#7856) fixed in the peer goroutine. In lightningnetwork#7938, we noticed that the introduction of `p.startReady` can cause `Disconnect` to block. This happens as `Disconnect` cannot be called until `p.startReady` has been closed. `Disconnect` is also called from `InboundPeerConnected` (the case of concurrent peers, so we need to remove one of the connections) while the main server mutex is held. If `p.Start` blocks for any reason, then this leads to the deadlock as: we can't disconnect until we've finished starting, and we can't finish starting as we need the disconnect caller to exit as it has the mutex. In this commit, we now make the call to `prunePersistentPeerConnection` async. The call to `prunePersistentPeerConnection` eventually wants to grab the server mutex, which triggers the circular waiting scenario above. The main learning here is that no calls to the main server mutex path can block from `p.Start`. This is more or less a stop gap to resolve the issue initially introduced in v0.16.4. Assuming we want to move forward with this fix, we should reexamine `p.startReady` all together, and also revisit attempt to refactor this section of the code to eliminate the mega mutex in the server in favor of a dedicated event loop.
In this commit, we attempt to fix circular waiting scenario introduced inadvertently when fixing a race condition
scenario. With that PR, we added a new channel that would block
Disconnect
, andWaitForDisconnect
to ensure that only until theStart
method has finished would those calls be allowed to succeed.The issue is that if the server is trying to disconnect a peer due to a concurrent connection, but
Start
is blocked onmaybeSendNodeAnn
, which then wants to grab the main server mutex, thenStart
can never exit, which causesstartReady
to never be closed, which then causes the server to be blocked.This PR attempts to fix the issue by calling
maybeSendNodeAnn
in a goroutine, so it can grab the server mutex and not block theStart
method.Fixes #7924
Fixes #7928
Fixes #7866
Fixes #7917