[bug]: Lnd doesn't seem to gracefully shutdown on 0.17.0-beta.rc1 #7928
Comments
Could you please pull a goroutine dump while it's attempting to shut down? That would help a lot.
What extra patch are you running? W/e that was should already be folded into the latest rc.
The extra patch being referred to here is this fix: #7922 @niteshbalusu11 did you compile master, or apply this PR to a branch of yours?
I compiled master.
How do I do this?
I enabled profiling in the conf file, restarted lnd, and then after it started back up, I just get this when attempting to curl
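For reference, a rough sketch of the usual steps, assuming lnd's `profile` option and Go's stock net/http/pprof endpoint; the port below is just an example:

```sh
# In lnd.conf (under [Application Options]), enable lnd's pprof HTTP server, e.g.:
#   profile=9736
# Restart lnd, then fetch a full goroutine dump:
curl "http://localhost:9736/debug/pprof/goroutine?debug=2" > goroutines.txt
```

The `debug=2` variant prints the full stack for every goroutine, which is what gets analysed later in this thread.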
New results today
then a whole bunch of
then it's still trying to talk to bitcoind 6 mins later
around the 15min mark, then container gets killed because of docker timeout
Ah, so I guess that part is already shut down...
Thanks a lot! Took me a while to comb through ~19k lines... Might be I missed something but this specific area looks suspicious to me: There are 218 goroutines waiting for
And 43 goroutines waiting for
Not sure if we changed anything in that area or if that is a red herring (and it's just normal graph sync and zombie resurrection going on). @niteshbalusu11 for how long has the node been running now without shutting down? The other mutex waiting for a lock is this one, which sounds more like something we recently worked on (cc @yyforyongyu):
I had to restart a few times to get these logs, so between the last two restarts it's like 5 mins apart.
Okay, this seems to be related to some of the other reported issues, think this is the most relevant one:
In this commit, we attempt to fix a circular waiting scenario inadvertently introduced when [fixing a race condition scenario](lightningnetwork#7856). With that PR, we added a new channel that blocks `Disconnect` and `WaitForDisconnect` so that those calls can only succeed once the `Start` method has finished. The issue is that if the server tries to disconnect a peer due to a concurrent connection while `Start` is blocked on `maybeSendNodeAnn`, which in turn wants to grab the main server mutex, then `Start` can never exit, `startReady` is never closed, and the server stays blocked. This PR attempts to fix the issue by calling `maybeSendNodeAnn` in a goroutine, so it can grab the server mutex without blocking the `Start` method. Fixes lightningnetwork#7924 Fixes lightningnetwork#7928 Fixes lightningnetwork#7866
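For readers skimming the thread, here is a stripped-down, self-contained Go sketch of the circular wait described in that commit message. The types and names (`startReady`, `maybeSendNodeAnn`, the server mutex) are simplified stand-ins, not lnd's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Simplified stand-ins for lnd's server and peer types; names are illustrative.
type peer struct {
	startReady chan struct{} // closed once Start has finished
}

type server struct {
	mu sync.Mutex
}

// start models peer.Start before the fix: maybeSendNodeAnn is called
// synchronously, so startReady cannot be closed until the server mutex frees up.
func (s *server) start(p *peer) {
	s.maybeSendNodeAnn()
	close(p.startReady)
}

// maybeSendNodeAnn needs the main server mutex.
func (s *server) maybeSendNodeAnn() {
	s.mu.Lock()
	defer s.mu.Unlock()
	// ... broadcast a fresh node announcement ...
}

// disconnectDuplicate models the server handling a concurrent connection: it
// holds the server mutex while waiting for the peer's Start to finish.
func (s *server) disconnectDuplicate(p *peer) {
	s.mu.Lock()
	defer s.mu.Unlock()
	<-p.startReady // circular wait: start() needs s.mu, this path needs startReady
}

func main() {
	s := &server{}
	p := &peer{startReady: make(chan struct{})}

	go s.disconnectDuplicate(p)
	time.Sleep(10 * time.Millisecond) // let the disconnect path grab the mutex first
	go s.start(p)

	time.Sleep(100 * time.Millisecond)
	fmt.Println("both goroutines are now stuck; the fix runs maybeSendNodeAnn in its own goroutine")
}
```

The fix described above amounts to running the announcement step as `go s.maybeSendNodeAnn()`, so `Start` can return and close `startReady` without first waiting on the server mutex.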
We have a candidate fix here: #7928
Please merge 🙏
@niteshbalusu11 does that seem to do the trick?
I have not tested it yet.
Just tested it, seems to be working well.
Going off of this:
The
This goroutine that was trying to call
@Crypt-iQ 429mb
Noticed something in this log dump that is separate from my comment above:
The
Since you're running a pruned node, it will wait here for the block: https://github.com/btcsuite/btcwallet/blob/9c135429d5b7c5c84b19eaffce9e6a32e2d9d5d0/chain/bitcoind_conn.go#L479
Isn't that also strange, seems to be circular as
Good catch. This doesn't prevent shutdown, but is certainly not the right behavior. Apparently, that
Still getting logs flooded with lines like
The node will not shut down until a few thousand of these are spat out. Is this issue actually resolved?
That just sounds like a logging issue. Shutting down does involve quite a bit of work, so it might take some seconds. And from how I interpret your message, the node does actually shut down?
@dskvr Your node is processing many
@guggero @Crypt-iQ To be clear, the original text of this issue
What I have described is the exact same issue described by the OP. I added a post to this issue here; the resource consumption and logging patterns I outlined were echoed shortly after by @mycelia1 here, followed up with a goroutine dump that was reviewed by @guggero here.
I'm not talking seconds, I'm talking 1-2 hours. The only difference between what I've described and the OP is that I am not forcing a SIGTERM after 15m. It takes about 1-2 hours to shut down a node for a layer-2 solution that demands ~100% uptime from its nodes. So between the stated purpose and risks of said software, the layer-2 problem it aims to solve, and the contents of this thread, it would be difficult to infer anything other than the following: blocking a restart while updating non-essential graph data, in a circumstance that can result in monetary loss, is not the intended behavior. Also, thanks for highlighting the milestone update that my eyes completely missed; I'll watch out for
The linked PR that closed this issue did fix a shutdown bug, but it seems that another issue is lock contention with the graph's
I'm not sure how (or if) lock contention is handled in the go runtime (i.e. whether it's FIFO on mutex acquisition or something). I also wonder if we're receiving
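To make the contention concern concrete, here is a tiny self-contained Go sketch (not lnd code): a couple hundred gossip-style writers hammer one mutex while a single "shutdown" goroutine tries to take it once. As far as I know, Go's sync.Mutex does switch waiters into a FIFO-ish "starvation mode" once they have been blocked for more than about 1ms (since Go 1.9), but a single acquisition can still be delayed noticeably under heavy write traffic:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex // stand-in for a heavily contended graph mutex
	stop := make(chan struct{})
	var wg sync.WaitGroup

	// Simulate gossip handlers repeatedly grabbing the shared mutex.
	for i := 0; i < 200; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-stop:
					return
				default:
				}
				mu.Lock()
				time.Sleep(100 * time.Microsecond) // pretend to apply a channel update
				mu.Unlock()
			}
		}()
	}

	time.Sleep(50 * time.Millisecond) // let contention build up

	// The "shutdown" path needs the same mutex exactly once.
	start := time.Now()
	mu.Lock()
	waited := time.Since(start)
	mu.Unlock()

	close(stop)
	wg.Wait()
	fmt.Printf("shutdown path waited %v for the contended mutex\n", waited)
}
```

If the shutdown path has to take such a lock many times, those waits add up, which would fit the "thousands of error lines before the process exits" symptom reported above.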
Okay, thanks for the clarification. It did sound like you were mostly complaining about the log spam. But anything more than a minute for shutting down is unacceptable, I agree.
Maybe this issue is resolved? See #8250 for a description of why lnd refused to stop (using a pruned bitcoind).
Hi @niteshbalusu11, can you please confirm if this is still an issue with lnd 0.17.4? Thanks
I think it can be closed. Doesn't happen anymore.
Background
I am running lnd 0.17.0 rc1 plus Elle's fix that was merged recently. When trying to restart or shut down LND, my docker container never stops; I have a 15-minute timeout in the compose file before docker sends a kill signal. I just get these logs, the timeout eventually hits, and the container gets killed, so LND doesn't shut down gracefully.
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(878811056959193089) got: router shutting down
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(818216971034558465) got: router shutting down
2023-08-26 20:26:08.468 [ERR] DISC: Update edge for short_chan_id(878297584961191937) got: router shutting down
2023-08-26 20:26:08.469 [ERR] DISC: Update edge for short_chan_id(771274421704720385) got: router shutting down
2023-08-26 20:26:13.261 [ERR] DISC: Update edge for short_chan_id(865996249021546496) got: router shutting down
2023-08-26 20:26:13.374 [ERR] DISC: Update edge for short_chan_id(875772006717980672) got: router shutting down
Your environment
lnd 0.17.0-beta.rc1
Ubuntu
bitcoind 24