[bug]: force close for unclear reason #7180
Comments
This states the close reason:
The HTLC was about to time out. If you send an HTLC and the other peer never resolves or times it out, then we need to go on chain to sweep it. If we attempted to fail it off chain, but the peer never responded (or the connection died, or the Tor connection stalled, etc.), then we have no option but to go on chain to resolve the HTLC.
See #1226, which proposes that we start taking the expected gain into account when deciding whether or not to go on chain.
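A minimal Go sketch of the decision described above, combining the deadline check with the expected-gain check proposed in #1226. All types, names, and numbers here are illustrative assumptions, not lnd's actual internals:

```go
// Hypothetical sketch (not lnd's code): decide whether an unresolved outgoing
// HTLC justifies going on chain. Today the decision is deadline-driven; #1226
// proposes also weighing the HTLC's value against the expected chain cost.
package main

import "fmt"

type htlc struct {
	amountSat    int64 // value locked in the HTLC
	expiryHeight int32 // absolute CLTV expiry of the outgoing HTLC
}

// shouldGoOnChain returns true if we should broadcast the commitment to
// resolve the HTLC ourselves.
func shouldGoOnChain(h htlc, bestHeight, broadcastDelta int32, estChainCostSat int64) bool {
	// Deadline check: once we are within broadcastDelta blocks of the HTLC
	// expiry and the peer still hasn't failed or settled it off-chain,
	// going on chain is the only way to resolve it safely.
	deadlineReached := bestHeight >= h.expiryHeight-broadcastDelta
	if !deadlineReached {
		return false
	}

	// Economic check proposed in #1226: skip the force close if sweeping
	// the HTLC is expected to cost more than the HTLC is worth.
	return h.amountSat > estChainCostSat
}

func main() {
	h := htlc{amountSat: 10, expiryHeight: 800_000}
	// A 10 sat HTLC is far below any realistic chain cost, so under the
	// proposed policy we would not force close for it.
	fmt.Println(shouldGoOnChain(h, 799_995, 10, 2_000)) // false
}
```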
Yeah, I just don't understand this flow clearly enough. Who should cancel the HTLC off chain (sender or receiver), and when? What happens if the first attempt fails for some reason, like you said, because the connection stalled or was dropped? Do we retry it? And most importantly, was this caused by high fees that prevented an upstream force close from confirming in time, causing this domino effect? If that's the case, should the node operator watch their channels and manually bump the fee? I think lnd could do it automatically if there's a risk of losing another channel.
What I often see is a dead Tor connection and stuck HTLCs on it. When I see this, restarting lnd usually (90% of the time or more) re-establishes the connection and the HTLCs clear. It seems lnd should be able to re-establish these connections automatically, without my manual intervention, if that's all it takes to save the channel.
Yeah, there's an old idea that was never fully implemented: send a ping over a connection before we send an HTLC. If we get a pong back (we should, immediately), then we'd actually use the channel. If not, we'd treat the channel as if it were offline. This would ensure we never try to use a stale connection (due to Tor, mobile roaming, etc.). The implementation here is pretty simple, so I think we should dust this idea off again.
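A rough Go sketch of that ping-before-forward idea, assuming a hypothetical peer connection interface rather than any existing lnd API:

```go
// Hypothetical sketch: only forward an HTLC over a channel after the peer
// answers a ping, treating a silent connection as offline so we never commit
// an HTLC to a stale Tor or mobile link.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// peerConn abstracts whatever transport the peer is reachable over.
type peerConn interface {
	Ping(ctx context.Context) error // returns once a pong is received
}

// errPeerStale signals that the link should be treated as offline.
var errPeerStale = errors.New("peer did not answer ping; treating channel as offline")

// forwardIfLive pings the peer with a short timeout before committing the
// HTLC. If no pong arrives in time, we skip the channel instead of adding an
// HTLC that may get stuck and later force the channel closed.
func forwardIfLive(conn peerConn, addHTLC func() error, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	if err := conn.Ping(ctx); err != nil {
		return errPeerStale
	}
	return addHTLC()
}

// Toy connection that answers pings immediately, for demonstration.
type liveConn struct{}

func (liveConn) Ping(ctx context.Context) error { return nil }

func main() {
	err := forwardIfLive(liveConn{}, func() error {
		fmt.Println("peer is live, HTLC added")
		return nil
	}, 3*time.Second)
	if err != nil {
		fmt.Println("skipping channel:", err)
	}
}
```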
If you have an incoming HTLC, then you should be the one that cancels it. However, if it has a corresponding outgoing HTLC (it was a forward), then the remote party on the outgoing channel needs to cancel that one first. If the remote party doesn't cancel it (stale connection that wasn't detected, or peer offline), then you (lnd) need to go on chain to cancel it.

In your case, we went on chain, but things didn't confirm in time (default 40-block CLTV delta, which can be raised on the command line) since the mempool was jam packed. As a result the incoming time lock also expired, so the peer that sent us the HTLC needed to go on chain as well. That peer will then cancel back off chain once it resolves the HTLC. So in summary, everything worked as expected, but things took too long to confirm.

We have some basic deadline awareness, but initially it only targets a higher confirmation target. The missing link here is to dynamically fee bump as the deadline gets closer. We have a lot of research and design for stuff like this, but it hasn't all been implemented yet. One thing that can prevent this in the future is for a user to manually increase their CLTV delta when the mempool gets "full". In the future, we'll also start to do this automatically.
Yes, an operator can do that, and ideally lnd will eventually handle it automatically. Feel free to close this issue if the above answers your lingering questions @rkfg.
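A toy Go sketch of what the deadline-aware fee bumping mentioned above could look like: the target fee rate ramps up as the remaining block budget shrinks, and a real implementation would RBF/CPFP the pending transaction whenever the target exceeds what it currently pays. The function and numbers are illustrative only, not lnd's sweeper logic:

```go
// Hypothetical sketch: scale the sweep fee rate toward a maximum as the
// absolute CLTV deadline approaches, instead of estimating once at broadcast.
package main

import "fmt"

// feeForDeadline returns a sat/vB target given how many blocks remain before
// the HTLC deadline out of the total confirmation budget.
func feeForDeadline(blocksLeft, totalBudget int32, baseFee, maxFee int64) int64 {
	if blocksLeft <= 0 {
		return maxFee // deadline hit: pay up to the cap
	}
	if blocksLeft >= totalBudget {
		return baseFee
	}
	// Fraction of the confirmation budget already consumed.
	used := float64(totalBudget-blocksLeft) / float64(totalBudget)
	return baseFee + int64(used*float64(maxFee-baseFee))
}

func main() {
	// With a 40-block CLTV delta, the sweep starts cheap and ramps up as
	// blocks pass without confirmation.
	for _, left := range []int32{40, 30, 20, 10, 0} {
		fmt.Printf("blocks left %2d -> %d sat/vB\n", left, feeForDeadline(left, 40, 10, 300))
	}
}
```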
Yeah, I guess that explains it all. For now we need to be extra cautious during high-fee times; hopefully these ideas will be implemented soon! Thank you.
Background
A channel with lnmarkets was force closed, and the closure then cascaded to the downstream channel. For the sake of clarity I'll call the nodes A (my peer), B (me), and C (lnmarkets). There was a stuck HTLC of 10 sats from A to C going through B, and it timed out. For a reason I can't understand it wasn't failed off-chain but instead went on-chain; I can confirm the lnmarkets node (C) was online all the time and I don't see it disconnecting before the FC. The message says:

However, due to mempool conditions the fee in that tx (10 sat/vB) wasn't enough for it to be confirmed. I'm not sure if I'm correct, but this caused the channel the HTLC came from (A—B) to also be force closed, by peer A in this case (who was also online; we were tracking this issue in real time). As a result I lost two channels because of one 10 sat HTLC that can't even be represented on chain anyway. What's even weirder is that my peer (A) reported seeing that FC (it was also unconfirmed for the same reason), while to me (B) the channel appeared active though unusable: I tried to rebalance through it and it failed at hop 0. I tried restarting lnd and manually reconnecting the peer, but the channel was still shown as active.

Is it true that until the force close tx is confirmed on chain the corresponding incoming HTLC can't be failed off-chain? If that's the case, it can easily cause a chain of FCs whenever the minimum fee is above 10 sat/vB (the default maximum for anchor-type channels) and operators don't babysit their channels all the time to manually bump the fee through anchors and CPFP. If it's not the case, then there's a bug that prevented lnd from cancelling the incoming HTLC while the outgoing channel's FC was still unconfirmed. I suppose the same might have happened to C: the outgoing channel for that HTLC on their node was offline, the HTLC timed out and the channel was FCed, but the close couldn't confirm within 40 blocks (our hop timeout), so my node B couldn't cancel it off-chain and had to FC as well. My peer A, however, said that he doesn't see that 10-sat HTLC anywhere among his channels, so it must've been cancelled off-chain.

There are no messages in the logs regarding HTLC failure errors, at least not with the default INFO-level settings. Maybe there should be.
Your environment
- lnd: 0.15.4
- operating system (uname -a on *Nix): Raspbian arm64
- btcd, bitcoind, or other backend: bitcoind 23.99