Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexplainable clock drift in LND causing issues #5340

Closed
maxgoldberg opened this issue May 27, 2021 · 11 comments
Closed

Unexplainable clock drift in LND causing issues #5340

maxgoldberg opened this issue May 27, 2021 · 11 comments
Labels
hardware possible issue w/ hardware needs investigation

Comments

@maxgoldberg
Copy link

Background

On May 20 UTC, my node had multiple channel closures due to what I originally thought (from the logs) was because of an ipv6 connectivity error, but upon poring over more logs, it appears to be a mysterious clock issue that only affected LND. For one log entry at 20:38, the clock skips ahead an hour to 21:38, then goes back an hour like nothing happened and continues going about its business. An hour later, when 21:38 actually occurs, all hell breaks loose and multiple channels have been force-closed remotely.

Related issue: #5324 which was one of the remote nodes that closed a channel with my node.

lnd log (timestamps are in local CDT which I believe is UTC-5

2021-05-19 20:38:00.047 [ERR] BTCN: Can't accept connection: unable to accept connection from [__REMOTE_NODE_1_IPV6_ADDR__]:1101: read tcp [__MY_NODE_IPV6_ADDR__]:9735->[__REMOTE_NODE_1_IPV6_ADDR__]:1101: i/o timeout
2021-05-19 21:38:12.609 [ERR] BTCN: Can't accept connection: unable to accept connection from [__REMOTE_NODE_2_IPV6_ADDR__]:24789: read tcp [__MY_NODE_IPV6_ADDR__]:9735->[__REMOTE_NODE_2_IPV6_ADDR__]:24789: i/o timeout
2021-05-19 20:39:01.346 [INF] HSWC: ChannelLink(__REDACTED_CHANNEL_ID__): updating commit fee to 29104 sat/kw sat/kw`
2021-05-19 20:39:01.346 [INF] HSWC: ChannelLink(__REDACTED_CHANNEL_ID2__): updating commit fee to 29104 sat/kw sat/kw
2021-05-19 20:39:29.654 [INF] HSWC: ChannelLink(__REDACTED_CHANNEL_ID3__): updating commit fee to 29104 sat/kw sat/kw
2021-05-19 20:39:45.187 [INF] HSWC: ChannelLink(__REDACTED_CHANNEL_ID4__): updating commit fee to 29104 sat/kw sat/kw
... many log lines cut ...
2021-05-19 21:37:01.825 [ERR] LTND: GetCertificate: acme/autocert: server name component count invalid
2021-05-19 21:38:02.259 [ERR] LTND: GetCertificate: acme/autocert: server name component count invalid
2021-05-19 21:38:12.606 [INF] UTXN: Attempting to graduate height=684256: num_kids=0, num_babies=0
2021-05-19 21:38:12.609 [ERR] LTND: Unable to lookup witness: no witnesses
2021-05-19 21:38:12.609 [ERR] BTCN: Can't accept connection: unable to accept connection from [__REMOTE_NODE_1_IPV6_ADDR__]:57069: read tcp [__MY_NODE_IPV6_ADDR__]:9735->[__REMOTE_NODE_1_IPV6_ADDR__]:57069: i/o timeout
2021-05-19 21:38:12.612 [ERR] BTCN: Can't accept connection: unable to accept connection from __REMOTE_NODE_3_IPV4_ADDR__::46300: EOF
2021-05-19 21:38:12.614 [INF] SRVR: New inbound connection from __REMOTE_NODE_4_IPV4_ADDR__:38716
2021-05-19 21:38:12.631 [INF] HSWC: Garbage collected 2 shared secret hashes at height=684256
2021-05-19 21:38:12.639 [INF] PEER: unable to read message from __REMOTE_NODE_5_PUBKEY__@__REMOTE_NODE_5_IPV4_ADDR__:59282: read tcp __MY_NODE_IPV4_ADDR__:9735->__MY_NODE_IPV6_ADDR__:59282: use of closed network connection
2021-05-19 21:38:12.641 [ERR] RPCS: [/chainrpc.ChainNotifier/RegisterSpendNtfn]: context canceled
... cut many ipv6 errors e.g. Can't accept connection: unable to accept connection:  i/o timeout or EOF ...
2021-05-19 21:38:12.691 [INF] CRTR: Block 00000000000000000006e34f6af5b33a1256edc9b07081ffee44a37c02f2eec7 (height=684257) closed 0 channels
2021-05-19 21:38:12.691 [INF] CRTR: Pruning channel graph using block 0000000000000000000ab9408daec1ecd6f9736b2a58b2912398c205dfbcd70e (height=684258)
2021-05-19 21:38:12.919 [INF] CRTR: Block 0000000000000000000ab9408daec1ecd6f9736b2a58b2912398c205dfbcd70e (height=684258) closed 0 channels
2021-05-19 21:38:12.919 [INF] CRTR: Pruning channel graph using block 000000000000000000098fabad85b4a54c62b942484366c6e4cf61b48cabd87f (height=684259)
2021-05-19 21:38:13.021 [INF] CRTR: Block 000000000000000000098fabad85b4a54c62b942484366c6e4cf61b48cabd87f (height=684259) closed 0 channels
2021-05-19 21:38:13.091 [INF] HSWC: ChannelLink(678835:687:0): updating commit fee to 29126 sat/kw sat/kw
2021-05-19 21:38:13.111 [INF] CHDB: Pruned unconnected node 02ac89e773073d7a8c62dbc1f7e0135a45b7129ecf820398babc229747fd5b0e41 from channel graph
2021-05-19 21:38:13.111 [INF] CHDB: Pruned 1 unconnected nodes from the channel graph
2021-05-19 21:38:13.115 [INF] CRTR: Pruning channel graph using block 000000000000000000006bd3d6ef94d8a01de84e171d3553534783b128f06aad (height=684260)
2021-05-19 21:38:13.219 [INF] CRTR: Block 000000000000000000006bd3d6ef94d8a01de84e171d3553534783b128f06aad (height=684260) closed 0 channels
2021-05-19 21:38:13.355 [INF] NTFN: New block: height=684257, sha=00000000000000000006e34f6af5b33a1256edc9b07081ffee44a37c02f2eec7
2021-05-19 21:38:13.356 [INF] NTFN: Cancelling epoch notification, epoch_id=9
...  bunch of misc stuff cut ...
2021-05-19 21:38:15.062 [INF] UTXN: Attempting to graduate height=684260: num_kids=0, num_babies=0
2021-05-19 21:38:15.062 [ERR] LTND: Unable to lookup witness: no witnesses
2021-05-19 21:38:15.063 [ERR] LTND: Unable to lookup witness: no witnesses
2021-05-19 21:38:15.063 [ERR] LTND: Unable to lookup witness: no witnesses
2021-05-19 21:38:15.075 [INF] HSWC: Garbage collected 6 shared secret hashes at height=684260
2021-05-19 21:38:16.075 [WRN] DISC: ignoring remote ChannelAnnouncement for own channel
2021-05-19 21:38:16.076 [WRN] DISC: incorrect pubkey to resurrect zombie with chan_id=551086:1551:0
2021-05-19 21:38:16.088 [WRN] DISC: ignoring remote ChannelAnnouncement for own channel
2021-05-19 21:38:16.094 [WRN] DISC: ignoring remote ChannelAnnouncement for own channel
2021-05-19 21:38:16.128 [WRN] DISC: incorrect pubkey to resurrect zombie with chan_id=542070:1443:0
2021-05-19 21:38:16.136 [WRN] DISC: incorrect pubkey to resurrect zombie with chan_id=653656:1565:1
2021-05-19 21:38:16.169 [WRN] DISC: incorrect pubkey to resurrect zombie with chan_id=551133:1287:0
2021-05-19 21:38:19.189 [WRN] DISC: incorrect pubkey to resurrect zombie with chan_id=565962:729:0
2021-05-19 21:38:19.216 [WRN] DISC: incorrect pubkey to resurrect zombie with chan_id=563795:889:0
2021-05-19 21:38:19.243 [WRN] DISC: incorrect pubkey to resurrect zombie with chan_id=680469:1141:0
2021-05-19 21:38:19.244 [WRN] DISC: ignoring remote ChannelAnnouncement for own channel
2021-05-19 21:38:19.261 [WRN] DISC: ignoring remote ChannelAnnouncement for own channel
021-05-19 21:38:29.681 [WRN] DISC: incorrect pubkey to resurrect zombie with chan_id=565962:729:0
2021-05-19 21:38:34.910 [WRN] DISC: ignoring remote ChannelAnnouncement for own channel
2021-05-19 21:39:02.290 [ERR] LTND: GetCertificate: acme/autocert: server name component count invalid
2021-05-19 21:39:12.634 [ERR] HSWC: ChannelLink(__CLOSED_CHANNEL_1__): failing link: unable to complete dance with error: remote unresponsive
2021-05-19 21:39:12.634 [INF] HSWC: ChannelLink(__CLOSED_CHANNEL_1__): exited
2021-05-19 21:39:12.636 [INF] HSWC: Removing channel link with ChannelID(a0af035a5dce1d33cd551aa0dd8f4d7026e1524fa2564faf71ab7fde414b52c3)
2021-05-19 21:39:12.636 [INF] HSWC: ChannelLink(__CLOSED_CHANNEL_1__): stopping
2021-05-19 21:39:12.988 [ERR] HSWC: ChannelLink(__CLOSED_CHANNEL_2__): failing link: unable to complete dance with error: remote unresponsive
2021-05-19 21:39:12.988 [INF] HSWC: ChannelLink(__CLOSED_CHANNEL_2__): exited
2021-05-19 21:39:12.988 [INF] HSWC: Removing channel link with ChannelID(9d5715145724ab6f4fdbe19fc95d8b3ae8649652c1c61356c33acff4272e87ef)
2021-05-19 21:39:12.988 [INF] HSWC: ChannelLink(__CLOSED_CHANNEL_2__): stopping
2021-05-19 21:39:13.049 [ERR] HSWC: ChannelLink(__CLOSED_CHANNEL_3__): failing link: unable to complete dance with error: remote unresponsive
2021-05-19 21:39:13.050 [INF] HSWC: ChannelLink(__CLOSED_CHANNEL_3__): exited
2021-05-19 21:39:13.050 [INF] HSWC: Removing channel link with ChannelID(2edfba2c6a181e4b313d9dda0d21b1da51ad42194bfcf9f3983888df70ad5afe)
2021-05-19 21:39:13.050 [INF] HSWC: ChannelLink(__CLOSED_CHANNEL_3__): stopping
2021-05-19 21:39:13.309 [INF] CRTR: Processed channels=4 updates=3483 nodes=273 in last 1m0.703050241s

I've heavily cut out/redacted anything I didn't think was applicable. I'm @max on the lnd slack, and can provide unredacted logs there. Please let me know if I can provide any other helpful logs here as well.

The machine runs ntpd, but as far as I can tell the clock was not changed during this time. Every other log on the system looks normal and I don't see any clock changes or suspect behavior, so I am at a total loss as to what happened. bitcoind was running on the same machine and comparing the timestamps for blocks around this height seem to match as well. The line with 2021-05-19 21:38:12.609 in the lnd log is literally the only thing on the system I can find that shows anything weird happened with the clock.

Your environment

  • lnd 0.13.rc2 (have updated to rc3 since)
  • Linux 4.19.0-16-amd64 Fix name typo in README #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
  • bitcoind 0.21.1
  • colocated with multiple uplinks (so network connectivity less-likely to be an issue)

Steps to reproduce

Unknown

Expected behaviour

Time travel is impossible

Actual behaviour

We go back to the future

@Roasbeef
Copy link
Member

Which Go version are you running? Possibly it's a runtime or kernel issue as we don't do any thing custom w/ the runtime's time source (don't think it's even possible).

Are you monitoring the CPU/resource utilization of the node? During this "time travel" instance, do you see the node continue to receive new blocks?

@C-Otto
Copy link
Contributor

C-Otto commented May 27, 2021

I can provide logs for the other side (in addition to those mentioned in #5324).

@Roasbeef
Copy link
Member

2021-05-19 21:39:13.049 [ERR] HSWC: ChannelLink(CLOSED_CHANNEL_3): failing link: unable to complete dance with error: remote unresponsive

From this, AFAICT, either there's a major network connectivity issue (are either nodes on Tor?) or one of the nodes has simply "frozen". If that happens, then it would make sense that force closes occur: your node is holding onto HTLCs and is no longer live.

@maxgoldberg
Copy link
Author

Which Go version are you running? Possibly it's a runtime or kernel issue as we don't do any thing custom w/ the runtime's time source (don't think it's even possible).

Are you monitoring the CPU/resource utilization of the node? During this "time travel" instance, do you see the node continue to receive new blocks?

# go version
go version go1.16.3 linux/amd64

bitcoind's debug.log shows the following

8 date='2021-05-19T04:36:31Z' progress=1.000000 cache=58.6MiB(364205txo)
2021-05-19T04:43:50Z UpdateTip: new best=000000000000000000083c4631722064f03b62c020dd13401659b1b63930b16d height=684154 version=0x20000004 log2_work=92.887676 tx=64328343
3 date='2021-05-19T04:43:30Z' progress=1.000000 cache=59.5MiB(371453txo)
2021-05-19T04:55:30Z socket send error Connection reset by peer (104)
2021-05-19T05:09:41Z socket send error Broken pipe (32)
2021-05-19T05:11:26Z UpdateTip: new best=0000000000000000000927dac53c6172ad86801073c9b6b1b16d5ef8924be73f height=684155 version=0x2fffe004 log2_work=92.887693 tx=64328572
8 date='2021-05-19T05:11:11Z' progress=1.000000 cache=61.9MiB(390435txo)
2021-05-19T05:15:12Z UpdateTip: new best=0000000000000000000a765c34307fd6f714a0f22c18e12049b3c4597fcfb36c height=684156 version=0x20000000 log2_work=92.887710 tx=64328788
4 date='2021-05-19T05:14:38Z' progress=1.000000 cache=62.4MiB(394772txo)
2021-05-19T05:22:08Z socket send error Broken pipe (32)
2021-05-19T05:23:31Z socket send error Connection reset by peer (104)
2021-05-19T05:25:32Z ping timeout: 1200.027911s
2021-05-19T05:35:01Z socket receive timeout: 1201s
2021-05-19T05:35:16Z ping timeout: 1200.025226s
2021-05-19T05:35:57Z Pre-allocating up to position 0xb00000 in rev02587.dat
2021-05-19T05:35:58Z UpdateTip: new best=0000000000000000000a9cc21ddd38072b8d227f5138e568ac6a2a65582971e8 height=684157 version=0x20004004 log2_work=92.887727 tx=64329038
2 date='2021-05-19T05:35:44Z' progress=1.000000 cache=63.7MiB(405691txo)
2021-05-19T05:39:04Z UpdateTip: new best=00000000000000000004841ae17b506699b766b188fa0e510567ba35c98c92b6 height=684158 version=0x20800004 log2_work=92.887744 tx=64329238
8 date='2021-05-19T05:38:39Z' progress=1.000000 cache=64.6MiB(413185txo)
2021-05-19T05:47:32Z Disconnecting and discouraging peer 242523!
2021-05-19T05:47:46Z UpdateTip: new best=000000000000000000081cec98a78b7d3addeeca679b191190a8811918a521dd height=684159 version=0x20400004 log2_work=92.887761 tx=64329380
6 date='2021-05-19T05:39:26Z' progress=0.999997 cache=65.5MiB(420226txo)
2021-05-19T05:49:25Z ping timeout: 1200.014902s
2021-05-19T05:55:40Z socket send error Broken pipe (32)
2021-05-19T05:56:07Z ping timeout: 1200.026576s
2021-05-19T05:57:08Z socket send error Broken pipe (32)
2021-05-19T05:59:09Z UpdateTip: new best=00000000000000000000dd7bbc4f4c30b652958d3a3c1211d531ff01984e9274 height=684160 version=0x2000e004 log2_work=92.887778 tx=64329589
0 date='2021-05-19T05:58:32Z' progress=1.000000 cache=67.3MiB(435189txo)
2021-05-19T06:03:46Z Disconnecting and discouraging peer 242887!
2021-05-19T06:32:23Z Potential stale tip detected, will try using extra outbound peer (last tip update: 1994 seconds ago)

It appears those times are correct, and that lnd didn't process them until after 21:38 (I'm assuming because once it had the time fluctuation, it thought bitcoind's block notifications were coming from the future?)

I don't do any sort of utilization monitoring that logs locally.

@Roasbeef
Copy link
Member

Perhaps we should revive this old idea: #2992

If the network link is truly dead, or the node not live, then they won't respond to the ping, so we can filter them out and not forward. In the case above, do you see your node responding to pings from other nodes? It would need to be on the debug logging level.

@maxgoldberg
Copy link
Author

2021-05-19 21:39:13.049 [ERR] HSWC: ChannelLink(CLOSED_CHANNEL_3): failing link: unable to complete dance with error: remote unresponsive

From this, AFAICT, either there's a major network connectivity issue (are either nodes on Tor?) or one of the nodes has simply "frozen". If that happens, then it would make sense that force closes occur: your node is holding onto HTLCs and is no longer live.

I think to the outside world, my node appeared frozen because it seemed to stop doing all of the things it thought it should as soon as the clock changed momentarily. I think there are two issues here: 1) why did the clock change (and seemingly only for lnd) and 2) how does lnd react to clock changes that would cause it to wait an hour to time out all the things that should have timed out as soon as the clock change happened?

@Roasbeef
Copy link
Member

Roasbeef commented May 27, 2021

I think there are two issues here: 1) why did the clock change (and seemingly only for lnd) and 2) how does lnd react to clock changes that would cause it to wait an hour to time out all the things that should have timed out as soon as the clock change happened?

Hard to say without any sort of instrumentation of the process and also other processes running at that time. lnd relies on the go runtime for things like wake up routines, timeouts, backoffs, I/O signals, etc. If that "frozen" for w/e reason, then goroutines wouldn't be scheduled as normal.

Don't think there's really much we can do here without something to reproduce the incident, and also more telemetry on the node itself.

@Roasbeef
Copy link
Member

Based on those ping timeouts in the bitcoind log, and all the sends on broken pipes, it looks like it was also affected in some manner. I don't think lnd nor bitcoind are related to the root cause here, instead something happened to the machine which affected those processes.

@C-Otto
Copy link
Contributor

C-Otto commented May 27, 2021

The ping timeouts for bitcoind are quite normal, this just means that some remote peer "died" without disconnecting properly. I had 3304 of such entries for May 2021 on my node, without any noticable issue.

@maxgoldberg
Copy link
Author

Based on those ping timeouts in the bitcoind log, and all the sends on broken pipes, it looks like it was also affected in some manner. I don't think lnd nor bitcoind are related to the root cause here, instead something happened to the machine which affected those processes.

Not that it helps my case much, but those messages are common in bitcoind debug logs all the time both on this node and others I run at other locations. I'd be happy to find out the issue isn't lnd, but nothing I've seen in any other logs on the machine show the clock changing or any odd behavior.

@Roasbeef
Copy link
Member

Closing due to inactivity, not clear if this is a hardware issue, or something in lnd itself. Will re-open if we gain actionable information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hardware possible issue w/ hardware needs investigation
Projects
None yet
Development

No branches or pull requests

3 participants