Peer disappears from peerswap-listpeers randomly #185

Closed
grubles opened this issue May 25, 2023 · 13 comments · Fixed by #213

@grubles
Collaborator

grubles commented May 25, 2023

On a CLN v23.05 node, a v23.02 peer (Blockstream Store node) running Peerswap seems to randomly disappear from peerswap-listpeers.

Force-disconnecting the node and letting CLN reconnect temporarily fixes it, but over time the node disappears again. The v23.05 node has a channel to another v23.02 peer with PeerSwap where this does not happen, which is strange. I'm going to dig through the logs to see if I can spot anything obvious. I will also try to replicate this on signet.

wtogami added this to the v1.0 milestone Jul 14, 2023
@wtogami
Contributor

wtogami commented Jul 14, 2023

Adding this as a release blocker.

We've continued to see Blockstream Store running v23.02 disappear from peerswap-listpeers on multiple CLN v23.05 nodes. We're doing some testing before upgrading that node to v23.05. If that turns out to be the fix, then we need to declare a higher minimum CLN version for PeerSwap.

@nepet
Contributor

nepet commented Jul 14, 2023

Does the node also disappear from CLN's listpeers?

@nepet
Contributor

nepet commented Jul 14, 2023

OK, maybe I found the problem. Let's try to verify it. I am seeing the following log messages:

2023-07-14T13:48:11.174Z INFO    {redacted pubkey}-chan#***: Peer transient failure in CHANNELD_NORMAL: Disconnected
2023-07-14T13:48:12.047Z INFO    {redacted pubkey}-chan#***: Peer transient failure in CHANNELD_NORMAL: channeld WARNING: update_fee 253 outside range 702-39788 (currently 3245)

The node behaves according to the specification:

  - if the `update_fee` is too low for timely processing, OR is unreasonably large:
    - MUST send a `warning` and close the connection, or send an
      `error` and fail the channel.

It seems that the node disconnects because the update_fee is too low. Can you find evidence on your end that this is what causes the disconnect?

@nepet
Contributor

nepet commented Jul 14, 2023

Just chatted with @wtogami. This does not seem to be the cause of the issue, so it needs further investigation.

@wtogami
Contributor

wtogami commented Jul 16, 2023

Possible: I think I began seeing this after #189 was merged.

Definite: Two CLN v23.05.x nodes see Blockstream Store v23.02.x disappear from peerswap-listpeers, but not from listpeers, after a few hours. Running lightning-cli disconnect <PEERID> force followed by lightning-cli connect <PEERID> fixes it for a while, but then the peer disappears again.

@wtogami
Contributor

wtogami commented Jul 19, 2023

The below might have been hitting the bug fixed by #206 which is different from the original bug here. We need to wait 4+ hours to see if this happens again.

I was hoping it was somehow CLN. Blockstream Store is now upgraded to matching CLN v23.05.2. Unfortunately it still exhibits this problem.

This is a release blocker since it breaks swaps with the largest PeerSwap demo node.

I think this began after #189 so that would be the first place I'd look.

More Diagnostics

  • Force disconnect and reconnect of that peer previously fixed it. It doesn't fix it anymore. Curious.
  • After restarting the peerswap plugin on the other side, it never sees "Received poll" from Blockstream Store, while all the other peers work fine. This suggests the plugin stopped on Blockstream Store.

@nepet
Contributor

nepet commented Jul 19, 2023

I had an interesting conversation about this issue, and it could be that we overload CLN with the way we trigger the poll messages. Right now, we send out the poll messages in parallel every other hour. As the BS node has quite a few peers, these messages might overflow or be rejected.

In the past we might have misinterpreted errors on SendCustomMsg as disconnected peers, and the BS node was logging these so noisily that we silenced the logs in #189.

A possible solution to fix this problem would be to spread out the load of the polling system in a way that we do not send out all messages at once but in a sequential manner with a timeout between the calls. Possible data structures to accomplish this could be a priority queue or a min heap, both ordered by timestamp.
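
To make the idea concrete, here is a minimal Python sketch of a timestamp-ordered min-heap that staggers polls instead of sending them all at once. It is an illustration only, not PeerSwap's actual implementation: the names build_poll_schedule, run_due_polls, gap, interval, and send_poll are hypothetical, and send_poll stands in for whatever call actually delivers the poll message.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

Schedule = List[Tuple[float, str]]  # min-heap of (due_time, peer_id)


def build_poll_schedule(peers: Iterable[str], start: float, gap: float) -> Schedule:
    """Give every peer its own due time, spaced `gap` seconds apart."""
    heap = [(start + i * gap, peer) for i, peer in enumerate(peers)]
    heapq.heapify(heap)
    return heap


def run_due_polls(
    heap: Schedule,
    now: float,
    interval: float,
    send_poll: Callable[[str], None],  # hypothetical sender, assumed to raise on failure
) -> List[str]:
    """Poll every peer whose due time has passed and requeue it.

    Returns the peers that were polled successfully. A failed send is not
    treated as a disconnect: the peer simply stays on the schedule.
    """
    # Pop all due entries first so a requeued peer is never polled twice per call.
    due_now = []
    while heap and heap[0][0] <= now:
        due_now.append(heapq.heappop(heap))

    polled = []
    for due, peer in due_now:
        try:
            send_poll(peer)
            polled.append(peer)
        except Exception:
            pass  # keep the peer scheduled; the caller may want to log this
        # Requeue relative to the old due time so the stagger is preserved.
        heapq.heappush(heap, (due + interval, peer))
    return polled
```

A driver loop would call run_due_polls with the current time at a short, fixed cadence. Because each peer keeps its own offset, the sends stay spread across the polling interval rather than arriving in one burst.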

Additionally we need to look out for log messages beginning with poll_service: could not send msg to ... on the store node. These are on log level debug.
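
A small helper along these lines could pull those messages out of a log. This is a sketch under assumptions: only the message prefix comes from this thread, the rest of the log line layout is not assumed, and the function name poll_send_failures is hypothetical. Debug-level logging has to be enabled for the lines to appear at all.

```python
from typing import Iterable, List

# Message prefix quoted in this thread; everything else about the line is unknown.
POLL_SEND_FAILURE = "poll_service: could not send msg to"


def poll_send_failures(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain the poll send-failure message."""
    return [line.rstrip("\n") for line in lines if POLL_SEND_FAILURE in line]
```

It accepts any iterable of lines, so an open log file object can be passed in directly.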

@grubles
Collaborator Author

grubles commented Jul 19, 2023

Could the Store node be hitting #186? That said, a node I have access to shows a few zombie processes, yet commands still work, so perhaps that is not the problem.

$ ps aux | grep peerswap
user      951190  0.0  0.0 1174208 17152 pts/4   Sl   Jul17   0:00 .../peerswap
user      965566  0.0  0.0 1174208 15360 pts/4   Sl   Jul17   0:00 .../peerswap
user     1064134  1.3  0.0 2658112 17536 pts/4   Sl   Jul18  12:00 .../peerswap
$ lightning-cli peerswap-listpeers

[
   {
      "nodeid": "...",

      etc.

@nepet
Contributor

nepet commented Jul 19, 2023

I doubt that this is related, but this issue and #186 are my highest priorities.

@wtogami
Contributor

wtogami commented Jul 19, 2023

No, Blockstream Store restarts entire Docker containers in order to do upgrades, so old processes have no opportunity to survive.

@wtogami
Contributor

wtogami commented Jul 20, 2023

> The below might have been hitting the bug fixed by #206 which is different from the original bug here. We need to wait 4+ hours to see if this happens again.

Confirmed: it still happens that other nodes can't see Blockstream Store after a few hours. Blockstream Store is running CLN v23.05.2 with PeerSwap 725ca2c. Meanwhile, peerswap-listpeers on Blockstream Store is able to see the remote peers.

@nepet
Contributor

nepet commented Jul 20, 2023

I am wondering, is there any benefit in persisting peerswap-peers in the database? Wouldn't it be sufficient to just store them in memory?

@wtogami
Contributor

wtogami commented Jul 20, 2023

> I am wondering, is there any benefit in persisting peerswap-peers in the database? Wouldn't it be sufficient to just store them in memory?

Memory-only is fine, but that wouldn't fix our current problem, right?
