Handle connection better: make connectd know which peers are important #7630

rustyrussell · 2024-08-31T07:13:40Z

This series changes connectd to maintain reconnections itself, replacing the rather convoluted logic we had before. lightningd simply tells it which peers are "important", based on which peers have channels. We are also smarter about saving the last known address, if we have successfully established an outgoing connection previously.

This lets connectd be smarter in overload, too, when it needs to select a peer to evict. And it's a chance to update our "startup" logic which tried to space out connections: we keep 10 in flight now, and wait until lightningd has forked off a subdaemon before adding more (though we always let through one a second, for corner cases).

These changes should help large nodes.

Let lightningd feed us hints to try first, but we can extract the addresses from node_announcement messages ourselves. (Lightningd used to ask gossipd on our behalf: this is far simpler!) One side effect of this is that we don't hand back address hints given to us by lightningd: it would use these again for reconnecting. This breaks test_sendpay_grouping, so we disable it temporarily. Signed-off-by: Rusty Russell <[email protected]>

In fact, only 951 of 17419 node announcements (5%) are missing an address (and gossipd doesn't know if we can connect to Tor addresses anyway) so just check it *has* a node_announcement. Signed-off-by: Rusty Russell <[email protected]>

Rather than have lightningd call us repeatedly to try to connect, have it tell us which peers are transient and which aren't, and connectd will automatically try to maintain connections to the non-transient ones. There's a new "downgrade_peer" message to tell it a peer is now transient: to make it non-transient we simply tell connectd to connect as a non-transient. The first time, I missed that dual_open_control does its own state transitions :( Signed-off-by: Rusty Russell <[email protected]> Changelog-Changed: `connectd` now handles maintaining/reconnecting to important peers, and we remember the last successful address we connected to.

We wait until a connection fails, or a subd is connected to the peer, before letting another one through. This should prevent us from overwhelming lightningd on large nodes, but unlike the previous back-off, it's based on how fast lightningd is, not an arbitrary time. We also let one through each second, in case we're connecting to many, but not doing anything but gossip (e.g. 100 explicit connect commands). Signed-off-by: Rusty Russell <[email protected]> Changelog-Changed: Reconnecting to peers at startup should be significantly faster (dependent on machine speed).
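
The throttling described above can be modelled roughly as follows. This is a minimal Python sketch rather than connectd's actual C code: the class name, method names and the injectable clock are all hypothetical, and the limit of 10 is taken from the PR description.

```python
import time


class ConnectThrottle:
    """Toy model of the startup throttle (hypothetical names, not connectd's)."""

    def __init__(self, max_in_flight=10, trickle_interval=1.0, clock=time.monotonic):
        self.max_in_flight = max_in_flight
        self.trickle_interval = trickle_interval  # seconds between forced starts
        self.clock = clock
        self.in_flight = set()
        self.last_start = None

    def may_start(self):
        """True if another outgoing connection attempt may begin now."""
        if len(self.in_flight) < self.max_in_flight:
            return True
        # Window is full: still let one attempt through per interval, so
        # connections that never get a subdaemon cannot stall the queue.
        return self.clock() - self.last_start >= self.trickle_interval

    def start(self, peer_id):
        """Try to begin an attempt; returns False if the caller must wait."""
        if not self.may_start():
            return False
        self.in_flight.add(peer_id)
        self.last_start = self.clock()
        return True

    def finished(self, peer_id):
        """Call when the attempt failed or a subdaemon took over the peer."""
        self.in_flight.discard(peer_id)
```

Because slots are released by `finished()` rather than by a timer, the pace follows how quickly lightningd processes each peer; the one-per-second trickle only matters when nothing is completing.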

The seeker can send a full gossip query, which means the ping doesn't happen (it needs 14-45 seconds of quiet!). We disable the gossip_queries feature, so it doesn't ask.

```
    def test_ping_timeout(node_factory):
        # Disconnects after this, but doesn't know it.
        l1_disconnects = ['xWIRE_PING']

        l1, l2 = node_factory.get_nodes(2, opts=[{'dev-no-reconnect': None,
                                                   'disconnect': l1_disconnects},
                                                  {'dev-no-ping-timer': None}])
        l1.rpc.connect(l2.info['id'], 'localhost', l2.port)

        # This can take 10 seconds (dev-fast-gossip means timer fires every 5 seconds)
        l1.daemon.wait_for_log('seeker: startup peer finished', timeout=15)
        # Ping timers runs at 15-45 seconds, *but* only fires if also 60 seconds
        # after previous traffic.
>       l1.daemon.wait_for_log('dev_disconnect: xWIRE_PING', timeout=60 + 45 + 5)

tests/test_connection.py:4194:
...
>           raise TimeoutError('Unable to find "{}" in logs.'.format(exs))
E           TimeoutError: Unable to find "[re.compile('dev_disconnect: xWIRE_PING')]" in logs.
```

Signed-off-by: Rusty Russell <[email protected]>

rustyrussell added the Highlight - Stability and Security Refinement of basics, prevention and cures label Aug 31, 2024

rustyrussell added this to the v24.11 milestone Aug 31, 2024

rustyrussell force-pushed the guilt/ccan-io-scale branch from f3ff208 to 153bbe1 Compare August 31, 2024 07:24

rustyrussell changed the title ~~Handle connection better: make connectd know what are important~~ Handle connection better: make connectd know which peers are important Sep 1, 2024

rustyrussell force-pushed the guilt/ccan-io-scale branch 4 times, most recently from 28997b7 to 0e5f665 Compare November 21, 2024 03:51

vincenzopalazzo mentioned this pull request Nov 21, 2024

connected is not going to die when lightningd dies #7848

Open

rustyrussell force-pushed the guilt/ccan-io-scale branch 2 times, most recently from aa59a03 to 030b46c Compare November 22, 2024 23:51

rustyrussell added 8 commits November 24, 2024 12:04

connectd: send self-advertizing gossip rather than having gossipd do it.

fceeb30

It's now trivial for us to do this ourselves, since we have gossmap. Signed-off-by: Rusty Russell <[email protected]>

connectd: expose --dev-no-reconnect and --dev-fast-reconnect options.

20348fb

Once connectd is controlling reconnections, it'll need these. Signed-off-by: Rusty Russell <[email protected]>

lightningd: generalize peer_any_channel to filter on entire channel, …

1517f39

…not just state. We're going to use this to ask if there are any channels which make it important to reconnect to the peer. Signed-off-by: Rusty Russell <[email protected]>

wallet: save last known address.

b6a60fe

If we connected out, remember that address. We always remember the last address, but that may be an incoming address. This is explicitly the last outgoing address which worked. Signed-off-by: Rusty Russell <[email protected]>
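
To illustrate the distinction this commit draws, here is a small Python sketch. It is not the wallet code: the class and field names are hypothetical, and the fallback in `reconnect_target` is an assumption rather than something the commit message states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeerAddrs:
    """Hypothetical per-peer record; field names are illustrative only."""

    last_addr: Optional[str] = None        # most recent connection, either direction
    last_known_addr: Optional[str] = None  # most recent *outgoing* success only

    def record_connection(self, addr: str, incoming: bool) -> None:
        self.last_addr = addr
        if not incoming:
            # Only an address we dialed ourselves is known to be reachable.
            self.last_known_addr = addr

    def reconnect_target(self) -> Optional[str]:
        # Assumption: fall back to the last address if we never connected out.
        return self.last_known_addr or self.last_addr


peer = PeerAddrs()
peer.record_connection("203.0.113.5:9735", incoming=False)
peer.record_connection("198.51.100.7:51234", incoming=True)
```

After the incoming connection, `last_addr` holds the peer's ephemeral source address, which is unlikely to accept connections, while `last_known_addr` still holds the address that worked when dialing out.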

common: routine to make wireaddr_internal from wireaddr.

fcee7e6

Signed-off-by: Rusty Russell <[email protected]>

recovery: save last_known_addr for peer if we know it.

c4971f0

This is more useful than the last address, which may be it connecting to us. And use it when we restore it. Signed-off-by: Rusty Russell <[email protected]>

rustyrussell force-pushed the guilt/ccan-io-scale branch from f989f46 to 567b982 Compare November 24, 2024 01:59

rustyrussell added 7 commits November 25, 2024 11:38

pytest: restore test_sendpay_grouping test.

6d66025

Signed-off-by: Rusty Russell <[email protected]>

connectd: remove transient flag.

6d1e245

The important flag replaces it, and now we can be more intelligent about eviction in overload. Signed-off-by: Rusty Russell <[email protected]>
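
One way to picture eviction that respects the important flag is the Python sketch below. The names are hypothetical and the policy is an assumption (evict the first non-important peer, refuse if every peer is important); the commit message does not spell out the exact selection rule.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Peer:
    """Hypothetical peer record, not connectd's struct."""

    node_id: str
    important: bool  # set by lightningd, e.g. because the peer has channels


def pick_eviction_candidate(peers: Iterable[Peer]) -> Optional[Peer]:
    """Return a peer to drop under overload, or None if all are important."""
    for peer in peers:
        if not peer.important:
            return peer
    # Assumed behaviour: never evict an important peer in this sketch.
    return None
```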

pytest: add test for connection ratelimiting.

6fb4d0b

Signed-off-by: Rusty Russell <[email protected]>

pytest: fix flake in test_gossip_force_broadcast_channel_msgs

f0ae5a3

Signed-off-by: Rusty Russell <[email protected]>

rustyrussell force-pushed the guilt/ccan-io-scale branch from 567b982 to f0ae5a3 Compare November 25, 2024 01:08
