-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle connection better: make connectd know which peers are important #7630
Open
rustyrussell
wants to merge
15
commits into
ElementsProject:master
Choose a base branch
from
rustyrussell:guilt/ccan-io-scale
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Handle connection better: make connectd know which peers are important #7630
rustyrussell
wants to merge
15
commits into
ElementsProject:master
from
rustyrussell:guilt/ccan-io-scale
+905
−728
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
rustyrussell
added
the
Highlight - Stability and Security
Refinement of basics, prevention and cures
label
Aug 31, 2024
rustyrussell
force-pushed
the
guilt/ccan-io-scale
branch
from
August 31, 2024 07:24
f3ff208
to
153bbe1
Compare
rustyrussell
changed the title
Handle connection better: make connectd know what are important
Handle connection better: make connectd know which peers are important
Sep 1, 2024
rustyrussell
force-pushed
the
guilt/ccan-io-scale
branch
4 times, most recently
from
November 21, 2024 03:51
28997b7
to
0e5f665
Compare
rustyrussell
force-pushed
the
guilt/ccan-io-scale
branch
2 times, most recently
from
November 22, 2024 23:51
aa59a03
to
030b46c
Compare
It's now trivial for us to do this ourselves, since we have gossmap. Signed-off-by: Rusty Russell <[email protected]>
Let lightningd feed us hints to try first, but we can extract the addresses from node_announcement messages ourselves. (Lightningd used to ask gossipd on our behalf: this is far simpler!) One side effect of this is that we don't hand back address hints given to us by lightningd: it would use these again for reconnecting. This is breaks test_sendpay_grouping, so we disable it temporarily. Signed-off-by: Rusty Russell <[email protected]>
In fact, only 951 of 17419 (5%) of node announcements are missing an address (and gossipd doesn't know if we can connect to Tor addresses anyway) so just check it *has* a node_announcement. Signed-off-by: Rusty Russell <[email protected]>
Once connectd is controlling reconnections, it'll need these. Signed-off-by: Rusty Russell <[email protected]>
…not just state. We're going to use this to ask if there are any channels which make it important to reconnect to the peer. Signed-off-by: Rusty Russell <[email protected]>
If we connected out, remember that address. We always remember the last address, but that may be an incoming address. This is explicitly the last outgoing address which worked. Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
This is more useful than the last address, which may be it connecting to us. And use it when we restore it. Signed-off-by: Rusty Russell <[email protected]>
rustyrussell
force-pushed
the
guilt/ccan-io-scale
branch
from
November 24, 2024 01:59
f989f46
to
567b982
Compare
Rather than have lightningd call us repeatedly to try to connect, have it tell us what peers are transient and aren't, and connectd will automatically try to maintain that connection. There's a new "downgrade_peer" message to tell it a peer is now transient: to make it non-transient we simply tell connectd to connect as a non-transient. The first time, I missed that dual_open_control does its own state transitions :( Signed-off-by: Rusty Russell <[email protected]> Changelog-Changed: `connectd` now handles maintaining/reconnecting to important peers, and we remember the last successful address we connected to.
Signed-off-by: Rusty Russell <[email protected]>
The important flag replaces it, and now we can be more intelligent about eviction in overload. Signed-off-by: Rusty Russell <[email protected]>
We wait until a connection fails, or a subd is connected to the peer, before letting another one through. This should prevent us from overwhelming lightningd on large nodes, but unlike the previous back-off, it's based on how fast lightningd is, not an arbitrary time. We also let one through each second, in case we're connecting to many, but not doing anything but gossip (e.g. 100 explicit connect commands). Signed-off-by: Rusty Russell <[email protected]> Changelog-Changed: Reconnecting to peers at startup should be significantly faster (dependent on machine speed).
Signed-off-by: Rusty Russell <[email protected]>
The seeker can send a full gossip query, which means the ping doesn't happen (it needs 14-45 seconds of quiet!). We disable the gossip_queries feature, so it doesn't ask. ``` def test_ping_timeout(node_factory): # Disconnects after this, but doesn't know it. l1_disconnects = ['xWIRE_PING'] l1, l2 = node_factory.get_nodes(2, opts=[{'dev-no-reconnect': None, 'disconnect': l1_disconnects}, {'dev-no-ping-timer': None}]) l1.rpc.connect(l2.info['id'], 'localhost', l2.port) # This can take 10 seconds (dev-fast-gossip means timer fires every 5 seconds) l1.daemon.wait_for_log('seeker: startup peer finished', timeout=15) # Ping timers runs at 15-45 seconds, *but* only fires if also 60 seconds # after previous traffic. > l1.daemon.wait_for_log('dev_disconnect: xWIRE_PING', timeout=60 + 45 + 5) tests/test_connection.py:4194: ... > raise TimeoutError('Unable to find "{}" in logs.'.format(exs)) E TimeoutError: Unable to find "[re.compile('dev_disconnect: xWIRE_PING')]" in logs. ``` Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
rustyrussell
force-pushed
the
guilt/ccan-io-scale
branch
from
November 25, 2024 01:08
567b982
to
f0ae5a3
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This series changes connectd to maintain reconnections, rather than the rather convoluted logic we had before. lightningd simply tells it which peers are "important", based on which peers have channels. We also are smarter about saving the last known address, if we have a successfully established an outgoing connection previously.
This lets connectd be smarter in overload, too, when it needs to select a peer to evict. And it's a chance to update our "startup" logic which tried to space out connections: we keep 10 in flight now, and wait until lightingd has forked off a subdaemon before adding more (though we always let through one a second, for corner cases).
These changes should help large nodes.