Don't block on channel arbitrator startup #9324
base: master
Conversation
Force-pushed from ba3a862 to 2f28012
When starting the channel arbitrators, we make sure they are started concurrently.
The channel arbitrator startups now have a configurable timeout, but this config value is hidden.
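The option itself is not shown in this description. As a hedged sketch only, a hidden, configurable timeout in a go-flags style config struct (the library lnd uses for its config) could look roughly like the following; the field name, flag name, and default are hypothetical and not the actual values added by this PR.

package sketch

import "time"

// DefaultChannelArbStartupTimeout is an illustrative default only.
const DefaultChannelArbStartupTimeout = 5 * time.Minute

// DevConfig is a hypothetical config struct; not the actual struct touched
// by this PR.
type DevConfig struct {
	// ChannelArbStartupTimeout bounds how long the ChainArbitrator waits
	// for all channel arbitrators to start. The `hidden:"true"` struct tag
	// keeps the flag out of --help output while still allowing it to be
	// set, assuming the jessevdk/go-flags hidden-tag behavior.
	ChannelArbStartupTimeout time.Duration `long:"channel-arb-startup-timeout" hidden:"true" description:"Maximum time to wait for channel arbitrators to start."`
}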
Force-pushed from 2f28012 to c1dc62b
Thanks a lot for looking into this. I think we can simplify the code somewhat and also reduce the number of goroutines a bit, see inline comment.
@@ -50,6 +50,11 @@

* Make sure the RPC clients used to access the chain backend are [properly
  shutdown](https://github.com/lightningnetwork/lnd/pull/9261).

* [Start channel arbitrators concurrently](
This should go into the v0.18.4 release notes.
@@ -770,20 +778,69 @@ func (c *ChainArbitrator) Start() error {

// Launch all the goroutines for each arbitrator so they can carry out
// their duties.
// Set a timeout for the group so we don't wait indefinitely if one
// startup function blocks.
ctx, cancel := context.WithTimeout(
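The hunk above is cut off at the context.WithTimeout call. As a hedged sketch of the general errgroup-plus-timeout pattern it hints at (not the PR's exact code; c.cfg.StartupTimeout and stopAndLog are taken from the discussion below), the loop might continue roughly like this:

// Rough sketch only, not the PR's actual diff.
ctx, cancel := context.WithTimeout(
	context.Background(), c.cfg.StartupTimeout,
)
defer cancel()

eg, _ := errgroup.WithContext(ctx)
for _, arbitrator := range c.activeChannels {
	arbitrator := arbitrator
	startState, ok := startStates[arbitrator.cfg.ChanPoint]
	if !ok {
		stopAndLog()
		return fmt.Errorf("arbitrator: %v has no start state",
			arbitrator.cfg.ChanPoint)
	}

	eg.Go(func() error {
		return arbitrator.Start(startState)
	})
}

// Note: because arbitrator.Start does not take a context, eg.Wait still
// blocks until every Start call returns; the timeout context by itself
// cannot interrupt a blocked Start. This is the concern raised in the
// review comment below.
if err := eg.Wait(); err != nil {
	stopAndLog()
	return err
}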
I'm not sure the cancel and error group here give us much.
This would of course be different if we passed an actual context into arbitrator.Start() that would then also cancel the blocking behavior. But that's currently not the case and would probably be a much bigger diff.
So the error group also doesn't help us, because canceling the context won't stop new goroutines from being spawned.
So here's the approach I came up with (potentially not perfect yet either, mostly to show you what I mean):
// Launch all the goroutines for each arbitrator so they can carry out
// their duties.
//
// Set a timeout for the group so we don't wait indefinitely if one
// startup function blocks.
c.Lock()
numArbitrators := len(c.activeChannels)
startupTimeout := time.After(c.cfg.StartupTimeout)
startErrors := make(chan error, numArbitrators)
for _, arbitrator := range c.activeChannels {
	startState, ok := startStates[arbitrator.cfg.ChanPoint]
	if !ok {
		c.Unlock()
		stopAndLog()
		return fmt.Errorf("arbitrator: %v has no start state",
			arbitrator.cfg.ChanPoint)
	}

	// Create a non-blocking goroutine for the actual Start call.
	go func(a *ChannelArbitrator) {
		startErrors <- a.Start(startState)
	}(arbitrator)
}

// Wait for the group to finish and if an error occurred we trigger a
// graceful shutdown of LND.
go func() {
	defer c.Unlock()

	var numStarted int
	for {
		select {
		case startErr := <-startErrors:
			if startErr != nil {
				stopAndLog()
				log.Criticalf("ChainArbitrator failed "+
					"to start all channel "+
					"arbitrators: %v", startErr)
				return
			}

			numStarted++

			// We've successfully started all arbitrators.
			if numStarted == numArbitrators {
				return
			}

		case <-startupTimeout:
			stopAndLog()
			log.Criticalf("timeout waiting for channel " +
				"arbitrator start")
			return

		case <-c.quit:
			// The chain arbitrator is shutting down, so stop
			// waiting for the remaining startups.
			log.Warnf("Stopped waiting for channel arbitrator "+
				"startup: %v", ErrChainArbExiting)
			return
		}
	}
}()
Main changes:

- Instead of spawning a goroutine within the error group goroutine, we just spawn one per arbitrator and collect into a single channel.
- We don't use a cancel context because it can't be passed into arbitrator.Start() yet anyway, so a simple timeout is enough.
- Because we now operate on the arbitrators outside of the scope of Start() (within goroutines), we probably also need to block changes to those arbitrators until we're done. That's why I added the c.Lock() and corresponding unlocks. But I'm not sure if those could lead to new deadlocks, so we need to look at those very carefully (see the toy sketch below).
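To make the deadlock worry concrete, here is a self-contained toy sketch (hypothetical types, not lnd code): the mutex taken in Start is only released from a background goroutine once startup completes, so any other method that needs the same mutex waits for the whole startup window, or forever if a startup never finishes and there is no timeout.

package main

import (
	"fmt"
	"sync"
	"time"
)

// arbiter is a toy stand-in for a component whose Start holds a mutex
// until an asynchronous startup completes.
type arbiter struct {
	sync.Mutex
}

func (a *arbiter) Start(done chan struct{}) {
	a.Lock()
	go func() {
		// Unlock only once startup has finished.
		defer a.Unlock()
		<-done
	}()
}

// ResolveContract stands in for any other method that takes the mutex.
func (a *arbiter) ResolveContract() {
	a.Lock()
	defer a.Unlock()
	fmt.Println("resolved")
}

func main() {
	a := &arbiter{}
	done := make(chan struct{})
	a.Start(done)

	go func() {
		// Simulate a slow arbitrator startup.
		time.Sleep(500 * time.Millisecond)
		close(done)
	}()

	// This call blocks until the startup goroutine releases the mutex.
	a.ResolveContract()
}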
Left some comments here, I don't think the fix should happen in lnd. From the logs, what happened was the chan newInputs was blocked since we sent two inputs during the startup, and the first input was blocking on aux.ExtraBudgetForInputs:
2024-11-29 10:15:07.185 [INF] SWPR: Sweep request received: out_point=e4a01568a0cd879df94cd6c9ae8143e1304bf1731ddee7003c6f12df4f7baf13:0, witness_type=TaprootAnchorSweepSpend, relative_time_lock=0, absolute_time_lock=0, amount=0.00000330 BTC, parent=(<nil>),
...
2024-11-29 10:15:07.192 [INF] CNCT: ChannelArbitrator(ce3ab677a419e40558d095faf7165ba143467c1dc5691632ad511224473eabc9:0): offering anchor from local commitment 7a4c66554f168a903202ed29e2a13e1061b8c0f484a4be7a2ef8d69ceac382c5:1 to sweeper with deadline=None, budget=0.00000330 BTC
2024-11-29 10:15:07.192 [INF] SWPR: Sweep request received: out_point=7a4c66554f168a903202ed29e2a13e1061b8c0f484a4be7a2ef8d69ceac382c5:1, witness_type=TaprootAnchorSweepSpend, relative_time_lock=0, absolute_time_lock=0, amount=0.00000330 BTC, parent=(fee=0.00001366 BTC, weight=958 wu), params=(startingFeeRate={false 0}, immediate=false, exclusive_group=948947804208955392, budget=0.00000330 BTC, deadline=none)
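A minimal, self-contained sketch (toy names, not the actual sweeper code) of the failure mode described above: a send on an unbuffered channel only completes once the receiver takes the value, so if the receiving loop is stuck in a synchronous aux call, a second offer from the channel arbitrator blocks as well.

package main

import (
	"fmt"
	"time"
)

func main() {
	// newInputs is unbuffered, like the sweeper's input channel.
	newInputs := make(chan string)

	// The "sweeper" goroutine: the first input triggers a slow,
	// synchronous call (standing in for aux.ExtraBudgetForInputs),
	// during which no further inputs can be received.
	go func() {
		for in := range newInputs {
			fmt.Println("received:", in)
			time.Sleep(2 * time.Second) // blocking aux call
		}
	}()

	newInputs <- "anchor sweep #1" // accepted immediately
	fmt.Println("offered #1")

	// This send blocks until the slow call above returns; if that call
	// can never finish before lnd is fully started, the caller (the
	// channel arbitrator's startup path) hangs here.
	newInputs <- "anchor sweep #2"
	fmt.Println("offered #2")
}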
There are multiple ways we can fix this:

- Make newInputs a buffered chan, which would defeat blockbeat as we no longer guarantee the order "channel arbitrator offers -> sweeper sweeps" (see the sketch after this list).
- We could instead remove the advanceState in ChannelArbitrator.Start, since the channels will be handled in the main goroutine channelAttendant anyway.
- Fix it on the tapd side, which feels more appropriate: if the calls to methods of the AuxSweeper are blocking, we may have had some incorrect assumptions when using them in the sweeper.
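For illustration of the first option's trade-off, a small toy program (not the sweeper's real API): a buffered channel makes the offer non-blocking, but the arbitrator then gets no signal that the sweeper has actually picked the input up.

package main

import "fmt"

func main() {
	// Buffered: sends return as soon as the value is queued, so the
	// arbitrator's startup path no longer blocks on the sweeper. The
	// cost: "offered" no longer implies "received by the sweeper", which
	// is the ordering blockbeat relies on. Buffer size is illustrative.
	newInputs := make(chan string, 16)

	newInputs <- "anchor sweep #1" // returns immediately
	newInputs <- "anchor sweep #2" // also immediate, even though nothing
	// has read #1 yet

	fmt.Println("both offers queued:", len(newInputs), "pending")
}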
contractcourt/chain_arbitrator.go
Outdated
const (
	// chainArbTimeout is the timeout for the chain arbitrator to start
	// the channel arbitrators for each channel.
	chainArbTimeout = 5 * time.Minute
I think we should identify the exact line that is blocking and fix it properly, like buffering or using a new goroutine. Throwing a random timeout feels hacky.
I'm not sure we can fix this in tapd. If the sweeper requires additional instructions from the AuxSweeper, but the aux sweeper cannot start because lnd isn't fully started, then we can't just make this non-blocking. Otherwise we wouldn't be able to return the correct aux leaf information.
So I think a good solution is probably option 2 then: why trigger advanceState when we do it later on in channelAttendant anyway? I agree that random timeouts might not be a good solution here. Will evaluate how big the change might be.
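As a hedged sketch of what option 2 amounts to (toy types that merely compile as a standalone package; not lnd's actual ChannelArbitrator code): the first state advance moves out of Start and into the long-lived channelAttendant goroutine, so Start can no longer block on downstream components such as the sweeper.

package sketch

import "sync"

// arbitrator is a toy stand-in for ChannelArbitrator.
type arbitrator struct {
	quit chan struct{}
	wg   sync.WaitGroup
}

func (a *arbitrator) Start() error {
	// Previously something like a.advanceState(...) ran here and could
	// block; now Start only launches the attendant and returns.
	a.wg.Add(1)
	go a.channelAttendant()
	return nil
}

func (a *arbitrator) channelAttendant() {
	defer a.wg.Done()

	// The initial state advance is driven from the main goroutine,
	// after Start has already returned.
	a.advanceState()

	<-a.quit
}

func (a *arbitrator) advanceState() {
	// Placeholder for the state-machine transition logic.
}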
I tested out the second idea here, turns out it's a small change, and all the itests passed!
Replaces #9262
Fixes: #9323