-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
multi: ensure link is always torn down due to db failures, add exponential back off for sql-kvdb failures #7927
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for picking up the issue! Straightforward fix and just need to fix the imports.
// returned, then this means that daemon is shutting down so we | ||
// should abort the retries. | ||
waitBeforeRetry := func(attemptNumber int) bool { | ||
retryDelay := randRetryDelay( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking, but we could use ticker := chain.NewJitterTicker(DefaultInitialRetryDelay, 0.5)
instead.
5a217ec
to
c768770
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM🌊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
if IsSerializationError(dbErr) { | ||
_ = tx.Rollback() | ||
|
||
if waitBeforeRetry(i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if we know that there is not going to be a next iteration, we can skip the last wait (which will also be the longest) and return the ErrRetriesExceeded
here
The log message is off by one.
When the revocation of a channel state fails after receiving a new CommitmentSigned msg we have to fail the channel otherwise we continue with an unclean state.
If we couldn't revoke due to a DB error, then we want to also tear down the connection, as we don't want the other party to continue to send updates. That may lead to de-sync'd state an eventual force close. Otherwise, the database might be able to recover come the next reconnection attempt.
In this commit, we modify the default isolation level to be `sql.LevelSerializable. This is the strictness isolation type for postgres. For sqlite, there's only ever a single writer, so this doesn't apply directly.
…ilures In this commit, we add randomized exponential backoff for serialization failures. For postgres, we''ll his this any time a transaction set fails to be linearized. For sqlite, we'll his this if we have many writers trying to grab the write lock at time same time, manifesting as a `SQLITE_BUSY` error code. As is, we'll retry up to 10 times, waiting a minimum of 50 miliseconds between each attempt, up to 5 seconds without any delay at all. For sqlite, this is also bounded by the busy timeout set, which applies on top of this retry logic (block for busy timeout seconds, then apply this back off logic).
c768770
to
9e25f9f
Compare
This PR resolves two outstanding issues:
kvdb
SQL emulation backends, if we got an error (SQLITE_BUSY
, etc), we should just fail right then. We now implement an exponential back off with some jitter to give other writing goroutines time to finish/interleave their writes.For 1, I tacked on an extra commit to just tear down the link all together. This may save us from a scenario where Alice had a DB failure, Bob then tries to send a new state, but Alice has torn down the link, but not the TCP connection. In this case, Bob is now "stuck" as she's sent out a sig, and if Alice never recovers, they may need to go on chain to time out the HTLC.
In the 0.18 cycle, as we move to integrate the pure SQL backends for payments/invoices, we'll want to attempt to unify some of this logic, as we effectively have two sql scaffoldings today: KV-emulation, and proper schema based.
Replaces https://github.com/lightningnetwork/lnd/pull/7876/files
Fixes #7869