-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvdb/postgres: remove global application level lock #8529
Conversation
In this commit, we remove the global application level lock from the postgres backend. This lock prevents multiple write transactions from happening at the same time, and will also block a writer if a read is on going. Since this lock was added, we know always open DB connections with the strongest level of concurrency control available: `LevelSerializable`. In concert with the new auto retry logic, we ensure that if db transactions conflict (writing the same key/row in this case), then the tx is retried automatically. Removing this lock should increase perf for the postgres backend, as now concurrent write transactions can proceed, being serialized as needed. Rather then trying to handle concurrency at the application level, we'll set postgres do its job, with the application only needing to retry as necessary.
Some sub-systems like btcwallet will return an error from the database, but they won't properly wrap it. As a result, we were unable to actually catch the serialization errors in the first place. To work around this, we'll now attempt to parse the error string directly.
With the new postgres concurrency control, an error may come from a bucket function that's actually a postgres error. In this case, we need to return early so we can retry the txn. Otherwise, we'll be working with an aborted tx, and never actually return the error so we don't auto retry.
In this commit, we fix a bug that would cause the entire db to shutdown if hit a panic (since db operations in the main buckets exit with a panic) while executing a txn call back. This might be a postgres error we need to check, so we don't want to bail out, and instead want to pass up the error to the caller so we can retry if needed.
At times we'll get an error from the transaction call back itself, since we may be using postgres over streaming RPC. In this case, we still need to roll back then attempt to retry.
In this commit, we fix a bug in the commit retry loop, we'll now make sure that if we get an error on commit, we'll map it to the SQL error then attempt to decide if we need to retry or not.
Because rolling back any changes that haven't been committed yet is essential when re-trying, if rolling back we should fail hard.
Important Auto Review SkippedAuto reviews are limited to the following labels: llm-review. Please add one of these labels to enable auto reviews. Please check the settings in the CodeRabbit UI or the To trigger a single review, invoke the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
Pull reviewers statsStats of the last 30 days for lnd:
|
Turned it into a draft, since it doesn't actually pass the integration tests yet. Still debugging the cause(s). |
Includes a fix for a wrap, and module replace.
@@ -345,7 +345,7 @@ func (db *db) executeTransaction(f func(tx walletdb.ReadWriteTx) error, | |||
} | |||
} | |||
|
|||
return commitErr | |||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should still return the commitErr
as otherwise, if it can't be mapped to a SQLError
, then we always return nil
, when the actual error should be thread through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also revisiting this section, it should be trying to map the commitErr
here, now fnErrr
(shadowed from above).
channeldb/payment_control.go
Outdated
@@ -171,14 +171,15 @@ func (p *PaymentControl) InitPayment(paymentHash lntypes.Hash, | |||
// retrying the payment or return a specific error. | |||
case err == nil: | |||
if err := paymentStatus.initializable(); err != nil { | |||
updateErr = err | |||
updateErr = fmt.Errorf("initialize: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So If I revert this back to just:
updateErr = err
Then strangely, the sendtoroute_multi_path_payment
starts to pass again 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, I also don't wrap the error down below, pushing this up as a commit to see how CI fares now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why, but when we wrap the error here, this error check fails, causing the whole itest to fail:
Line 2554 in 48a3d56
case mpp != nil && errors.Is(err, channeldb.ErrPaymentExists): |
This causes itests to fail, still need to investigate why the other errors.Is check isn't catching this.
This fixes a bug as we were mapping the `fnErr`, which was already nil if we got to this point. We'll also return the `commitErr` as is, if we were unable to match it. This ensures we don't mask an error.
Pushed a few commits. Unit tests should pass again now, allowing us to narrow down if the postgres itests are still failing. Left some comments above, and put some context in the temp commits pushed up. |
We'll now actually catch the error for a failed rollback, so the error we return in the case of a severed connection differs from before, requiring an update to the tests.
Pushed up another unit test fix. The error on panic due to a closed connection differs now as we catch the rollback error. The latest itest failure is now:
This is from the async payment benchmark test ( |
This is an attempt to get the async payment test passing. This test creates 488 concurrent payments at once.
Also I realize we need to port/merge these serialization retry changes to |
@guggero, remember to re-request review from reviewers when ready |
Will be replaced by #8611. |
Replacing #7992, pushing to upstream to enable collaboration.
Depends on btcsuite/btcwallet#913.
Fixes #8049.
In this PR, we remove the global application level lock from the postgres backend. This lock prevents multiple write transactions from happening at the same time, and will also block a writer if a read is on going. Since this lock was added, we know always open DB connections with the strongest level of concurrency control available:
LevelSerializable
. In concert with the new auto retry logic, we ensure that if db transactions conflict (writing the same key/row in this case), then the tx is retried automatically.Removing this lock should increase perf for the postgres backend, as now concurrent write transactions can proceed, being serialized as needed. Rather then trying to handle concurrency at the application level, we'll set postgres do its job, with the application only needing to retry as necessary.