Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvdb/postgres: remove global application level lock #7992

Closed

Conversation

Roasbeef
Copy link
Member

In this commit, we remove the global application level lock from the postgres backend. This lock prevents multiple write transactions from happening at the same time, and will also block a writer if a read is on going. Since this lock was added, we know always open DB connections with the strongest level of concurrency control available: LevelSerializable. In concert with the new auto retry logic, we ensure that if db transactions conflict (writing the same key/row in this case), then the tx is retried automatically.

Removing this lock should increase perf for the postgres backend, as now concurrent write transactions can proceed, being serialized as needed. Rather then trying to handle concurrency at the application level, we'll set postgres do its job, with the application only needing to retry as necessary.

@guggero
Copy link
Collaborator

guggero commented Sep 19, 2023

Looks like transactions are actually conflicting even at startup:

 2023-09-18 20:47:30.985 UTC [57] ERROR:  could not serialize access due to read/write dependencies among transactions
2023-09-18 20:47:30.985 UTC [57] DETAIL:  Reason code: Canceled on identification as a pivot, during write.
2023-09-18 20:47:30.985 UTC [57] HINT:  The transaction might succeed if retried.
2023-09-18 20:47:30.985 UTC [57] STATEMENT:  INSERT INTO walletdb_kv (key, value, parent_id) VALUES($1, $2, $3) ON CONFLICT (key, parent_id) WHERE parent_id IS NOT NULL DO UPDATE SET value=$2 WHERE walletdb_kv.value IS NOT NULL
2023-09-18 20:47:31.003 UTC [57] ERROR:  could not serialize access due to read/write dependencies among transactions
2023-09-18 20:47:31.003 UTC [57] DETAIL:  Reason code: Canceled on identification as a pivot, during write.
2023-09-18 20:47:31.003 UTC [57] HINT:  The transaction might succeed if retried.
2023-09-18 20:47:31.003 UTC [57] STATEMENT:  INSERT INTO walletdb_kv (key, value, parent_id) VALUES($1, $2, $3) ON CONFLICT (key, parent_id) WHERE parent_id IS NOT NULL DO UPDATE SET value=$2 WHERE walletdb_kv.value IS NOT NULL
2023-09-18 20:47:31.061 UTC [57] ERROR:  could not serialize access due to read/write dependencies among transactions
2023-09-18 20:47:31.061 UTC [57] DETAIL:  Reason code: Canceled on identification as a pivot, during conflict out checking.
2023-09-18 20:47:31.061 UTC [57] HINT:  The transaction might succeed if retried.
2023-09-18 20:47:31.061 UTC [57] STATEMENT:  SELECT id FROM walletdb_kv WHERE parent_id=107 AND key=$1 AND value IS NULL

I remember that we had to add an application level lock for the wallet DB even for etcd due to a similar issue:

lnd/lncfg/db.go

Line 314 in 0730337

CloneWithSingleWriter(),

Though I'm not sure why that isn't a problem with SQLite...

Perhaps we do need that single writer lock just for the wallet (because it was even more built with only bbolt in mind)?

In this commit, we remove the global application level lock from the
postgres backend. This lock prevents multiple write transactions from
happening at the same time, and will also block a writer if a read is on
going. Since this lock was added, we know always open DB connections
with the strongest level of concurrency control available:
`LevelSerializable`. In concert with the new auto retry logic, we ensure
that if db transactions conflict (writing the same key/row in this
case), then the tx is retried automatically.

Removing this lock should increase perf for the postgres backend, as now
concurrent write transactions can proceed, being serialized as needed.
Rather then trying to handle concurrency at the application level, we'll
set postgres do its job, with the application only needing to retry as
necessary.
Some sub-systems like btcwallet will return an error from the database,
but they won't properly wrap it. As a result, we were unable to actually
catch the serialization errors in the first place. To work around this,
we'll now attempt to parse the error string directly.
With the new postgres concurrency control, an error may come from a
bucket function that's actually a postgres error. In this case, we need
to return early so we can retry the txn. Otherwise, we'll be working
with an aborted tx, and never actually return the error so we don't auto
retry.
In this commit, we fix a bug that would cause the entire db to shutdown
if hit a panic (since db operations in the main buckets exit with a
panic) while executing a txn call back. This might be a postgres error
we need to check, so we don't want to bail out, and instead want to pass
up the error to the caller so we can retry if needed.
At times we'll get an error from the transaction call back itself, since
we may be using postgres over streaming RPC. In this case, we still need
to roll back then attempt to retry.
In this commit, we fix a bug in the commit retry loop, we'll now make
sure that if we get an error on commit, we'll map it to the SQL error
then attempt to decide if we need to retry or not.
@Roasbeef Roasbeef force-pushed the remove-postgres-master-lock branch from e4e4b86 to c88e270 Compare September 20, 2023 01:22
@Roasbeef
Copy link
Member Author

Pushed up a series of new commits, fixing some issues:

  • btcwallet won't wrap db errors proeprly, so the calls to errors.As failed. We'll now just check for the string directly.
  • With the way the kvdb mapping works, certain calls (based on the interface) can never fail. This includes calls like Sequence, so it ends up panicking instead. When this happens, we now no longer always call Criticalf. Instead, we'll check to see what the error is, so we can pass it to the retry loop.
  • Some DB calls get an error, check it, then continue. We can't do that anymore as that might be a single that the txn is borked, and that we need to rollback+retry. I found one instance of this in the grpah for pruning so far and fixed it.
  • We'll now properly map the error to a SQL error on commit fail.
  • We didn't go to retry when the call back returned an error. For btcwallet and some other cases, we'll get the error on call back rather than on begin or commit. We now have a retry check here.

With this, everything starts up locally for me, and I was able to pass icase=basic_funding_flow. Checkpointing here for now to move on to some other more near term things.

Copy link
Collaborator

@guggero guggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first pass, looks pretty good. Though something's still not working correctly with the wallet. I'm not sure if we have to keep some sort of lock just for the wallet still? Because that code was never updated to not rely on the assumption of a global per-database lock.

kvdb/sqlbase/db.go Show resolved Hide resolved
sqldb/sqlerrors.go Show resolved Hide resolved
kvdb/sqlbase/sqlerrors_sqlite.go Show resolved Hide resolved
Copy link
Collaborator

@bhandras bhandras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Great we get rid of this lock finally. Will also do some stress testing before final approval.

kvdb/sqlbase/db.go Show resolved Hide resolved
kvdb/sqlbase/db.go Show resolved Hide resolved
@lightninglabs-deploy
Copy link

@Roasbeef, remember to re-request review from reviewers when ready

@guggero
Copy link
Collaborator

guggero commented Mar 7, 2024

Replaced by #8529.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants