-
Notifications
You must be signed in to change notification settings - Fork 695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Release RowExclusiveLock on pg_dist_transaction as soon as remote xacts are recovered #7863
Conversation
|
Given the presence of upgrade script on versions >= 11.3, this needs to be backported at least to versions 11.3, 12.0 and 12.1. |
Running in my local environment with PG 15.7
With PG15.6 and PG15.7 the same test (
Similar results when running |
Just fixed one of those tests in the latest commit. |
It's quite strange that we have this diff: (Note that we should consider / compare with upgrade_pg_dist_cleanup_after_0.out when investigating this test failure, not upgrade_pg_dist_cleanup_after.out one) --- a/src/test/regress/expected/upgrade_pg_dist_cleanup_after_0.out
+++ b/src/test/regress/expected/upgrade_pg_dist_cleanup_after_0.out
@@ -20,11 +20,9 @@ SELECT COUNT(*) FROM pg_dist_placement WHERE shardid IN (SELECT shardid FROM pg_
(1 row)
SELECT * FROM pg_dist_cleanup;
- record_id | operation_id | object_type | object_name | node_group_id | policy_type
+ record_id | operation_id | object_type | object_name | node_group_id | policy_type
---------------------------------------------------------------------
- 1 | 0 | 1 | table_with_orphaned_shards_980001 | 1 | 0
-(1 row)
+(0 rows)
CALL citus_cleanup_orphaned_resources();
-NOTICE: cleaned up 1 orphaned resources
DROP TABLE table_with_orphaned_shards;
(regress) 22.04: $ We should already be registering the orphaned shards into pg_dist_cleanup in citus--11.1-1--11.2-1.sql: -- drop orphaned shards after inserting records for them into pg_dist_cleanup
INSERT INTO pg_dist_cleanup
SELECT nextval('pg_dist_cleanup_recordid_seq'), 0, 1, shard_name(sh.logicalrelid, sh.shardid) AS object_name, plc.groupid AS node_group_id, 0
FROM pg_dist_placement plc
JOIN pg_dist_shard sh ON sh.shardid = plc.shardid
WHERE plc.shardstate = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taking my approval back until we figure out what to do with isolation test failures
…ts are recovered As of this commit, after recovering the remote transactions, now we release the lock on pg_dist_transaction while closing it to avoid deadlocks that might occur because of trying to acquire a lock on pg_dist_authinfo while holding a lock on pg_dist_transaction. Such a scenario can only cause a deadlock if another transaction is trying to acquire a strong lock on pg_dist_transaction while holding a lock on pg_dist_authinfo. As of today, we (implicitly) acquire a strong lock on pg_dist_transaction only when upgrading Citus to 11.3-1 and this happens when creating a REPLICA IDENTITY on pg_dist_transaction. And regardless of the code-path we are in, it should be okay to release the lock there because all we do after that point is to abort the prepared transactions that are not part of an in-progress distributed transaction and releasing the lock before doing so should be just fine. This also changes the blocking behavior between citus_create_restore_point and the transaction recovery code-path in the sense that now citus_create_restore_point doesn't until transaction recovery completes aborting the prepared transactions that are not part of an in-progress distributed transaction. However, this should be fine because even before this was possible, e.g., if transaction recovery fails to open a remote connection to a node.
1d9d384
to
cf49a07
Compare
.. as documented in actions/upload-artifact#480.
cf49a07
to
0d4c676
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## release-13.0 #7863 +/- ##
================================================
+ Coverage 89.48% 89.49% +0.01%
================================================
Files 276 276
Lines 60063 60067 +4
Branches 7524 7524
================================================
+ Hits 53747 53758 +11
+ Misses 4166 4160 -6
+ Partials 2150 2149 -1 |
* transactions that are not part of an in-progress distributed transaction and | ||
* releasing the lock before doing so should be just fine. | ||
*/ | ||
table_close(pgDistTransaction, RowExclusiveLock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowing DDL commands to go through unencumbered before this transaction commits seems a bit risky / undefined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the heads-up @marcoslot! We'll certainly re-think this before moving forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowing DDL commands to go through unencumbered before this transaction commits seems a bit risky / undefined.
@marcoslot, do you mean there is a risk of partially committing propagated DDL commands with this change? We would appreciate it tremendously if you elaborate on the risk and provide the scenario of concern for our testing. Thanks a lot.
DESCRIPTION: Fixes a bug that might cause a deadlock when upgrading Citus.
As of this PR, after recovering the remote transactions, now we release the lock
on pg_dist_transaction while closing it to avoid deadlocks that might occur because
of trying to acquire a lock on pg_dist_authinfo while holding a lock on
pg_dist_transaction. Such a scenario can only cause a deadlock if another transaction
is trying to acquire a strong lock on pg_dist_transaction while holding a lock on
pg_dist_authinfo. As of today, we (implicitly) acquire a strong lock on
pg_dist_transaction only when upgrading Citus to 11.3-1 and this happens when creating
a REPLICA IDENTITY on pg_dist_transaction.
And regardless of the code-path we are in, it should be okay to release the lock there
because all we do after that point is to abort the prepared transactions that are not
part of an in-progress distributed transaction and releasing the lock before doing so
should be just fine.
This also changes the blocking behavior between citus_create_restore_point and the
transaction recovery code-path in the sense that now citus_create_restore_point doesn't
until transaction recovery completes aborting the prepared transactions that are not
part of an in-progress distributed transaction. However, this should be fine because
even before this was possible, e.g., if transaction recovery fails to open a remote
connection to a node.
Also fix a flaky citus upgrade test.
Finally, upgrade download-artifacts action to 4.1.8 and upload-artifacts action to 4.6.0
since v3.x.y is no longer supported for both actions. And while doing that, had to do
some changes to distinguish the artifacts we produce in CI jobs as documented in
actions/upload-artifact#480.
After merging this PR;