-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: avoid unnecessary singleflight requests for role membership cache #135852
sql: avoid unnecessary singleflight requests for role membership cache #135852
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
eff7146
to
8225f6d
Compare
3790881
to
8d21326
Compare
Release note: None
This setting was added only as an escape hatch, and we've never needed to recommend using it. Since the single scan implementation is simpler anyway, removing the old code will make other refactors to this code easier. Release note (ops change): Removed the sql.auth.resolve_membership_single_scan.enabled cluster setting. This was added out of precaution in case it was necessary to revert back to the old behavior for looking up role memberships, but this escape hatch has never been needed in practice since this was added in v23.1.
8d21326
to
26daef2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work @rafiss . I only had one minor comment
Do we want to eventually consider back porting this or is the risk too high?
Reviewed 2 of 2 files at r2, 12 of 12 files at r9, 11 of 11 files at r10, 7 of 7 files at r11, 7 of 7 files at r12, 7 of 7 files at r13, 7 of 7 files at r14, 8 of 8 files at r15, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rafiss)
pkg/sql/authorization.go
line 873 at r11 (raw file):
var ok bool for ok, err = it.Next(ctx); ok; ok, err = it.Next(ctx) {
Are we missing error checking after the loop?
pkg/sql/authorization.go
line 622 at r15 (raw file):
return userMapping, nil } else { return nil, sqlerrors.NewUndefinedUserError(member)
I think this fine, though its a behaviour change. I think previously we would do avoid the cache to confirm a user didn't exist. In case there was any funny case where the lease manager didn't know of the new version because of range feed lag. However, the old behaviour seems less correct since the user commit may not have completed (waiting for one version).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @fqazi and @michae2)
pkg/sql/authorization.go
line 873 at r11 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Are we missing error checking after the loop?
i don't believe so; there's a return err
right after the loop. i can make it an explicit if err != nil
check though.
pkg/sql/authorization.go
line 622 at r15 (raw file):
I think this fine, though its a behaviour change. I think previously we would do avoid the cache to confirm a user didn't exist.
i don't think it's a behavior change. previously, we would never cache non-existence of a role. so if you tried to look up a role that didn't exist, it would be a cache miss that caused resolveMemberOfWithAdminOption
to be called, and that would do the RoleExists
check.
now, we don't need to do that, since we know that if the cache has any entries, it has entries for all existing users. this is because ac96718 changed the query that populates the cache.
so it's an optimization, but not a functional behavior change. perhaps that's already what you meant with your comment?
This new query ensures that users who have no parent roles appear in the query results. This will support a refactor that allows us to populate the entire cache all at once, rather than doing extra work for each user. Release note: None
This avoids the behavior of each goroutine doing the work of populating the cache, which was entirely wasteful. It also supports a future refactor to populate the whole cache all at once. Release note: None
Now that we return a row for every user that exists, regardless of whether the user has any parent roles, we don't need to make a separate query for role existence. Release note: None
The new function returns a flattened map of the role hierarchy for each role that exists. This is a refactor that will make populating the membership cache all at once easier. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to eventually consider back porting this or is the risk too high?
I do want to try a backport for this if it doesn't have many conflicts, since I think we should try to help with the linked escalation
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @fqazi and @michae2)
pkg/sql/authorization.go
line 622 at r15 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
I think this fine, though its a behaviour change. I think previously we would do avoid the cache to confirm a user didn't exist.
i don't think it's a behavior change. previously, we would never cache non-existence of a role. so if you tried to look up a role that didn't exist, it would be a cache miss that caused
resolveMemberOfWithAdminOption
to be called, and that would do theRoleExists
check.now, we don't need to do that, since we know that if the cache has any entries, it has entries for all existing users. this is because ac96718 changed the query that populates the cache.
so it's an optimization, but not a functional behavior change. perhaps that's already what you meant with your comment?
Hm, maybe you are right though. this test (link) failed with a user not found error (for a user that should have existed). so you are saying that the cache invalidation may not happen in time for this to work correctly?
/var/lib/engflow/worker/work/0/exec/bazel-out/k8-fastbuild/bin/pkg/ccl/backupccl/backupccl_test_/backupccl_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/backupccl/testdata/backup-restore/descriptor-conflicts:67:
SELECT count(1) FROM [SHOW TABLES FROM system] WHERE table_name LIKE 'crdb_internal_copy_%';
expected:
4
found:
pq: role/user "hamburger" does not exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rafiss)
pkg/sql/authorization.go
line 873 at r11 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i don't believe so; there's a
return err
right after the loop. i can make it an explicitif err != nil
check though.
oh I was wondering about the case where it.Next returns an error.
pkg/sql/authorization.go
line 622 at r15 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
Hm, maybe you are right though. this test failed with a user not found error (for a user that should have existed). so you are saying that the cache invalidation may not happen in time for this to work correctly?
/var/lib/engflow/worker/work/0/exec/bazel-out/k8-fastbuild/bin/pkg/ccl/backupccl/backupccl_test_/backupccl_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/backupccl/testdata/backup-restore/descriptor-conflicts:67: SELECT count(1) FROM [SHOW TABLES FROM system] WHERE table_name LIKE 'crdb_internal_copy_%'; expected: 4 found: pq: role/user "hamburger" does not exist
I wonder if we just need a WaitForOneVersion in there 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @fqazi and @michae2)
pkg/sql/authorization.go
line 873 at r11 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
oh I was wondering about the case where it.Next returns an error.
yeah, if it.Next
returns an error, the loop will exit, and the line right after the loop ends says return err
.
pkg/sql/authorization.go
line 622 at r15 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
I wonder if we just need a WaitForOneVersion in there 🤔
where are you referring to? we already have a WaitForOneVersion
which is called after running create/drop/grant/revoke statements, here:
cockroach/pkg/sql/conn_executor.go
Line 4037 in 2b30f8e
if err := ex.waitOneVersionForNewVersionDescriptorsWithoutJobs(descIDsInJobs); err != nil { |
i think the issue in this case might just be that the restore code inserts directly into system.users, which will skip the table version bump logic. let me verify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rafiss)
pkg/sql/authorization.go
line 622 at r15 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
where are you referring to? we already have a
WaitForOneVersion
which is called after running create/drop/grant/revoke statements, here:cockroach/pkg/sql/conn_executor.go
Line 4037 in 2b30f8e
if err := ex.waitOneVersionForNewVersionDescriptorsWithoutJobs(descIDsInJobs); err != nil { i think the issue in this case might just be that the restore code inserts directly into system.users, which will skip the table version bump logic. let me verify this.
Yeah, I was wondering about the restore code, we might just need to add the wait for one version in it.
26daef2
to
e39e87b
Compare
tftr! bors r+ |
135034: sql: disable gossip-based physical planning by default r=yuzefovich a=yuzefovich Now that we improved handling of draining procedure in the virtual clusters in 24.3 timeframe, I believe it's time to begin fully deprecating the gossip-based physical planning, and as the first step this commit introduces a cluster setting to control which method is used in single-tenant deployments (instance-based planning being the default). I plan to have the setting as an escape hatch in case we find problems when rolling out this change and to remove the setting altogether after 25.2. One of the differences between two planning methods is slightly adjusted in this commit. In particular, in the instance-based planning we stitch together consecutive spans whenever `EnsureSafeSplitKey` for the keys in the spans returns the same prefix. The idea is that spans corresponding to different column families of the same SQL row must be placed on the same instance. However, the condition on the length of the keys returned by `EnsureSafeSplitKey` is too broad and captures more cases than needed (i.e. two keys that cannot be part of the same SQL row end up in the same `SpanPartition`). To partially alleviate this difference, this commit introduces "boundary granularity" knob which indicates whether consecutive spans _might_ be part of the same SQL row, and we now use the stitching logic only if so. All callers of `PartitionSpans` have been audited, and we need the stitching logic only in two call sites that correspond to planning of table readers. Longer term, if we change the encoding so that column family IDs are encoded in a special way, we'll be able to tell definitively whether stitching is needed simply based on the key itself, and this additional logic could be removed. Epic: None Release note: None 135852: sql: avoid unnecessary singleflight requests for role membership cache r=rafiss a=rafiss See individual commits for refactors that make the enhancement easier. --- Since we use a query that scans the whole role_members table to populate the membership cache, it is wasteful to launch multiple singleflight requests for different users. This patch changes the cache refresh logic of the singleflight so that it always populates the entire cache, which means there will only ever be one in-flight query per each node. ``` $ benchdiff --old 98b0521 --new 26daef2 . -r BenchmarkUseManyRoles pkg=1/1 iter=10/10 cockroachdb/cockroach/pkg/bench/rttanalysis \ name old time/op new time/op delta UseManyRoles/use_50_roles-10 225ms ± 6% 114ms ± 2% -49.43% (p=0.000 n=10+10) UseManyRoles/use_2_roles-10 10.7ms ± 4% 7.4ms ± 3% -30.74% (p=0.000 n=10+8) name old roundtrips new roundtrips delta UseManyRoles/use_2_roles-10 9.00 ± 0% 5.00 ± 0% -44.44% (p=0.000 n=10+10) UseManyRoles/use_50_roles-10 3.00 ± 0% 3.00 ± 0% ~ (all equal) name old alloc/op new alloc/op delta UseManyRoles/use_50_roles-10 438MB ± 0% 377MB ± 1% -13.88% (p=0.000 n=9+10) UseManyRoles/use_2_roles-10 20.7MB ± 2% 19.1MB ± 1% -7.46% (p=0.000 n=10+10) name old allocs/op new allocs/op delta UseManyRoles/use_50_roles-10 881k ± 0% 398k ± 4% -54.81% (p=0.000 n=8+10) UseManyRoles/use_2_roles-10 54.4k ± 8% 36.0k ± 5% -33.85% (p=0.000 n=10+10) ``` fixes #135931 Release note (performance improvement): Improved the internal caching logic for role membership information. This reduces the latency impact of commands such as `DROP ROLE`, `CREATE ROLE`, and `GRANT role TO user`, which cause the role membership cache to be invalidated. 136115: roachtest: rm vmod logging in rebalance/by-load/* tests r=arulajmani a=kvoli Historically, the `rebalance/by-load/*` roachtests were difficult to track failure down for. To aid investigation, we bumped the vmodule verbosity to the maximum level in relevant test files. The default logging and metrics have since improved. We have also noticed that while vmodule is enabled, there can be skewed resource usage as a result which is not observed by the replica. Disable the verbose logging, we will rely on the defaults for future failures. Note, not all failures linked below were due to vmodule logging. However, this change will remove some degree of noise. Part of: #133054 Part of: #132633 Part of: #135055 Part of: #135869 Part of: #135811 Part of: #133004 Part of: #135791 Part of: #132019 Release note: None 136116: roachtest: rm redundant cpu profiling in rebalance/by-load tests r=arulajmani a=kvoli After #97699 was resolved, the cpu profiling parameter declaration in the `rebalance/by-load/*` roachtests became redundant. Remove the declaration and use the default settings instead. Informs: #133054 Informs: #132633 Informs: #135055 Informs: #135869 Informs: #135811 Informs: #133004 Informs: #135791 Informs: #132019 Release note: None Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Austen McClernon <[email protected]>
Build failed (retrying...): |
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
138137: vecindex: combine split and merge fixups r=drewkimball a=andy-kimball Create a single combined fixup type for splits and merges. Determine whether to split or merge based on the size of the partition. Epic: CRDB-42943 Release note: None 138166: backup: only invalidate user cache for system restores r=jeffswenson a=jeffswenson As of #135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: #138010 138226: kverver: deflake TestRangefeedCheckpointsRecoverFromLeaseExpiration r=miraradeva a=miraradeva When running with leader leases, it's possible that the nudge to check the old lease status takes longer if the old leader still has store liveness support and hasn't stepped down yet (and lost the lease). This commit adds a SucceedsSoon around the expected nudge assertion. Fixes: #137917 Release note: None Co-authored-by: Andrew Kimball <[email protected]> Co-authored-by: Jeff Swenson <[email protected]> Co-authored-by: Mira Radeva <[email protected]>
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
As of cockroachdb#135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore. Release Note: none Part of: cockroachdb#138010
See individual commits for refactors that make the enhancement easier.
Since we use a query that scans the whole role_members table to
populate the membership cache, it is wasteful to launch multiple
singleflight requests for different users. This patch changes
the cache refresh logic of the singleflight so that it always populates
the entire cache, which means there will only ever be one in-flight
query per each node.
fixes #135931
Release note (performance improvement): Improved the internal caching
logic for role membership information. This reduces the latency impact
of commands such as
DROP ROLE
,CREATE ROLE
, andGRANT role TO user
,which cause the role membership cache to be invalidated.