Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: role membership cache refresh logic can cause a thundering herd of queries to system.role_members #135931

Open
rafiss opened this issue Nov 21, 2024 · 1 comment · May be fixed by #135852
Open
Assignees
Labels
branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-1 Issues/test failures with a fix SLA of 1 month T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@rafiss
Copy link
Collaborator

rafiss commented Nov 21, 2024

Describe the problem

When the role membership cache is invalidated, each node will need to rebuild the cache. Currently, each user whose membership is being looked up will cause a query on the system.role_members table. This can be seen by looking at the requestKey of the singleflight operation here:

fmt.Sprintf("%s-%d", member.Normalized(), tableVersion),

This is wasteful. Each of those queries is already scanning all the data from system.role_members, so we should change this so that the cache only gets populated once per node. This will change the number of internal queries needed to repopulate the cache to be O(node count) rather than O((node count) * (active user count)).

To Reproduce
The cache is invalidated by CREATE ROLE, DROP ROLE, and GRANT role statements, so this problem is easy to reproduce.

See the escalation https://github.com/cockroachlabs/support/issues/3128 for an example.

Expected behavior
Don't have such a big latency hit.

Additional data / screenshots
See the linked ticket.

Environment:
This was reported on CockroachDB version v24.1.6, but all supported versions are affected.

Jira issue: CRDB-44792

Epic CRDB-43310

@rafiss rafiss added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 labels Nov 21, 2024
@exalate-issue-sync exalate-issue-sync bot added the P-1 Issues/test failures with a fix SLA of 1 month label Nov 21, 2024
@rafiss rafiss added the T-server-and-security DB Server & Security label Nov 22, 2024
Copy link

blathers-crl bot commented Nov 22, 2024

This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@rafiss rafiss removed the T-server-and-security DB Server & Security label Nov 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-1 Issues/test failures with a fix SLA of 1 month T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant