Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: avoid unnecessary singleflight requests for role membership cache #135852

Merged
merged 7 commits into from
Nov 25, 2024

Conversation

rafiss
Copy link
Collaborator

@rafiss rafiss commented Nov 20, 2024

See individual commits for refactors that make the enhancement easier.


Since we use a query that scans the whole role_members table to
populate the membership cache, it is wasteful to launch multiple
singleflight requests for different users. This patch changes
the cache refresh logic of the singleflight so that it always populates
the entire cache, which means there will only ever be one in-flight
query per each node.

$ benchdiff --old 98b0521cb18 --new 26daef27780 . -r BenchmarkUseManyRoles

pkg=1/1 iter=10/10 cockroachdb/cockroach/pkg/bench/rttanalysis \

name                          old time/op     new time/op     delta
UseManyRoles/use_50_roles-10      225ms ± 6%      114ms ± 2%  -49.43%  (p=0.000 n=10+10)
UseManyRoles/use_2_roles-10      10.7ms ± 4%      7.4ms ± 3%  -30.74%  (p=0.000 n=10+8)

name                          old roundtrips  new roundtrips  delta
UseManyRoles/use_2_roles-10        9.00 ± 0%       5.00 ± 0%  -44.44%  (p=0.000 n=10+10)
UseManyRoles/use_50_roles-10       3.00 ± 0%       3.00 ± 0%     ~     (all equal)

name                          old alloc/op    new alloc/op    delta
UseManyRoles/use_50_roles-10      438MB ± 0%      377MB ± 1%  -13.88%  (p=0.000 n=9+10)
UseManyRoles/use_2_roles-10      20.7MB ± 2%     19.1MB ± 1%   -7.46%  (p=0.000 n=10+10)

name                          old allocs/op   new allocs/op   delta
UseManyRoles/use_50_roles-10       881k ± 0%       398k ± 4%  -54.81%  (p=0.000 n=8+10)
UseManyRoles/use_2_roles-10       54.4k ± 8%      36.0k ± 5%  -33.85%  (p=0.000 n=10+10)

fixes #135931
Release note (performance improvement): Improved the internal caching
logic for role membership information. This reduces the latency impact
of commands such as DROP ROLE, CREATE ROLE, and GRANT role TO user,
which cause the role membership cache to be invalidated.

Copy link

blathers-crl bot commented Nov 20, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@rafiss rafiss force-pushed the improve-singleflight-role-cache branch 3 times, most recently from eff7146 to 8225f6d Compare November 22, 2024 21:11
@rafiss
Copy link
Collaborator Author

rafiss commented Nov 23, 2024

I created a simple read-only workload that runs concurrent queries with a thousand different users, and ran it on a 3 node cluster.

Click to see the workload script
package main

import (
	"context"
	"fmt"
	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
	"golang.org/x/sync/errgroup"
	"math/rand"
	"strconv"
	"strings"
)

func main() {
	u := "<node 1 connection string redacted>"
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, u)
	if err != nil {
		panic(err)
	}

	fmt.Println("Creating table ...")

	_, err = conn.Exec(ctx, "DROP TABLE IF EXISTS tab")
	if err != nil {
		panic(err)
	}
	_, err = conn.Exec(ctx, "CREATE TABLE IF NOT EXISTS tab (a INT PRIMARY KEY)")
	if err != nil {
		panic(err)
	}
	_, err = conn.Exec(ctx, "INSERT INTO tab VALUES (1)")
	if err != nil {
		panic(err)
	}

	fmt.Println("Creating roles ...")

	_, err = conn.Exec(ctx, "CREATE ROLE IF NOT EXISTS parent_role")
	if err != nil {
		panic(err)
	}

	_, err = conn.Exec(ctx, "GRANT ALL ON TABLE tab TO parent_role")
	if err != nil {
		panic(err)
	}

	roles := make([]string, 0, 1000)
	for i := 0; i < 1000; i++ {
		roleName := "role" + strconv.Itoa(i)
		_, err = conn.Exec(ctx, "CREATE ROLE IF NOT EXISTS "+roleName)
		if err != nil {
			panic(err)
		}
		roles = append(roles, roleName)

		if i%50 == 0 {
			fmt.Println("Created role", i)
		}
	}

	_, err = conn.Exec(ctx, "GRANT parent_role TO "+strings.Join(roles, ", "))
	if err != nil {
		panic(err)
	}

	urls := []string{
		"<node 1 connection string redacted>",
		"<node 2 connection string redacted>",
		"<node 3 connection string redacted>",
	}
	pools := make([]*pgxpool.Pool, len(urls))

	fmt.Println("Creating pools ...")
	for i, u := range urls {
		cfg, err := pgxpool.ParseConfig(u)
		if err != nil {
			panic(err)
		}
		cfg.MaxConns = 500
		pool, err := pgxpool.NewWithConfig(ctx, cfg)
		if err != nil {
			panic(err)
		}
		pools[i] = pool
	}

	g, _ := errgroup.WithContext(ctx)
	g.SetLimit(1000)

	fmt.Println("Sending load ...")

	i := 0
	for {
		pool := pools[rand.Intn(len(pools))]
		roleID := i % 1000
		g.Go(func() error {
			conn, err := pool.Acquire(ctx)
			if err != nil {
				return err
			}
			defer conn.Release()

			_, err = conn.Exec(ctx, "SET ROLE role"+strconv.Itoa(roleID))
			if err != nil {
				return err
			}
			var res int
			err = conn.QueryRow(ctx, "SELECT * FROM tab LIMIT 1", pgx.QueryExecModeSimpleProtocol).Scan(&res)
			if err != nil {
				return err
			}

			return nil
		})
		if i%100 == 0 {
			fmt.Println("Sent load", i)
		}
		i++
	}
}

Before this PR, CREATE ROLE and DROP ROLE caused a spike on the 99.9th percentile latency and also the rate of internal queries.

root@localhost:26257/defaultdb> select timestamp, "eventType" from system.eventlog where timestamp > '2024-11-22 23:07'::timestamptz and timestamp < '2024-11-22 23:11'::timestamptz
                             -> order by timestamp asc;
          timestamp          |  eventType
-----------------------------+--------------
  2024-11-22 23:08:57.942819 | create_role
  2024-11-22 23:09:27.021219 | drop_role
image image

With larger node counts and greater numbers of concurrent active SQL users, the impact would be greater.


With this PR, there was no noticeable impact.

root@localhost:26257/defaultdb> select timestamp, "eventType" from system.eventlog where timestamp > '2024-11-22 23:53'::timestamptz and timestamp < '2024-11-22 23:57'::timestamptz
                             -> order by timestamp asc;
          timestamp          |  eventType
-----------------------------+--------------
  2024-11-22 23:54:47.286715 | create_role
  2024-11-22 23:55:29.026177 | drop_role
image image

To be absolutely certain this will help with the linked escalation, we would need to run this test on a 48 node cluster and with even more concurrent active users. But even without that larger test, it seems quite likely that this PR will help substantially.

@rafiss rafiss force-pushed the improve-singleflight-role-cache branch 2 times, most recently from 3790881 to 8d21326 Compare November 23, 2024 00:39
@rafiss rafiss marked this pull request as ready for review November 23, 2024 00:39
@rafiss rafiss requested review from a team as code owners November 23, 2024 00:39
@rafiss rafiss requested review from michae2 and fqazi and removed request for a team November 23, 2024 00:39
This setting was added only as an escape hatch, and we've never needed
to recommend using it. Since the single scan implementation is simpler
anyway, removing the old code will make other refactors to this code
easier.

Release note (ops change): Removed the
sql.auth.resolve_membership_single_scan.enabled cluster setting. This
was added out of precaution in case it was necessary to revert back to
the old behavior for looking up role memberships, but this escape hatch
has never been needed in practice since this was added in v23.1.
@rafiss rafiss force-pushed the improve-singleflight-role-cache branch from 8d21326 to 26daef2 Compare November 23, 2024 07:06
Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work @rafiss :lgtm_strong: . I only had one minor comment

Do we want to eventually consider back porting this or is the risk too high?

Reviewed 2 of 2 files at r2, 12 of 12 files at r9, 11 of 11 files at r10, 7 of 7 files at r11, 7 of 7 files at r12, 7 of 7 files at r13, 7 of 7 files at r14, 8 of 8 files at r15, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rafiss)


pkg/sql/authorization.go line 873 at r11 (raw file):

	var ok bool
	for ok, err = it.Next(ctx); ok; ok, err = it.Next(ctx) {

Are we missing error checking after the loop?


pkg/sql/authorization.go line 622 at r15 (raw file):

			return userMapping, nil
		} else {
			return nil, sqlerrors.NewUndefinedUserError(member)

I think this fine, though its a behaviour change. I think previously we would do avoid the cache to confirm a user didn't exist. In case there was any funny case where the lease manager didn't know of the new version because of range feed lag. However, the old behaviour seems less correct since the user commit may not have completed (waiting for one version).

Copy link
Collaborator Author

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @fqazi and @michae2)


pkg/sql/authorization.go line 873 at r11 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Are we missing error checking after the loop?

i don't believe so; there's a return err right after the loop. i can make it an explicit if err != nil check though.


pkg/sql/authorization.go line 622 at r15 (raw file):

I think this fine, though its a behaviour change. I think previously we would do avoid the cache to confirm a user didn't exist.

i don't think it's a behavior change. previously, we would never cache non-existence of a role. so if you tried to look up a role that didn't exist, it would be a cache miss that caused resolveMemberOfWithAdminOption to be called, and that would do the RoleExists check.

now, we don't need to do that, since we know that if the cache has any entries, it has entries for all existing users. this is because ac96718 changed the query that populates the cache.

so it's an optimization, but not a functional behavior change. perhaps that's already what you meant with your comment?

This new query ensures that users who have no parent roles appear in the
query results. This will support a refactor that allows us to populate
the entire cache all at once, rather than doing extra work for each
user.

Release note: None
This avoids the behavior of each goroutine doing the work of populating
the cache, which was entirely wasteful. It also supports a future
refactor to populate the whole cache all at once.

Release note: None
Now that we return a row for every user that exists, regardless of
whether the user has any parent roles, we don't need to make a separate
query for role existence.

Release note: None
The new function returns a flattened map of the role hierarchy for each
role that exists.

This is a refactor that will make populating the membership cache all at
once easier.

Release note: None
Copy link
Collaborator Author

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to eventually consider back porting this or is the risk too high?

I do want to try a backport for this if it doesn't have many conflicts, since I think we should try to help with the linked escalation

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @fqazi and @michae2)


pkg/sql/authorization.go line 622 at r15 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

I think this fine, though its a behaviour change. I think previously we would do avoid the cache to confirm a user didn't exist.

i don't think it's a behavior change. previously, we would never cache non-existence of a role. so if you tried to look up a role that didn't exist, it would be a cache miss that caused resolveMemberOfWithAdminOption to be called, and that would do the RoleExists check.

now, we don't need to do that, since we know that if the cache has any entries, it has entries for all existing users. this is because ac96718 changed the query that populates the cache.

so it's an optimization, but not a functional behavior change. perhaps that's already what you meant with your comment?

Hm, maybe you are right though. this test (link) failed with a user not found error (for a user that should have existed). so you are saying that the cache invalidation may not happen in time for this to work correctly?

        /var/lib/engflow/worker/work/0/exec/bazel-out/k8-fastbuild/bin/pkg/ccl/backupccl/backupccl_test_/backupccl_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/backupccl/testdata/backup-restore/descriptor-conflicts:67:
         SELECT count(1) FROM [SHOW TABLES FROM system] WHERE table_name LIKE 'crdb_internal_copy_%';
        expected:
        4
        
        found:
        pq: role/user "hamburger" does not exist

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rafiss)


pkg/sql/authorization.go line 873 at r11 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i don't believe so; there's a return err right after the loop. i can make it an explicit if err != nil check though.

oh I was wondering about the case where it.Next returns an error.


pkg/sql/authorization.go line 622 at r15 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

Hm, maybe you are right though. this test failed with a user not found error (for a user that should have existed). so you are saying that the cache invalidation may not happen in time for this to work correctly?

        /var/lib/engflow/worker/work/0/exec/bazel-out/k8-fastbuild/bin/pkg/ccl/backupccl/backupccl_test_/backupccl_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/backupccl/testdata/backup-restore/descriptor-conflicts:67:
         SELECT count(1) FROM [SHOW TABLES FROM system] WHERE table_name LIKE 'crdb_internal_copy_%';
        expected:
        4
        
        found:
        pq: role/user "hamburger" does not exist

I wonder if we just need a WaitForOneVersion in there 🤔

Copy link
Collaborator Author

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @fqazi and @michae2)


pkg/sql/authorization.go line 873 at r11 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

oh I was wondering about the case where it.Next returns an error.

yeah, if it.Next returns an error, the loop will exit, and the line right after the loop ends says return err.


pkg/sql/authorization.go line 622 at r15 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

I wonder if we just need a WaitForOneVersion in there 🤔

where are you referring to? we already have a WaitForOneVersion which is called after running create/drop/grant/revoke statements, here:

if err := ex.waitOneVersionForNewVersionDescriptorsWithoutJobs(descIDsInJobs); err != nil {

i think the issue in this case might just be that the restore code inserts directly into system.users, which will skip the table version bump logic. let me verify this.

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rafiss)


pkg/sql/authorization.go line 622 at r15 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

where are you referring to? we already have a WaitForOneVersion which is called after running create/drop/grant/revoke statements, here:

if err := ex.waitOneVersionForNewVersionDescriptorsWithoutJobs(descIDsInJobs); err != nil {

i think the issue in this case might just be that the restore code inserts directly into system.users, which will skip the table version bump logic. let me verify this.

Yeah, I was wondering about the restore code, we might just need to add the wait for one version in it.

@rafiss rafiss force-pushed the improve-singleflight-role-cache branch from 26daef2 to e39e87b Compare November 25, 2024 18:41
@rafiss rafiss requested review from a team as code owners November 25, 2024 18:41
@rafiss
Copy link
Collaborator Author

rafiss commented Nov 25, 2024

tftr!

bors r+

craig bot pushed a commit that referenced this pull request Nov 25, 2024
135034: sql: disable gossip-based physical planning by default r=yuzefovich a=yuzefovich

Now that we improved handling of draining procedure in the virtual clusters in 24.3 timeframe, I believe it's time to begin fully deprecating the gossip-based physical planning, and as the first step this commit introduces a cluster setting to control which method is used in single-tenant deployments (instance-based planning being the default). I plan to have the setting as an escape hatch in case we find problems when rolling out this change and to remove the setting altogether after 25.2.

One of the differences between two planning methods is slightly adjusted in this commit. In particular, in the instance-based planning we stitch together consecutive spans whenever `EnsureSafeSplitKey` for the keys in the spans returns the same prefix. The idea is that spans corresponding to different column families of the same SQL row must be placed on the same instance. However, the condition on the length of the keys returned by `EnsureSafeSplitKey` is too broad and captures more cases than needed (i.e. two keys that cannot be part of the same SQL row end up in the same `SpanPartition`). To partially alleviate this difference, this commit introduces "boundary granularity" knob which indicates whether consecutive spans _might_ be part of the same SQL row, and we now use the stitching logic only if so. All callers of `PartitionSpans` have been audited, and we need the stitching logic only in two call sites that correspond to planning of table readers. Longer term, if we change the encoding so that column family IDs are encoded in a special way, we'll be able to tell definitively whether stitching is needed simply based on the key itself, and this additional logic could be removed.

Epic: None

Release note: None

135852: sql: avoid unnecessary singleflight requests for role membership cache r=rafiss a=rafiss

See individual commits for refactors that make the enhancement easier.

---

Since we use a query that scans the whole role_members table to
populate the membership cache, it is wasteful to launch multiple
singleflight requests for different users. This patch changes
the cache refresh logic of the singleflight so that it always populates
the entire cache, which means there will only ever be one in-flight
query per each node.

```
$ benchdiff --old 98b0521 --new 26daef2 . -r BenchmarkUseManyRoles

pkg=1/1 iter=10/10 cockroachdb/cockroach/pkg/bench/rttanalysis \

name                          old time/op     new time/op     delta
UseManyRoles/use_50_roles-10      225ms ± 6%      114ms ± 2%  -49.43%  (p=0.000 n=10+10)
UseManyRoles/use_2_roles-10      10.7ms ± 4%      7.4ms ± 3%  -30.74%  (p=0.000 n=10+8)

name                          old roundtrips  new roundtrips  delta
UseManyRoles/use_2_roles-10        9.00 ± 0%       5.00 ± 0%  -44.44%  (p=0.000 n=10+10)
UseManyRoles/use_50_roles-10       3.00 ± 0%       3.00 ± 0%     ~     (all equal)

name                          old alloc/op    new alloc/op    delta
UseManyRoles/use_50_roles-10      438MB ± 0%      377MB ± 1%  -13.88%  (p=0.000 n=9+10)
UseManyRoles/use_2_roles-10      20.7MB ± 2%     19.1MB ± 1%   -7.46%  (p=0.000 n=10+10)

name                          old allocs/op   new allocs/op   delta
UseManyRoles/use_50_roles-10       881k ± 0%       398k ± 4%  -54.81%  (p=0.000 n=8+10)
UseManyRoles/use_2_roles-10       54.4k ± 8%      36.0k ± 5%  -33.85%  (p=0.000 n=10+10)
```

fixes #135931
Release note (performance improvement): Improved the internal caching
logic for role membership information. This reduces the latency impact
of commands such as `DROP ROLE`, `CREATE ROLE`, and `GRANT role TO user`,
which cause the role membership cache to be invalidated.

136115: roachtest: rm vmod logging in rebalance/by-load/* tests r=arulajmani a=kvoli

Historically, the `rebalance/by-load/*` roachtests were difficult to track failure down for. To aid investigation, we bumped the vmodule verbosity to the maximum level in relevant test files.

The default logging and metrics have since improved. We have also noticed that while vmodule is enabled, there can be skewed resource usage as a result which is not observed by the replica.

Disable the verbose logging, we will rely on the defaults for future failures.

Note, not all failures linked below were due to vmodule logging. However, this change will remove some degree of noise.

Part of: #133054
Part of: #132633
Part of: #135055
Part of: #135869
Part of: #135811
Part of: #133004
Part of: #135791
Part of: #132019

Release note: None

136116: roachtest: rm redundant cpu profiling in rebalance/by-load tests r=arulajmani a=kvoli

After #97699 was resolved, the cpu profiling parameter declaration in the `rebalance/by-load/*` roachtests became redundant.

Remove the declaration and use the default settings instead.

Informs: #133054
Informs: #132633
Informs: #135055
Informs: #135869
Informs: #135811
Informs: #133004
Informs: #135791
Informs: #132019

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
@craig
Copy link
Contributor

craig bot commented Nov 25, 2024

Build failed (retrying...):

@craig craig bot merged commit 027f790 into cockroachdb:master Nov 25, 2024
22 of 23 checks passed
@rafiss rafiss deleted the improve-singleflight-role-cache branch November 25, 2024 21:38
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 2, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 2, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 2, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 2, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 2, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 6, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
craig bot pushed a commit that referenced this pull request Jan 6, 2025
138137: vecindex: combine split and merge fixups r=drewkimball a=andy-kimball

Create a single combined fixup type for splits and merges. Determine whether to split or merge based on the size of the partition.

Epic: CRDB-42943

Release note: None

138166: backup: only invalidate user cache for system restores r=jeffswenson a=jeffswenson

As of #135852, restore increments the generation of the user role table in order to invalidate the user cache. This step isn't needed for table or database level restores. Skipping the cache invalidation is desirable because it prevents an unrelated long running transaction from blocking the completion of the restore.

Release Note: none
Part of: #138010

138226: kverver: deflake TestRangefeedCheckpointsRecoverFromLeaseExpiration r=miraradeva a=miraradeva

When running with leader leases, it's possible that the nudge to check the old lease status takes longer if the old leader still has store liveness support and hasn't stepped down yet (and lost the lease).

This commit adds a SucceedsSoon around the expected nudge assertion.

Fixes: #137917

Release note: None

Co-authored-by: Andrew Kimball <[email protected]>
Co-authored-by: Jeff Swenson <[email protected]>
Co-authored-by: Mira Radeva <[email protected]>
andy-kimball pushed a commit to andy-kimball/cockroach that referenced this pull request Jan 7, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 17, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 17, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 17, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 17, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 17, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 17, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 17, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
jeffswenson added a commit to jeffswenson/cockroach that referenced this pull request Jan 17, 2025
As of cockroachdb#135852, restore increments the generation of the user role table
in order to invalidate the user cache. This step isn't needed for table
or database level restores. Skipping the cache invalidation is desirable
because it prevents an unrelated long running transaction from blocking
the completion of the restore.

Release Note: none
Part of: cockroachdb#138010
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: role membership cache refresh logic can cause a thundering herd of queries to system.role_members
3 participants