Improve TopoServer Performance and Efficiency For Keyspace Shards #15047

mattlord · 2024-01-26T16:40:21Z

Description

There are various cases (e.g. when working with VReplication workflows) where we get all of the [serving] shards in a keyspace. To do this we were first getting a list of all shard names, then getting each shard record serially. This resulted in many topo server calls, and when the topo server has high latency it could cause various commands to time out and have knock-on effects on other things due to long-running reads blocking other operations (especially with older etcd versions).

For example, if you had a keyspace with 128 shards then you would make 129 topo server calls when getting all [serving] shards:

Get all shard names (1)
Get each shard record serially (128)

With this PR you go from 129 topo server calls in this example down to 1 when the topo server supports key prefix scans (all but ZooKeeper do). When the topo server does not support them (or the response message exceeds the max message size), we fall back to the shard-by-shard method but do so concurrently. Either way, the total time taken to get all of the shards in a keyspace should improve dramatically.
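
The two strategies described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `fakeTopo`, `listShards`, `shardNames`, `getShard`, `findAllShards`, and `errNoPrefixScan` are hypothetical names, an in-memory map stands in for the topo server, and only the "prefix scans unsupported" fallback trigger is modeled (not the max message size case).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNoPrefixScan is a hypothetical error for backends without prefix scans.
var errNoPrefixScan = errors.New("prefix scan not supported")

// fakeTopo stands in for a topo server and counts the calls it receives.
type fakeTopo struct {
	mu           sync.Mutex
	calls        int
	supportsList bool
	shards       map[string]string // shard name -> shard record
}

func (t *fakeTopo) countCall() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.calls++
}

// listShards returns every shard record in one call when prefix scans work.
func (t *fakeTopo) listShards() (map[string]string, error) {
	t.countCall()
	if !t.supportsList {
		return nil, errNoPrefixScan
	}
	out := make(map[string]string, len(t.shards))
	for name, rec := range t.shards {
		out[name] = rec
	}
	return out, nil
}

// shardNames is the "get all shard names" call from the description.
func (t *fakeTopo) shardNames() []string {
	t.countCall()
	names := make([]string, 0, len(t.shards))
	for name := range t.shards {
		names = append(names, name)
	}
	return names
}

// getShard fetches a single shard record (errors omitted to keep this short).
func (t *fakeTopo) getShard(name string) string {
	t.countCall()
	return t.shards[name]
}

// findAllShards tries one list call, then falls back to bounded concurrent gets.
func findAllShards(t *fakeTopo, concurrency int) (map[string]string, error) {
	res, err := t.listShards()
	if err == nil {
		return res, nil
	}
	if !errors.Is(err, errNoPrefixScan) {
		return nil, err
	}
	if concurrency < 1 {
		concurrency = 1
	}
	var (
		wg sync.WaitGroup
		mu sync.Mutex
	)
	result := make(map[string]string)
	sem := make(chan struct{}, concurrency) // caps the number of in-flight gets
	for _, name := range t.shardNames() {
		wg.Add(1)
		sem <- struct{}{}
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()
			rec := t.getShard(name)
			mu.Lock()
			defer mu.Unlock()
			result[name] = rec
		}(name)
	}
	wg.Wait()
	return result, nil
}

func main() {
	shards := map[string]string{"-40": "a", "40-80": "b", "80-c0": "c", "c0-": "d"}
	for _, supportsList := range []bool{true, false} {
		t := &fakeTopo{shards: shards, supportsList: supportsList}
		res, err := findAllShards(t, 2)
		fmt.Println(len(res), err, t.calls)
	}
}
```

In this sketch the fallback path still issues one call per shard, so the saving there comes from overlapping the round trips rather than from reducing their number.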

Related Issues

Fixes: Bug Report: Cross Region Topo Server Can Cause VReplication Client Commands to Timeout #15048

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

Signed-off-by: Matt Lord <[email protected]>

vitess-bot · 2024-01-26T16:40:24Z

Signed-off-by: Matt Lord <[email protected]>

frouioui · 2024-01-26T17:41:47Z

@mattlord, I think it would be great to add a benchmark test for this. Is it feasible to add an E2E benchmark test, maybe not with 128 shards, but at least a significant number?

ajm188

this rules, just one question, but happy to approve whenever you're ready for a final pass

go/vt/topo/keyspace.go

ajm188 · 2024-01-26T17:43:58Z

go/vt/topo/keyspace.go

 	}
 	if len(result) == 0 {
 		return nil, vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "%v has no serving shards", keyspace)
 	}
+	// Sort the shards by KeyRange for deterministic results.
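
To illustrate what sorting by key range for deterministic results might look like, here is a small sketch. It assumes shard names of the form `start-end` in hex (such as `-80` or `80-c0`) and orders them by the decoded start key. `startKey` and `sortByKeyRange` are hypothetical helpers, not the functions used in this PR, and names without a dash are rejected here for simplicity.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// startKey decodes the hex start of a shard name such as "40-80" or "-80".
// An empty start (as in "-80") decodes to an empty slice and sorts first.
func startKey(shard string) ([]byte, error) {
	start, _, ok := strings.Cut(shard, "-")
	if !ok {
		return nil, fmt.Errorf("shard %q is not a key range", shard)
	}
	return hex.DecodeString(start)
}

// sortByKeyRange sorts shard names in place by their start key.
func sortByKeyRange(shards []string) error {
	keys := make(map[string][]byte, len(shards))
	for _, s := range shards {
		k, err := startKey(s)
		if err != nil {
			return err
		}
		keys[s] = k
	}
	sort.Slice(shards, func(i, j int) bool {
		return bytes.Compare(keys[shards[i]], keys[shards[j]]) < 0
	})
	return nil
}

func main() {
	shards := []string{"c0-", "-40", "80-c0", "40-80"}
	if err := sortByKeyRange(shards); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(shards)
}
```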


codecov · 2024-01-26T17:44:16Z

Codecov Report

Attention: 24 lines in your changes are missing coverage. Please review.

Comparison is base (44d6a6b) 47.49% compared to head (484c7dc) 47.70%.
Report is 20 commits behind head on main.

❗ Current head 484c7dc differs from pull request most recent head 3cdba40. Consider uploading reports for the commit 3cdba40 to get more accurate results

Files	Patch %	Lines
go/vt/topo/keyspace.go	64.81%	14 Missing and 5 partials ⚠️
go/vt/topo/memorytopo/lock.go	50.00%	2 Missing ⚠️
go/vt/topo/memorytopo/memorytopo.go	50.00%	2 Missing ⚠️
go/vt/vtctl/workflow/utils.go	83.33%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #15047      +/-   ##
==========================================
+ Coverage   47.49%   47.70%   +0.21%     
==========================================
  Files        1149     1155       +6     
  Lines      239387   240181     +794     
==========================================
+ Hits       113692   114577     +885     
+ Misses     117102   117001     -101     
- Partials     8593     8603      +10

go/vt/topo/keyspace.go

Signed-off-by: Matt Lord <[email protected]>

deepthi

Nice optimization.
As a follow up we can probably do another PR to replace the call to GetServingShards in materializer.go with a special purpose func that returns only the first shard. Because that's all we ever use in that function.

deepthi · 2024-01-29T04:42:16Z

go/vt/topo/keyspace.go

+	if IsErrType(err, NoNode) {
+		// The path doesn't exist, let's see if the keyspace exists.
+		_, kerr := ts.GetKeyspace(ctx, keyspace)
+		if kerr == nil {


nit: I believe it is idiomatic to handle the not-nil case first. In fact we should probably link to the standard golang style guide on our website for developers to use as guidance. It will be even better if we can somehow enforce golang style in CI so that it doesn't depend on reviewers' preferences.

I couldn't find anything in the style guide on this and there are places we do it this way, even within this same file (Alain in GetShardNames()), but it's definitely more typical and there's no reason NOT to do that here so I'll change it.

generally true, yes (exception is in a switch-case)

i like the idea of a CI check but it's probably not really feasible since a lot of the guidelines take the form of "prefer X" or "avoid Y" without outright banning many things. a lot (all?) of the mandates are covered by gofmt + go vet
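
For readers unfamiliar with the idiom under discussion, a minimal sketch of the "handle the non-nil error first" shape follows. The names `errNoNode`, `lookupKeyspace`, and `shardsOrEmpty` are hypothetical stand-ins and not the code in `keyspace.go`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoNode is a hypothetical stand-in for a "node doesn't exist" topo error.
var errNoNode = errors.New("node doesn't exist")

// lookupKeyspace is a hypothetical keyspace existence check.
func lookupKeyspace(known map[string]bool, keyspace string) error {
	if !known[keyspace] {
		return errNoNode
	}
	return nil
}

// shardsOrEmpty handles the error case first and returns early, which keeps
// the success path unindented at the end of the function.
func shardsOrEmpty(known map[string]bool, keyspace string) ([]string, error) {
	if err := lookupKeyspace(known, keyspace); err != nil {
		return nil, fmt.Errorf("keyspace %q: %w", keyspace, err)
	}
	// The keyspace exists but has no shards yet.
	return []string{}, nil
}

func main() {
	known := map[string]bool{"commerce": true}
	fmt.Println(shardsOrEmpty(known, "commerce"))
	fmt.Println(shardsOrEmpty(known, "missing"))
}
```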

deepthi · 2024-01-29T04:47:42Z

go/vt/topo/keyspace.go

+// This uses a heuristic based on the number of vCPUs available -- where it's
+// assumed that as larger machines are used for Vitess deployments they will
+// be able to do more concurrently.
+var DefaultConcurrency = runtime.NumCPU()


We are defaulting this to 32 for GetTablets so we now have two separate defaults depending on whether we are getting tablets or shards. The reasoning for 32 is covered by this comment. #14693 (comment)
I'm not completely opposed to changing the default for GetTablets to this value, but I'd prefer to change this to 32 for the reasons listed in that comment.

OK, I see that you were basing it on fs.Int64Var(&topoReadConcurrency, "topo_read_concurrency", 32, "Concurrency of topo reads."), but you didn't reference the value. I'll see if I can use the variable somehow in both places.

I addressed both of your comments here: c7a27d0

ajm188

approving, pending resolution of the concurrency default (which i don't have strong feelings on)

ajm188 · 2024-01-29T13:48:21Z

go/vt/topo/keyspace.go

+	if IsErrType(err, NoNode) {
+		// The path doesn't exist, let's see if the keyspace exists.
+		_, kerr := ts.GetKeyspace(ctx, keyspace)
+		if kerr == nil {


generally true, yes (exception is in a switch-case)

i like the idea of a CI check but it's probably not really feasible since a lot of the guidelines take the form of "prefer X" or "avoid Y" without outright banning many things. a lot (all?) of the mandates are covered by gofmt + go vet

frouioui · 2024-01-29T14:33:44Z

go/vt/schemamanager/tablet_executor.go

+	shards, err := exec.ts.FindAllShardsInKeyspace(ctx, keyspace, &topo.FindAllShardsInKeyspaceOptions{
+		Concurrency: topo.DefaultConcurrency, // Limit concurrency to avoid overwhelming the topo server.
+	})


Nit picking here but we are defining a default &topo.FindAllShardsInKeyspaceOptions in three different places, perhaps we should create a global default variable and use that here and in the two other places.

In this file, in go/vt/topo/test/shard.go and in go/vt/vtctl/workflow/utils.go
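
One way to avoid repeating the same options literal at each call site is a single shared constructor. The sketch below is only an illustration of that suggestion: `findOptions`, `defaultFindOptions`, and `defaultConcurrency` are hypothetical names, and the value 32 is taken from the concurrency discussion elsewhere in this review.

```go
package main

import "fmt"

// defaultConcurrency is an assumed default for this sketch.
const defaultConcurrency = 32

// findOptions is a hypothetical stand-in for an options struct.
type findOptions struct {
	Concurrency int
}

// defaultFindOptions returns a fresh copy on every call so that one caller
// changing a field cannot affect any other caller.
func defaultFindOptions() *findOptions {
	return &findOptions{Concurrency: defaultConcurrency}
}

func main() {
	opts := defaultFindOptions()
	fmt.Println(opts.Concurrency)
}
```

A function is used here instead of a shared package-level pointer because a shared pointer would let any call site mutate the defaults for everyone.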

frouioui · 2024-01-29T14:43:23Z

go/vt/topo/keyspace.go

+		result := make(map[string]*ShardInfo, len(listResults))
+		for _, entry := range listResults {
+			// The shard key looks like this: /vitess/global/keyspaces/commerce/shards/-80/Shard
+			shardKey := string(entry.Key)
+			shardName := path.Base(path.Dir(shardKey)) // The base part of the dir is "-80"
+			// Validate the extracted shard name.


nit picking here too, it might read better if we extract the content of this entire if block. it seems like we only need to send in listResults as an argument and return results, err, something like:

if err == nil { return handleResults(listResults) }

Not sure it's more readable, but: 3fb3e04
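
The key parsing shown in the hunk above, pulled out into a helper as suggested, might look roughly like this. It is a sketch under assumptions: `entry`, `shardNameFromKey`, and `handleResults` are hypothetical names, and the validation is deliberately simpler than whatever the real code does.

```go
package main

import (
	"fmt"
	"path"
)

// entry is a hypothetical key/value pair as returned by a prefix listing.
type entry struct {
	Key   []byte
	Value []byte
}

// shardNameFromKey extracts "-80" from ".../shards/-80/Shard".
func shardNameFromKey(key string) (string, error) {
	name := path.Base(path.Dir(key)) // the parent directory's base name
	if name == "." || name == "/" {
		return "", fmt.Errorf("no shard name in key %q", key)
	}
	return name, nil
}

// handleResults maps each listed entry to its shard name.
func handleResults(entries []entry) (map[string][]byte, error) {
	result := make(map[string][]byte, len(entries))
	for _, e := range entries {
		name, err := shardNameFromKey(string(e.Key))
		if err != nil {
			return nil, err
		}
		result[name] = e.Value
	}
	return result, nil
}

func main() {
	entries := []entry{
		{Key: []byte("/vitess/global/keyspaces/commerce/shards/-80/Shard")},
		{Key: []byte("/vitess/global/keyspaces/commerce/shards/80-/Shard")},
	}
	result, err := handleResults(entries)
	fmt.Println(len(result), err)
}
```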

frouioui

It looks good to me! I will leave it up to @deepthi to give a second approval once her feedback has been addressed 😃

Signed-off-by: Matt Lord <[email protected]>

deepthi

It will be nice to address the int64 vs int issue. Rest LGTM.

deepthi · 2024-01-29T15:44:58Z

go/vt/topo/keyspace.go

+var DefaultConcurrency int64
+
+func registerFlags(fs *pflag.FlagSet) {
+	fs.Int64Var(&DefaultConcurrency, "topo_read_concurrency", 32, "Concurrency of topo reads.")


I like this! Instead of hard-coding the default for all but healthcheck, we now allow this to be customized for any binary that imports the topo package.
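
To show how a single flag-backed variable can supply the default everywhere, here is a sketch using the standard library `flag` package in place of `pflag` (a substitution made so the example has no external dependencies). `defaultConcurrency` and `parse` are hypothetical names. The flag name and default mirror the hunk above.

```go
package main

import (
	"flag"
	"fmt"
)

// defaultConcurrency is set by the flag below and read by every call site.
var defaultConcurrency int64

// registerFlags binds the variable to the flag with a default of 32.
func registerFlags(fs *flag.FlagSet) {
	fs.Int64Var(&defaultConcurrency, "topo_read_concurrency", 32, "Concurrency of topo reads.")
}

// parse registers the flag on a fresh flag set and parses args.
func parse(args []string) (int64, error) {
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	registerFlags(fs)
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return defaultConcurrency, nil
}

func main() {
	fmt.Println(parse(nil))
	fmt.Println(parse([]string{"--topo_read_concurrency=8"}))
}
```

Registering the flag also assigns the default to the variable, so binaries that never pass the flag still see 32 once `registerFlags` has run.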

deepthi · 2024-01-29T15:53:54Z

go/vt/topo/keyspace.go

+// This file contains keyspace utility functions.
+
+// Default concurrency to use in order to avoid overwhelming the topo server.
+var DefaultConcurrency int64


Ugh, I just realized the flag value was always int64, and there's no good reason for that. Should we change this to int? Is there any reason not to?

No reason not to. int is int64 on 64-bit platforms, and for 32-bit they can't really be using bigger numbers anyway 🤷

You can use int64 on 32 bit machines -- two 32 bit pieces are used. We (the royal we) were using long long / int64 when needed when most machines were still 32 bit architectures.

I'm fine moving this to int though. I initially started there but then preferred not to change the existing flag.

Changed it here (and now using it in another location where we should be): 99a7f86
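
As a small aside on the `int` versus `int64` point: in Go the two are distinct types even where they have the same width, so an `int64` setting needs an explicit conversion wherever an `int` is expected. The sketch below shows this. `newSemaphore` is a hypothetical helper, not part of this PR.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// newSemaphore is a hypothetical helper that takes an int, as many APIs do.
func newSemaphore(concurrency int) chan struct{} {
	return make(chan struct{}, concurrency)
}

func main() {
	// The width of int depends on the platform; int64 is always 64 bits.
	fmt.Println("int width in bits:", strconv.IntSize)

	var big int64 = math.MaxInt32 + 1 // fits in int64 on any architecture
	fmt.Println(big)

	var flagValue int64 = 32
	sem := newSemaphore(int(flagValue)) // explicit conversion required
	fmt.Println(cap(sem))
}
```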

Signed-off-by: Matt Lord <[email protected]>

deepthi · 2024-01-30T07:06:09Z

go/vt/discovery/healthcheck.go

@@ -107,8 +104,6 @@ const (
 	DefaultHealthCheckRetryDelay = 5 * time.Second
 	DefaultHealthCheckTimeout    = 1 * time.Minute

-	// DefaultTopoReadConcurrency is used as the default value for the topoReadConcurrency parameter of a TopologyWatcher.
-	DefaultTopoReadConcurrency int = 5


Good catch. This was unused 🤦

Signed-off-by: deepthi <[email protected]>

deepthi · 2024-01-30T10:06:34Z

Failures in the Code coverage workflow are being caused by flaky tests which will be fixed separately.

Make GetServingShards concurrent

1264391

Signed-off-by: Matt Lord <[email protected]>

github-actions bot added this to the v19.0.0 milestone Jan 26, 2024

mattlord changed the title ~~Topo perf~~ Improve topo server performance and efficiency for shards Jan 26, 2024

mattlord changed the title ~~Improve topo server performance and efficiency for shards~~ Improve TopoServer Performance and Efficiency For Shards Jan 26, 2024

mattlord changed the title ~~Improve TopoServer Performance and Efficiency For Shards~~ Improve TopoServer Performance and Efficiency For Keyspace Shards Jan 26, 2024

mattlord force-pushed the topo_perf branch from cba6a9e to 5aeb539 Compare January 26, 2024 16:47

Get all shards in a keyspace via List when we can

b26f7cf

Signed-off-by: Matt Lord <[email protected]>

mattlord force-pushed the topo_perf branch from 5aeb539 to b26f7cf Compare January 26, 2024 16:52

mattlord removed the NeedsDescriptionUpdate and NeedsWebsiteDocsUpdate labels Jan 26, 2024

mattlord added 2 commits January 26, 2024 12:14

Return error when keyspace doesn't exist

ca2ea86

Signed-off-by: Matt Lord <[email protected]>

Use default concurrency based on vCPUs

cc1aec4

Signed-off-by: Matt Lord <[email protected]>

mattlord force-pushed the topo_perf branch from c3bd500 to cc1aec4 Compare January 26, 2024 17:29

mattlord removed the NeedsBackportReason label Jan 26, 2024

ajm188 reviewed Jan 26, 2024

frouioui reviewed Jan 26, 2024

Improve the logic in processing list results

d4b0609

Signed-off-by: Matt Lord <[email protected]>

mattlord added the Type: Enhancement, Type: Performance, and Component: Topology labels and removed the Type: Enhancement label Jan 26, 2024

Utilize in callsites and improve test coverage

b6286ff

Signed-off-by: Matt Lord <[email protected]>

mattlord force-pushed the topo_perf branch from c374a0b to b6286ff Compare January 26, 2024 20:50

mattlord added 3 commits January 27, 2024 10:29

Add keyrange validity check

8a9f8d5

Signed-off-by: Matt Lord <[email protected]>

Support legacy sequences as shard names

1041e59

Signed-off-by: Matt Lord <[email protected]>

Use existing ValidateShardName function

cebde7a

Signed-off-by: Matt Lord <[email protected]>

mattlord force-pushed the topo_perf branch 3 times, most recently from 01a2344 to c874653 Compare January 27, 2024 16:52

Can't stop nitting... (send help)

4cd8a23

Signed-off-by: Matt Lord <[email protected]>

mattlord force-pushed the topo_perf branch from c874653 to 4cd8a23 Compare January 27, 2024 17:01

deepthi reviewed Jan 29, 2024

ajm188 approved these changes Jan 29, 2024

frouioui reviewed Jan 29, 2024

mattlord added 3 commits January 29, 2024 10:09

Address review comments

c7a27d0

Signed-off-by: Matt Lord <[email protected]>

Address review comment

8a0638e

Signed-off-by: Matt Lord <[email protected]>

Address review comment

3fb3e04

Signed-off-by: Matt Lord <[email protected]>

deepthi approved these changes Jan 29, 2024

mattlord added 3 commits January 29, 2024 12:38

Remove unnecessary opt usage spot (and kick CI)

76e49a0

Signed-off-by: Matt Lord <[email protected]>

Move topo_read_concurrency to int

99a7f86

Signed-off-by: Matt Lord <[email protected]>

Restore the old hardcoded default value for binaries w/o the flag

484c7dc

Signed-off-by: Matt Lord <[email protected]>

ajm188 approved these changes Jan 29, 2024

deepthi approved these changes Jan 30, 2024

healthcheck: remove unused constant

3cdba40

Signed-off-by: deepthi <[email protected]>

deepthi merged commit c156ca2 into vitessio:main Jan 30, 2024
99 of 100 checks passed

deepthi deleted the topo_perf branch January 30, 2024 10:06

mattlord mentioned this pull request Feb 2, 2024

Ignore non-Shard keys in FindAllShardsInKeyspace List impl #15117

Merged

5 tasks

mattlord mentioned this pull request Apr 23, 2024

Properly unescape keyspace name in FindAllShardsInKeyspace #15765

Merged

5 tasks
