Improve TopoServer Performance and Efficiency For Keyspace Shards #15047
Signed-off-by: Matt Lord <[email protected]>
@mattlord, I think it would be great to add a benchmark test for this. Is it feasible to add an E2E benchmark test, maybe not with 128 shards, but at least with a significant number?
this rules, just one question, but happy to approve whenever you're ready for a final pass
}
if len(result) == 0 {
	return nil, vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "%v has no serving shards", keyspace)
}
// Sort the shards by KeyRange for deterministic results.
👍
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15047      +/-   ##
==========================================
+ Coverage   47.49%   47.70%   +0.21%
==========================================
  Files        1149     1155       +6
  Lines      239387   240181     +794
==========================================
+ Hits       113692   114577     +885
+ Misses     117102   117001     -101
- Partials     8593     8603      +10
Force-pushed: 01a2344 to c874653
Nice optimization.
As a follow-up we can probably do another PR to replace the call to GetServingShards in materializer.go with a special-purpose func that returns only the first shard, because that's all we ever use in that function.
go/vt/topo/keyspace.go
Outdated
if IsErrType(err, NoNode) {
	// The path doesn't exist, let's see if the keyspace exists.
	_, kerr := ts.GetKeyspace(ctx, keyspace)
	if kerr == nil {
nit: I believe it is idiomatic to handle the not-nil case first. In fact we should probably link to the standard golang style guide on our website for developers to use as guidance. It will be even better if we can somehow enforce golang style in CI so that it doesn't depend on reviewers' preferences.
I couldn't find anything in the style guide on this and there are places we do it this way, even within this same file (Alain in GetShardNames()), but it's definitely more typical and there's no reason NOT to do that here so I'll change it.
generally true, yes (exception is in a switch-case)
i like the idea of a CI check but it's probably not really feasible since a lot of the guidelines take the form of "prefer X" or "avoid Y" without outright banning many things. a lot (all?) of the mandates are covered by gofmt + go vet
go/vt/topo/keyspace.go
Outdated
// This uses a heuristic based on the number of vCPUs available -- where it's
// assumed that as larger machines are used for Vitess deployments they will
// be able to do more concurrently.
var DefaultConcurrency = runtime.NumCPU()
We are defaulting this to 32 for GetTablets, so we now have two separate defaults depending on whether we are getting tablets or shards. The reasoning for 32 is covered by this comment. #14693 (comment)
I'm not completely opposed to changing the default for GetTablets to this value, but I'd prefer to change this to 32 for the reasons listed in that comment.
OK, I see that you were basing it on fs.Int64Var(&topoReadConcurrency, "topo_read_concurrency", 32, "Concurrency of topo reads."), but you didn't reference the value. I'll see if I can use the variable somehow in both places.
I addressed both of your comments here: c7a27d0
approving, pending resolution of the concurrency default (which i don't have strong feelings on)
shards, err := exec.ts.FindAllShardsInKeyspace(ctx, keyspace, &topo.FindAllShardsInKeyspaceOptions{
	Concurrency: topo.DefaultConcurrency, // Limit concurrency to avoid overwhelming the topo server.
})
Nit picking here but we are defining a default &topo.FindAllShardsInKeyspaceOptions in three different places, perhaps we should create a global default variable and use that here and in the two other places.
In this file, in go/vt/topo/test/shard.go and in go/vt/vtctl/workflow/utils.go
go/vt/topo/keyspace.go
Outdated
result := make(map[string]*ShardInfo, len(listResults))
for _, entry := range listResults {
	// The shard key looks like this: /vitess/global/keyspaces/commerce/shards/-80/Shard
	shardKey := string(entry.Key)
	shardName := path.Base(path.Dir(shardKey)) // The base part of the dir is "-80"
	// Validate the extracted shard name.
nit picking here too, it might read better if we extract the content of this entire if block. it seems like we only need to send in listResults as an argument and return results, err, something like:
if err == nil {
	return handleResults(listResults)
}
Not sure it's more readable, but: 3fb3e04
It looks good to me! I will leave it up to @deepthi to give a second approval once her feedback has been addressed 😃
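The shard-name extraction quoted at the top of this thread can be tried in isolation. This is a minimal sketch, assuming the key layout shown in the quoted comment (`.../shards/<name>/Shard`); the helper name `shardNameFromKey` and its structural checks are hypothetical and are not the validation the PR performs.

```go
package main

import (
	"fmt"
	"path"
)

// shardNameFromKey (hypothetical) extracts the shard name from a key of the
// form .../shards/<name>/Shard using path.Base(path.Dir(key)).
func shardNameFromKey(key string) (string, error) {
	if path.Base(key) != "Shard" {
		return "", fmt.Errorf("key %q does not end in a Shard record", key)
	}
	shardDir := path.Dir(key) // .../shards/<name>
	if path.Base(path.Dir(shardDir)) != "shards" {
		return "", fmt.Errorf("key %q is not under a shards directory", key)
	}
	return path.Base(shardDir), nil
}

func main() {
	name, err := shardNameFromKey("/vitess/global/keyspaces/commerce/shards/-80/Shard")
	fmt.Println(name, err)
}
```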
It will be nice to address the int64 vs int issue. Rest LGTM.
go/vt/topo/keyspace.go
Outdated
var DefaultConcurrency int64

func registerFlags(fs *pflag.FlagSet) {
	fs.Int64Var(&DefaultConcurrency, "topo_read_concurrency", 32, "Concurrency of topo reads.")
I like this! Instead of hard-coding the default for all but healthcheck, we now allow this to be customized for any binary that imports the topo package.
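A minimal sketch of the pattern being praised here: one package-level default that is bound to a flag, so every caller reads the same value. To stay self-contained it uses the standard library `flag` package rather than pflag, and an `int` rather than `int64`; the names `defaultReadConcurrency` and `parseConcurrency` are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
)

// defaultReadConcurrency is a hypothetical package-level default that
// callers share instead of hard-coding their own values.
var defaultReadConcurrency int

// registerFlags binds the shared default to a flag, with 32 as the fallback.
func registerFlags(fs *flag.FlagSet) {
	fs.IntVar(&defaultReadConcurrency, "topo_read_concurrency", 32, "Concurrency of topo reads.")
}

// parseConcurrency registers the flag on a fresh FlagSet, parses args,
// and returns the resulting shared value.
func parseConcurrency(args []string) (int, error) {
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	registerFlags(fs)
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return defaultReadConcurrency, nil
}

func main() {
	n, err := parseConcurrency([]string{"--topo_read_concurrency=8"})
	fmt.Println(n, err)
}
```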
go/vt/topo/keyspace.go
Outdated
// This file contains keyspace utility functions.

// Default concurrency to use in order to avoid overwhelming the topo server.
var DefaultConcurrency int64
Ugh, I just realized the flag value was always int64, and there's no good reason for that. Should we change this to int? Is there any reason not to?
No reason not to. int is int64 on 64-bit platforms, and for 32-bit they can't really be using bigger numbers anyway 🤷
You can use int64 on 32 bit machines -- two 32 bit pieces are used. We (the royal we) were using long long / int64 when needed when most machines were still 32 bit architectures.
I'm fine moving this to int though. I initially started there but then preferred not to change the existing flag.
Changed it here (and now using it in another location where we should be): 99a7f86
@@ -107,8 +104,6 @@ const (
	DefaultHealthCheckRetryDelay = 5 * time.Second
	DefaultHealthCheckTimeout    = 1 * time.Minute

	// DefaultTopoReadConcurrency is used as the default value for the topoReadConcurrency parameter of a TopologyWatcher.
	DefaultTopoReadConcurrency int = 5
Good catch. This was unused 🤦
Signed-off-by: deepthi <[email protected]>
Failures in the Code coverage workflow are being caused by flaky tests which will be fixed separately.
Description
There are various cases (e.g. when working with VReplication workflows) where we get all of the [serving] shards in a keyspace. To do this we were first getting a list of all shard names, then getting each shard record serially. This resulted in many topo server calls, and when the topo server has high latency it could cause various commands to time out, with knock-on effects elsewhere because long-running reads block other operations (especially with older etcd versions).
For example, if you had a keyspace with 128 shards then you would make 129 topo server calls when getting all [serving] shards: 1 call to list the shard names, then 128 calls to get each shard record.
With this PR you go from 129 topo server calls in this example down to 1 when the topo server supports key prefix scans (all but ZooKeeper do). When the topo server does not support them, or the response message exceeds the max message size, we fall back to the shard-by-shard method but fetch the shards concurrently. Either way, the total time taken to get all of the shards in a keyspace should improve dramatically.
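The strategy described above can be sketched against an in-memory stand-in. Everything here (`fakeTopo`, `errNoPrefixScan`, `findAllShards`) is hypothetical and is not the Vitess API. The sketch models only the "prefix scans unsupported" fallback, not the max-message-size case, and it counts the failed scan attempt as a call.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// errNoPrefixScan is a hypothetical error for backends without prefix scans.
var errNoPrefixScan = errors.New("prefix scan not supported")

// fakeTopo is a hypothetical in-memory stand-in for a topo server.
type fakeTopo struct {
	calls          int64             // simulated topo server calls
	records        map[string]string // shard name -> shard record
	supportsPrefix bool
}

func newFakeTopo(shards int, supportsPrefix bool) *fakeTopo {
	t := &fakeTopo{records: make(map[string]string, shards), supportsPrefix: supportsPrefix}
	for i := 0; i < shards; i++ {
		name := fmt.Sprintf("shard%03d", i)
		t.records[name] = "record for " + name
	}
	return t
}

// listAll returns every shard record in one call, if prefix scans are supported.
func (t *fakeTopo) listAll() (map[string]string, error) {
	atomic.AddInt64(&t.calls, 1)
	if !t.supportsPrefix {
		return nil, errNoPrefixScan
	}
	out := make(map[string]string, len(t.records))
	for name, rec := range t.records {
		out[name] = rec
	}
	return out, nil
}

// names lists the shard names in one call.
func (t *fakeTopo) names() []string {
	atomic.AddInt64(&t.calls, 1)
	out := make([]string, 0, len(t.records))
	for name := range t.records {
		out = append(out, name)
	}
	return out
}

// get fetches a single shard record in one call.
func (t *fakeTopo) get(name string) string {
	atomic.AddInt64(&t.calls, 1)
	return t.records[name]
}

// findAllShards tries one scan, then falls back to bounded concurrent reads.
func findAllShards(t *fakeTopo, concurrency int) (map[string]string, error) {
	listed, err := t.listAll()
	if err == nil {
		return listed, nil
	}
	if !errors.Is(err, errNoPrefixScan) {
		return nil, err
	}
	if concurrency < 1 {
		concurrency = 1
	}
	names := t.names()
	result := make(map[string]string, len(names))
	var mu sync.Mutex
	var wg sync.WaitGroup
	sem := make(chan struct{}, concurrency) // caps the number of in-flight reads
	for _, name := range names {
		wg.Add(1)
		sem <- struct{}{}
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()
			rec := t.get(name)
			mu.Lock()
			result[name] = rec
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	return result, nil
}

func main() {
	for _, supports := range []bool{true, false} {
		t := newFakeTopo(128, supports)
		shards, err := findAllShards(t, 8)
		fmt.Println(supports, len(shards), atomic.LoadInt64(&t.calls), err)
	}
}
```

The buffered channel acts as a semaphore, so at most `concurrency` reads are outstanding at once; that is the part that keeps the fallback from overwhelming the topo server.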
Related Issues
Checklist