Improve TopoServer Performance and Efficiency For Keyspace Shards #15047

Merged
merged 22 commits into from
Jan 30, 2024

Conversation

mattlord
Contributor

@mattlord mattlord commented Jan 26, 2024

Description

There are various cases (e.g. when working with VReplication workflows) where we get all of the [serving] shards in a keyspace. To do this we were first getting a list of all shard names, then getting each shard record serially. This resulted in many topo server calls, and when the topo server has high latency it could cause various commands to time out. It could also have knock-on effects on other operations, because long-running reads block them (especially with older etcd versions).

For example, if you had a keyspace with 128 shards then you would make 129 topo server calls when getting all [serving] shards:

  1. Get all shard names (1)
  2. Get each shard record serially (128)

With this PR, the example above goes from 129 topo server calls down to 1 when the topo server supports key prefix scans (all but ZooKeeper do). When the topo server does not support them (or the response message is beyond the max message size), we fall back to the shard-by-shard method but do so concurrently. Either way, the total time taken to get all of the shards in a keyspace should improve dramatically.
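
To make the call counts above concrete, here is a minimal sketch of the pattern (one prefix scan, with a bounded concurrent fallback). It is not the Vitess implementation: `memStore`, `ListShards`, `ShardNames`, `GetShard`, `findAllShards` and `errNoPrefixScan` are hypothetical names invented for illustration, and the store is a plain in-memory map that only counts simulated round trips.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNoPrefixScan is a hypothetical sentinel for backends without prefix scans.
var errNoPrefixScan = errors.New("prefix scan not supported")

// memStore is a hypothetical in-memory stand-in for one keyspace in a topo server.
type memStore struct {
	mu         sync.Mutex
	shards     map[string]string // shard name -> shard record
	prefixScan bool
	calls      int // simulated topo server round trips
}

func (m *memStore) count() { m.mu.Lock(); m.calls++; m.mu.Unlock() }

// ListShards simulates a single prefix scan that returns every shard record.
func (m *memStore) ListShards() (map[string]string, error) {
	m.count()
	if !m.prefixScan {
		return nil, errNoPrefixScan
	}
	return m.shards, nil
}

func (m *memStore) ShardNames() []string {
	m.count()
	names := make([]string, 0, len(m.shards))
	for name := range m.shards {
		names = append(names, name)
	}
	return names
}

func (m *memStore) GetShard(name string) (string, error) {
	m.count()
	rec, ok := m.shards[name]
	if !ok {
		return "", fmt.Errorf("shard %q not found", name)
	}
	return rec, nil
}

// findAllShards tries one prefix scan, then falls back to bounded concurrent gets.
func findAllShards(s *memStore, concurrency int) (map[string]string, error) {
	result, err := s.ListShards()
	if err == nil || !errors.Is(err, errNoPrefixScan) {
		return result, err
	}
	names := s.ShardNames()
	result = make(map[string]string, len(names))
	if concurrency < 1 {
		concurrency = 1
	}
	var mu sync.Mutex
	var wg sync.WaitGroup
	var errs error
	sem := make(chan struct{}, concurrency) // limits in-flight reads
	for _, name := range names {
		wg.Add(1)
		sem <- struct{}{}
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()
			rec, err := s.GetShard(name)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errs = errors.Join(errs, err)
				return
			}
			result[name] = rec
		}(name)
	}
	wg.Wait()
	if errs != nil {
		return nil, errs
	}
	return result, nil
}

func newStore(prefixScan bool) *memStore {
	shards := map[string]string{"-40": "a", "40-80": "b", "80-c0": "c", "c0-": "d"}
	return &memStore{shards: shards, prefixScan: prefixScan}
}

func main() {
	for _, scan := range []bool{true, false} {
		s := newStore(scan)
		shards, err := findAllShards(s, 2)
		fmt.Println(scan, len(shards), s.calls, err)
	}
}
```

In this sketch the fallback path still pays for the failed scan plus the name listing, but the per-shard reads overlap instead of running back to back, which is where the latency win would come from on a slow topo server.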

Related Issues

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Contributor

vitess-bot bot commented Jan 26, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Jan 26, 2024
@github-actions github-actions bot added this to the v19.0.0 milestone Jan 26, 2024
@mattlord mattlord changed the title from "Topo perf" to "Improve topo server performance and efficiency for shards" Jan 26, 2024
@mattlord mattlord changed the title from "Improve topo server performance and efficiency for shards" to "Improve TopoServer Performance and Efficiency For Shards" Jan 26, 2024
@mattlord mattlord changed the title from "Improve TopoServer Performance and Efficiency For Shards" to "Improve TopoServer Performance and Efficiency For Keyspace Shards" Jan 26, 2024
@mattlord mattlord removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Jan 26, 2024
@mattlord mattlord removed the NeedsBackportReason If backport labels have been applied to a PR, a justification is required label Jan 26, 2024
@frouioui
Member

@mattlord, I think it would be great to add a benchmark test for this. Is it feasible to add an E2E benchmark test, maybe not with 128 shards, but at least with a significant number?

Contributor

@ajm188 ajm188 left a comment

this rules, just one question, but happy to approve whenever you're ready for a final pass

go/vt/topo/keyspace.go (outdated)
}
if len(result) == 0 {
return nil, vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "%v has no serving shards", keyspace)
}
// Sort the shards by KeyRange for deterministic results.
Contributor

👍


codecov bot commented Jan 26, 2024

Codecov Report

Attention: 24 lines in your changes are missing coverage. Please review.

Comparison is base (44d6a6b) 47.49% compared to head (484c7dc) 47.70%.
Report is 20 commits behind head on main.

❗ Current head 484c7dc differs from pull request most recent head 3cdba40. Consider uploading reports for the commit 3cdba40 to get more accurate results

Files | Patch % | Lines
go/vt/topo/keyspace.go | 64.81% | 14 Missing and 5 partials ⚠️
go/vt/topo/memorytopo/lock.go | 50.00% | 2 Missing ⚠️
go/vt/topo/memorytopo/memorytopo.go | 50.00% | 2 Missing ⚠️
go/vt/vtctl/workflow/utils.go | 83.33% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15047      +/-   ##
==========================================
+ Coverage   47.49%   47.70%   +0.21%     
==========================================
  Files        1149     1155       +6     
  Lines      239387   240181     +794     
==========================================
+ Hits       113692   114577     +885     
+ Misses     117102   117001     -101     
- Partials     8593     8603      +10     

@mattlord mattlord added Type: Enhancement Logical improvement (somewhere between a bug and feature) Type: Performance Component: Topology and removed Type: Enhancement Logical improvement (somewhere between a bug and feature) labels Jan 26, 2024
@mattlord mattlord force-pushed the topo_perf branch 3 times, most recently from 01a2344 to c874653 Compare January 27, 2024 16:52
Member

@deepthi deepthi left a comment

Nice optimization.
As a follow-up we can probably do another PR to replace the call to GetServingShards in materializer.go with a special-purpose func that returns only the first shard, because that's all we ever use in that function.

if IsErrType(err, NoNode) {
// The path doesn't exist, let's see if the keyspace exists.
_, kerr := ts.GetKeyspace(ctx, keyspace)
if kerr == nil {
Member

nit: I believe it is idiomatic to handle the not-nil case first. In fact we should probably link to the standard golang style guide on our website for developers to use as guidance. It will be even better if we can somehow enforce golang style in CI so that it doesn't depend on reviewers' preferences.

Contributor Author

I couldn't find anything in the style guide on this and there are places we do it this way, even within this same file (Alain in GetShardNames()), but it's definitely more typical and there's no reason NOT to do that here so I'll change it.

Contributor

generally true, yes (exception is in a switch-case)

i like the idea of a CI check but it's probably not really feasible since a lot of the guidelines take the form of "prefer X" or "avoid Y" without outright banning many things. a lot (all?) of the mandates are covered by gofmt + go vet

// This uses a heuristic based on the number of vCPUs available -- where it's
// assumed that as larger machines are used for Vitess deployments they will
// be able to do more concurrently.
var DefaultConcurrency = runtime.NumCPU()
Member

@deepthi deepthi Jan 29, 2024

We are defaulting this to 32 for GetTablets so we now have two separate defaults depending on whether we are getting tablets or shards. The reasoning for 32 is covered by this comment. #14693 (comment)
I'm not completely opposed to changing the default for GetTablets to this value, but I'd prefer to change this to 32 for the reasons listed in that comment.

Contributor Author

@mattlord mattlord Jan 29, 2024

OK, I see that you were basing it on fs.Int64Var(&topoReadConcurrency, "topo_read_concurrency", 32, "Concurrency of topo reads."), but you didn't reference the value. I'll see if I can use the variable somehow in both places.

Contributor Author

I addressed both of your comments here: c7a27d0

Contributor

@ajm188 ajm188 left a comment

approving, pending resolution of the concurrency default (which i don't have strong feelings on)

Comment on lines 110 to 112
shards, err := exec.ts.FindAllShardsInKeyspace(ctx, keyspace, &topo.FindAllShardsInKeyspaceOptions{
Concurrency: topo.DefaultConcurrency, // Limit concurrency to avoid overwhelming the topo server.
})
Member

Nitpicking here, but we are defining a default &topo.FindAllShardsInKeyspaceOptions in three different places; perhaps we should create a global default variable and use it here and in the two other places.

Member

In this file, in go/vt/topo/test/shard.go and in go/vt/vtctl/workflow/utils.go

Comment on lines 207 to 212
result := make(map[string]*ShardInfo, len(listResults))
for _, entry := range listResults {
// The shard key looks like this: /vitess/global/keyspaces/commerce/shards/-80/Shard
shardKey := string(entry.Key)
shardName := path.Base(path.Dir(shardKey)) // The base part of the dir is "-80"
// Validate the extracted shard name.
Member

nit picking here too, it might read better if we extract the content of this entire if block. it seems like we only need to send in listResults as an argument and return results, err, something like:

if err == nil {
	return handleResults(listResults)
}

Contributor Author

Not sure it's more readable, but: 3fb3e04
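
For readers following the shard key parsing discussed in this thread, a minimal standalone sketch of the extraction step is shown here. The helper name `shardNameFromKey` and its validation rules are assumptions made for illustration, not the code from the commit; the only parts taken from the reviewed snippet are the key layout and the `path.Base(path.Dir(...))` idiom.

```go
package main

import (
	"fmt"
	"path"
)

// shardNameFromKey is a hypothetical helper. Given a topo key such as
// /vitess/global/keyspaces/commerce/shards/-80/Shard it returns "-80".
func shardNameFromKey(key string) (string, error) {
	// The last element must be the shard record file itself.
	if path.Base(key) != "Shard" {
		return "", fmt.Errorf("key %q is not a shard record", key)
	}
	dir := path.Dir(key) // .../shards/-80
	// The parent of the shard directory must be the "shards" directory.
	if path.Base(path.Dir(dir)) != "shards" {
		return "", fmt.Errorf("key %q is not under a shards directory", key)
	}
	return path.Base(dir), nil // the base part of the dir is the shard name
}

func main() {
	for _, key := range []string{
		"/vitess/global/keyspaces/commerce/shards/-80/Shard",
		"/vitess/global/keyspaces/commerce/Keyspace",
	} {
		name, err := shardNameFromKey(key)
		fmt.Printf("key=%s name=%q err=%v\n", key, name, err)
	}
}
```

Pulling the parsing into a small function of this shape keeps the validation testable on its own, which is roughly the readability argument made in the review comment.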

Member

@frouioui frouioui left a comment

It looks good to me! I will leave it up to @deepthi to give a second approval once her feedback has been addressed 😃

Signed-off-by: Matt Lord <[email protected]>
Member

@deepthi deepthi left a comment

It will be nice to address the int64 vs int issue. Rest LGTM.

var DefaultConcurrency int64

func registerFlags(fs *pflag.FlagSet) {
fs.Int64Var(&DefaultConcurrency, "topo_read_concurrency", 32, "Concurrency of topo reads.")
Member

I like this! Instead of hard-coding the default for all but healthcheck, we now allow this to be customized for any binary that imports the topo package.

// This file contains keyspace utility functions.

// Default concurrency to use in order to avoid overwhelming the topo server.
var DefaultConcurrency int64
Member

Ugh, I just realized the flag value was always int64, and there's no good reason for that. Should we change this to int? Is there any reason not to?

Contributor

No reason not to. int is int64 on 64-bit platforms, and for 32-bit they can't really be using bigger numbers anyway 🤷

Contributor Author

You can use int64 on 32 bit machines -- two 32 bit pieces are used. We (the royal we) were using long long / int64 when needed when most machines were still 32 bit architectures.

Contributor Author

I'm fine moving this to int though. I initially started there but then preferred not to change the existing flag.

Contributor Author

Changed it here (and now using it in another location where we should be): 99a7f86

@@ -107,8 +104,6 @@ const (
DefaultHealthCheckRetryDelay = 5 * time.Second
DefaultHealthCheckTimeout = 1 * time.Minute

// DefaultTopoReadConcurrency is used as the default value for the topoReadConcurrency parameter of a TopologyWatcher.
DefaultTopoReadConcurrency int = 5
Member

Good catch. This was unused 🤦

@deepthi
Member

deepthi commented Jan 30, 2024

Failures in the Code coverage workflow are being caused by flaky tests which will be fixed separately.

@deepthi deepthi merged commit c156ca2 into vitessio:main Jan 30, 2024
99 of 100 checks passed
@deepthi deepthi deleted the topo_perf branch January 30, 2024 10:06
Successfully merging this pull request may close these issues.

Bug Report: Cross Region Topo Server Can Cause VReplication Client Commands to Timeout
4 participants