-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Limit concurrent creation of healthcheck gRPC connections #15053
Limit concurrent creation of healthcheck gRPC connections #15053
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
On 2nd thought, |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15053 +/- ##
==========================================
- Coverage 67.41% 65.44% -1.98%
==========================================
Files 1560 1562 +2
Lines 192752 193677 +925
==========================================
- Hits 129952 126751 -3201
- Misses 62800 66926 +4126 ☔ View full report in Codecov by Sentry. |
cb2638e
to
24142aa
Compare
should a VTGate be cell-bound then to connect to ~10k vttablets? |
@harshit-gangal the Wherever possible our |
39e99ba
to
223967b
Compare
} | ||
|
||
func (thc *tabletHealthCheck) connectionLocked() queryservice.QueryService { | ||
func healthCheckDialerFactory(hc *HealthCheckImpl) func(ctx context.Context, addr string) (net.Conn, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If other parts of the code could benefit from a similar limit, perhaps this should be moved to a helper/util library or baked into grpcclient? 🤔 cc @deepthi (and anyone else) for thoughts
66555dc
to
fdae21d
Compare
go/vt/grpcclient/client.go
Outdated
@@ -39,6 +40,7 @@ import ( | |||
) | |||
|
|||
var ( | |||
grpcOptionsMu sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added to fix data race:
WARNING: DATA RACE
Read at 0x000003e57f70 by goroutine 71:
vitess.io/vitess/go/vt/grpcclient.RegisterGRPCDialOptions()
/Users/tim/github/vitess/go/vt/grpcclient/client.go:89 +0xdd
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).connectionLocked()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:164 +0x68
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).Connection()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:144 +0xb5
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).stream()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:128 +0x44
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).checkConn()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:295 +0x6ee
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet.func2()
/Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0x44
Previous write at 0x000003e57f70 by goroutine 69:
vitess.io/vitess/go/vt/grpcclient.RegisterGRPCDialOptions()
/Users/tim/github/vitess/go/vt/grpcclient/client.go:89 +0x184
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).connectionLocked()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:164 +0x68
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).Connection()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:144 +0xb5
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).stream()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:128 +0x44
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).checkConn()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:295 +0x6ee
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet.func2()
/Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0x44
Goroutine 71 (running) created at:
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet()
/Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0xad5
vitess.io/vitess/go/vt/discovery.TestHealthCheckErrorOnPrimaryAfterExternalReparent()
/Users/tim/github/vitess/go/vt/discovery/healthcheck_test.go:358 +0x971
testing.tRunner()
/usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1595 +0x261
testing.(*T).Run.func1()
/usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1648 +0x44
Goroutine 69 (running) created at:
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet()
/Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0xad5
vitess.io/vitess/go/vt/discovery.TestHealthCheckErrorOnPrimaryAfterExternalReparent()
/Users/tim/github/vitess/go/vt/discovery/healthcheck_test.go:351 +0x5cc
testing.tRunner()
/usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1595 +0x261
testing.(*T).Run.func1()
/usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1648 +0x44
Signed-off-by: Tim Vaillancourt <[email protected]>
4bda1f6
to
a5316d7
Compare
Signed-off-by: Tim Vaillancourt <[email protected]>
This PR is slowing down the parallel vttablet connections being opened but still has the 10k limit on the thread count. I found that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look fine to me. An issue and adding release notes would be good.
@harshit-gangal thanks! vitessio/website#1696 adds this to the 3 x binaries that changed |
@harshit-gangal we considered raising the max threads limit but tried to avoid that in case it allowed a different problem to spiral out of control, also it doesn't scale as easily with growth unless it's cranked to a very high value We've been using the Related, we've noticed a blip of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. There is actually no way in the GRPC options to limit the amount of concurrent dials so the custom semaphore seems fine.
@timvaillancourt I don't think you need to update the website directly: you're supposed to add this to the release notes in this repository, as part of the PR, and the website will get updated once we release the new Vitess version. |
Thanks @vmg, I'll add to the release notes in this PR 👍 I think the |
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
@harshit-gangal / @vmg I believe the doc changes are ready 🙇 |
Thanks Tim! LGTM. I'm 99% sure the flags section in the website is auto-generated now. @frouioui will know for sure. Need +1 to merge @harshit-gangal |
Signed-off-by: Tim Vaillancourt <[email protected]>
@timvaillancourt now it needs a conflict resolution. |
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
…izations (#227) * Load `--grpc_auth_static_client_creds` file once (#15030) Signed-off-by: Tim Vaillancourt <[email protected]> * Filter by keyspace earlier in `tabletgateway`s `WaitForTablets(...)` (#15347) Signed-off-by: Tim Vaillancourt <[email protected]> * Limit concurrent creation of healthcheck gRPC connections (#15053) Signed-off-by: Tim Vaillancourt <[email protected]> * go mod tidy Signed-off-by: Tim Vaillancourt <[email protected]> * Update MySQL apt package and GPG signature (#14785) Signed-off-by: Matt Lord <[email protected]> * remove unrelated workflow files from v20 Signed-off-by: Tim Vaillancourt <[email protected]> --------- Signed-off-by: Tim Vaillancourt <[email protected]> Signed-off-by: Matt Lord <[email protected]> Co-authored-by: Matt Lord <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
* Make `Durabler` interface methods public (#15548) Signed-off-by: Tim Vaillancourt <[email protected]> Signed-off-by: Manan Gupta <[email protected]> Co-authored-by: Manan Gupta <[email protected]> * Load `--grpc_auth_static_client_creds` file once (#15030) Signed-off-by: Tim Vaillancourt <[email protected]> * Limit concurrent creation of healthcheck gRPC connections (#15053) Signed-off-by: Tim Vaillancourt <[email protected]> * Filter by keyspace earlier in `tabletgateway`s `WaitForTablets(...)` (#15347) Signed-off-by: Tim Vaillancourt <[email protected]> * Use slack-15.0 as previous release Signed-off-by: Tim Vaillancourt <[email protected]> * empty commit Signed-off-by: Tim Vaillancourt <[email protected]> * force ci to run * Update GH Action runners Signed-off-by: Tim Vaillancourt <[email protected]> * test templates Signed-off-by: Tim Vaillancourt <[email protected]> * set GH access token in build Signed-off-by: Tim Vaillancourt <[email protected]> * Fix reparent old tests Signed-off-by: Tim Vaillancourt <[email protected]> * Remove CIs we don't need Signed-off-by: Tim Vaillancourt <[email protected]> * Remove CIs we don't need again Signed-off-by: Tim Vaillancourt <[email protected]> * Add private repo setup to upgrade_downgrade_test_backups_e2e.yml Signed-off-by: Tim Vaillancourt <[email protected]> * Add private repo setup to more CI Signed-off-by: Tim Vaillancourt <[email protected]> * remove CI skip logic for upstream stuff Signed-off-by: Tim Vaillancourt <[email protected]> * CODEOWNERS Signed-off-by: Tim Vaillancourt <[email protected]> * [release-19.0] Add timeout to all the contexts used for RPC calls in vtorc (#15991) (#16103) Signed-off-by: Manan Gupta <[email protected]> * `slack-vitess-r15.0.5`: forward-port consul topo limits PR #111 (#297) * `slack-vitess-r14.0.5`: allow conn overrides in consul topo (#111) * `slack-vitess-r14.0.5`: allow conn overrides in consul topo Signed-off-by: Tim Vaillancourt <[email protected]> * fix e2e test Signed-off-by: Tim Vaillancourt <[email protected]> --------- Signed-off-by: Tim Vaillancourt <[email protected]> * Update flags tests that didn't exist in v14 Signed-off-by: Tim Vaillancourt <[email protected]> --------- Signed-off-by: Tim Vaillancourt <[email protected]> * update vtcombo e2e Signed-off-by: Tim Vaillancourt <[email protected]> * Fix err with installing percona-xtrabackup-24 Signed-off-by: Tim Vaillancourt <[email protected]> * `slack-vitess-r15.0.5`: fix races in `Unit Test (Race)` CI, fix "old" reparent CIs (#356) * update vtcombo e2e test Signed-off-by: Tim Vaillancourt <[email protected]> * Fix bad merge conflict fix Signed-off-by: Tim Vaillancourt <[email protected]> * go mod tidy Signed-off-by: Tim Vaillancourt <[email protected]> * update vtcombo e2e test again Signed-off-by: Tim Vaillancourt <[email protected]> * [release-19.0] Upgrade the Golang version to `go1.22.5` (#16322) Signed-off-by: GitHub <[email protected]> Signed-off-by: Florent Poinsard <[email protected]> Co-authored-by: frouioui <[email protected]> Co-authored-by: Florent Poinsard <[email protected]> * merge conflict fixes Signed-off-by: Tim Vaillancourt <[email protected]> * make vtadmin_web_proto_types Signed-off-by: Tim Vaillancourt <[email protected]> --------- Signed-off-by: Tim Vaillancourt <[email protected]> Signed-off-by: Manan Gupta <[email protected]> Signed-off-by: GitHub <[email protected]> Signed-off-by: Florent Poinsard <[email protected]> Co-authored-by: Manan Gupta <[email protected]> Co-authored-by: Manan Gupta <[email protected]> Co-authored-by: vitess-bot <[email protected]> Co-authored-by: frouioui <[email protected]> Co-authored-by: Florent Poinsard <[email protected]>
Description
This PR adds a concurrency limit for the number of gRPC connections created by the healthcheck handler of
vtgate
, in order to avoid hittingruntime: program exceeds 10000-thread limit
panicsRelated, in #15030 I was able to remove 1 x sycall/stream (for gRPC client creds), however this is not enough to avoid hitting panics - now we're seeing raw network syscalls as the main thread-usage offenders
Hot syscall # 1 (DNS due to new gRPC connection to tablet)
Hot syscall # 2 (same - gRPC connecting to tablet)
Hot syscall # 3
I lost this one unfortunately, but it was a
syscall.Open
for the client TCP connection openRelated Issue(s)
--grpc_auth_static_client_creds
file once #15030Checklist
Deployment Notes