Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit concurrent creation of healthcheck gRPC connections #15053

Merged
merged 7 commits into from
Feb 28, 2024

Conversation

timvaillancourt
Copy link
Contributor

@timvaillancourt timvaillancourt commented Jan 27, 2024

Description

This PR adds a concurrency limit for the number of gRPC connections created by the healthcheck handler of vtgate, in order to avoid hitting runtime: program exceeds 10000-thread limit panics

Related, in #15030 I was able to remove 1 x sycall/stream (for gRPC client creds), however this is not enough to avoid hitting panics - now we're seeing raw network syscalls as the main thread-usage offenders

Hot syscall # 1 (DNS due to new gRPC connection to tablet)

(dlv) stack
 0  0x000000000047f120 in syscall.Syscall6
    at syscall/asm_linux_amd64.s:43
 1  0x000000000047e66c in syscall.setsockopt
    at syscall/zsyscall_linux_amd64.go:1481
 2  0x00000000005ecf33 in syscall.SetsockoptInt
    at syscall/syscall_unix.go:427
 3  0x00000000005ecf33 in net.setDefaultSockopts
    at net/sockopt_linux.go:21
 4  0x00000000005eb679 in net.socket
    at net/sock_posix.go:23
 5  0x00000000005e08f8 in net.internetSocket
    at net/ipsock_posix.go:142
 6  0x00000000005f30fd in net.(*sysDialer).dialUDP
    at net/udpsock_posix.go:206
 7  0x00000000005ca0d4 in net.(*sysDialer).dialSingle
    at net/dial.go:586
 8  0x00000000005c99b2 in net.(*sysDialer).dialSerial
    at net/dial.go:551
 9  0x00000000005c8836 in net.(*Dialer).DialContext
    at net/dial.go:428
10  0x00000000005e4485 in net.(*Resolver).dial
    at net/lookup_unix.go:70
11  0x00000000005cd905 in net.(*Resolver).exchange
    at net/dnsclient_unix.go:160
12  0x00000000005cec45 in net.(*Resolver).tryOneName
    at net/dnsclient_unix.go:260
13  0x00000000005d20c5 in net.(*Resolver).goLookupIPCNAMEOrder.func3.1
    at net/dnsclient_unix.go:612
14  0x00000000005d200a in net.(*Resolver).goLookupIPCNAMEOrder.func3.2
    at net/dnsclient_unix.go:615
15  0x00000000004692c1 in runtime.goexit
    at runtime/asm_amd64.s:1571

Hot syscall # 2 (same - gRPC connecting to tablet)

(dlv) stack
 0  0x000000000047f120 in syscall.Syscall6
    at syscall/asm_linux_amd64.s:43
 1  0x000000000047e40f in syscall.fstatat
    at syscall/zsyscall_linux_amd64.go:1450
 2  0x00000000004f9658 in syscall.Stat
    at syscall/syscall_linux_amd64.go:60
 3  0x00000000004f9658 in os.statNolog.func1
    at os/stat_unix.go:32
 4  0x00000000004f9658 in os.ignoringEINTR
    at os/file_posix.go:245
 5  0x00000000004f9658 in os.statNolog
    at os/stat_unix.go:31
 6  0x00000000004f9194 in os.Stat
    at os/stat.go:13
 7  0x00000000005e9425 in net.stat
    at net/parse.go:76
 8  0x00000000005d69c5 in net.readHosts
    at net/hosts.go:59
 9  0x00000000005d71a6 in net.lookupStaticHost
    at net/hosts.go:107
10  0x00000000005d0773 in net.goLookupIPFiles
    at net/dnsclient_unix.go:551
11  0x00000000005d0c54 in net.(*Resolver).goLookupIPCNAMEOrder
    at net/dnsclient_unix.go:572
12  0x00000000005e47da in net.(*Resolver).lookupIP
    at net/lookup_unix.go:102
13  0x00000000005f945b in net.(*Resolver).lookupIP-fm
    at <autogenerated>:1
14  0x00000000005d673d in net.glob..func1
    at net/hook.go:23
15  0x00000000005e2e9f in net.(*Resolver).lookupIPAddr.func1
    at net/lookup.go:319
16  0x00000000005c391b in internal/singleflight.(*Group).doCall
    at internal/singleflight/singleflight.go:95
17  0x00000000005c38b6 in internal/single flight.(*Group).DoChan.func1
    at internal/singleflight/singleflight.go:88
18  0x00000000004692c1 in runtime.goexit
    at runtime/asm_amd64.s:1571

Hot syscall # 3

I lost this one unfortunately, but it was a syscall.Open for the client TCP connection open

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Copy link
Contributor

vitess-bot bot commented Jan 27, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Jan 27, 2024
@github-actions github-actions bot added this to the v19.0.0 milestone Jan 27, 2024
@timvaillancourt timvaillancourt changed the title Healthcheck init concurrency Limit concurrent gRPC connections from healthcheck Jan 27, 2024
@timvaillancourt timvaillancourt changed the title Limit concurrent gRPC connections from healthcheck Limit concurrent gRPC connections for healthchecks Jan 27, 2024
@timvaillancourt timvaillancourt changed the title Limit concurrent gRPC connections for healthchecks Limit concurrent creation of healthcheck gRPC connections Jan 27, 2024
@timvaillancourt
Copy link
Contributor Author

timvaillancourt commented Jan 27, 2024

On 2nd thought, --healthcheck-dial-concurrency is probably a more accurate flag name. Feedback appreciated 🤔

Copy link

codecov bot commented Jan 27, 2024

Codecov Report

Attention: Patch coverage is 54.16667% with 11 lines in your changes are missing coverage. Please review.

Project coverage is 65.44%. Comparing base (696fe0e) to head (3016c96).
Report is 64 commits behind head on main.

Files Patch % Lines
go/vt/discovery/tablet_health_check.go 47.05% 9 Missing ⚠️
go/vt/discovery/healthcheck.go 33.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15053      +/-   ##
==========================================
- Coverage   67.41%   65.44%   -1.98%     
==========================================
  Files        1560     1562       +2     
  Lines      192752   193677     +925     
==========================================
- Hits       129952   126751    -3201     
- Misses      62800    66926    +4126     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@harshit-gangal
Copy link
Member

should a VTGate be cell-bound then to connect to ~10k vttablets?

@frouioui frouioui modified the milestones: v19.0.0, v20.0.0 Feb 6, 2024
@timvaillancourt
Copy link
Contributor Author

timvaillancourt commented Feb 8, 2024

should a VTGate be cell-bound then to connect to ~10k vttablets?

@harshit-gangal the vtgates we're seeing crash do have a --cells_to_watch, however this is set to every potential cell the PRIMARY could failover to, as these pools require PRIMARY operations

Wherever possible our vtgates subscribe to the minimum-necessary --keyspaces_to_watch, however on the vtgates that crash this list is either very large, or in one case not set at all for a prod-debugging-tool use case. Perhaps with some changes to our architecture this could be improved, but not easily

@timvaillancourt timvaillancourt force-pushed the healthcheck-init-concurrency branch 2 times, most recently from 39e99ba to 223967b Compare February 9, 2024 17:54
@timvaillancourt timvaillancourt marked this pull request as ready for review February 9, 2024 17:56
}

func (thc *tabletHealthCheck) connectionLocked() queryservice.QueryService {
func healthCheckDialerFactory(hc *HealthCheckImpl) func(ctx context.Context, addr string) (net.Conn, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If other parts of the code could benefit from a similar limit, perhaps this should be moved to a helper/util library or baked into grpcclient? 🤔 cc @deepthi (and anyone else) for thoughts

@timvaillancourt timvaillancourt force-pushed the healthcheck-init-concurrency branch from 66555dc to fdae21d Compare February 10, 2024 00:46
@@ -39,6 +40,7 @@ import (
)

var (
grpcOptionsMu sync.Mutex
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to fix data race:

WARNING: DATA RACE
Read at 0x000003e57f70 by goroutine 71:
  vitess.io/vitess/go/vt/grpcclient.RegisterGRPCDialOptions()
      /Users/tim/github/vitess/go/vt/grpcclient/client.go:89 +0xdd
  vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).connectionLocked()
      /Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:164 +0x68
  vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).Connection()
      /Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:144 +0xb5
  vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).stream()
      /Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:128 +0x44
  vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).checkConn()
      /Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:295 +0x6ee
  vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet.func2()
      /Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0x44

Previous write at 0x000003e57f70 by goroutine 69:
  vitess.io/vitess/go/vt/grpcclient.RegisterGRPCDialOptions()
      /Users/tim/github/vitess/go/vt/grpcclient/client.go:89 +0x184
  vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).connectionLocked()
      /Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:164 +0x68
  vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).Connection()
      /Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:144 +0xb5
  vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).stream()
      /Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:128 +0x44
  vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).checkConn()
      /Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:295 +0x6ee
  vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet.func2()
      /Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0x44

Goroutine 71 (running) created at:
  vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet()
      /Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0xad5
  vitess.io/vitess/go/vt/discovery.TestHealthCheckErrorOnPrimaryAfterExternalReparent()
      /Users/tim/github/vitess/go/vt/discovery/healthcheck_test.go:358 +0x971
  testing.tRunner()
      /usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1595 +0x261
  testing.(*T).Run.func1()
      /usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1648 +0x44

Goroutine 69 (running) created at:
  vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet()
      /Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0xad5
  vitess.io/vitess/go/vt/discovery.TestHealthCheckErrorOnPrimaryAfterExternalReparent()
      /Users/tim/github/vitess/go/vt/discovery/healthcheck_test.go:351 +0x5cc
  testing.tRunner()
      /usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1595 +0x261
  testing.(*T).Run.func1()
      /usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1648 +0x44

@timvaillancourt timvaillancourt force-pushed the healthcheck-init-concurrency branch from 4bda1f6 to a5316d7 Compare February 11, 2024 00:50
timvaillancourt referenced this pull request in slackhq/vitess Feb 12, 2024
@harshit-gangal harshit-gangal added release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) and removed NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Feb 23, 2024
@harshit-gangal
Copy link
Member

This PR is slowing down the parallel vttablet connections being opened but still has the 10k limit on the thread count.

I found that runtime/debug has SetMaxThreads which can be used to modify the limit
https://pkg.go.dev/runtime/debug#SetMaxThreads

Copy link
Member

@harshit-gangal harshit-gangal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look fine to me. An issue and adding release notes would be good.

@timvaillancourt
Copy link
Contributor Author

Changes look fine to me. An issue and adding release notes would be good.

@harshit-gangal thanks! vitessio/website#1696 adds this to the 3 x binaries that changed

@timvaillancourt
Copy link
Contributor Author

timvaillancourt commented Feb 23, 2024

I found that runtime/debug has SetMaxThreads which can be used to modify the limit
https://pkg.go.dev/runtime/debug#SetMaxThreads

@harshit-gangal we considered raising the max threads limit but tried to avoid that in case it allowed a different problem to spiral out of control, also it doesn't scale as easily with growth unless it's cranked to a very high value

We've been using the 1024 default from the PR on production gates opening 12,000+ healthcheck streams, but it's possible that we may revisit that number (maybe 2048-4096 at most)

Related, we've noticed a blip of no healthy tablets for... errors during vtgate rollout for shards that are seemingly healthy. I wanted to blame this PR but my dive through the logic that waits for tablets to healthcheck lead me to believe this patch wouldn't cause that issue and I'm maybe just paranoid of my recent change, unless we're still allowing query-serving RPCs before the waiting-for-tablets (I don't think so). If there is a race or bug in that logic, this PR could make matters worse, however. But any 💡s what might cause this is appreciated 🙇

Copy link
Collaborator

@vmg vmg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. There is actually no way in the GRPC options to limit the amount of concurrent dials so the custom semaphore seems fine.

@vmg vmg removed release notes (needs details) This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...) NeedsIssue A linked issue is missing for this Pull Request labels Feb 26, 2024
@vmg
Copy link
Collaborator

vmg commented Feb 26, 2024

@harshit-gangal thanks! vitessio/website#1696 adds this to the 3 x binaries that changed

@timvaillancourt I don't think you need to update the website directly: you're supposed to add this to the release notes in this repository, as part of the PR, and the website will get updated once we release the new Vitess version.

@timvaillancourt
Copy link
Contributor Author

@timvaillancourt I don't think you need to update the website directly: you're supposed to add this to the release notes in this repository, as part of the PR, and the website will get updated once we release the new Vitess version.

Thanks @vmg, I'll add to the release notes in this PR 👍

I think the vitessio/vitess-website PR I made is still need because a new cli flag is added, or at least I've been asked to make one when this is the case in the past. The format isn't markdown anymore, however. Did that all get automated?

@timvaillancourt
Copy link
Contributor Author

@harshit-gangal / @vmg I believe the doc changes are ready 🙇

@vmg
Copy link
Collaborator

vmg commented Feb 27, 2024

Thanks Tim! LGTM. I'm 99% sure the flags section in the website is auto-generated now. @frouioui will know for sure.

Need +1 to merge @harshit-gangal

@harshit-gangal
Copy link
Member

@timvaillancourt now it needs a conflict resolution.

@vmg vmg merged commit f4debb9 into vitessio:main Feb 28, 2024
102 checks passed
@timvaillancourt timvaillancourt deleted the healthcheck-init-concurrency branch February 28, 2024 14:18
timvaillancourt referenced this pull request in slackhq/vitess Feb 28, 2024
timvaillancourt referenced this pull request in slackhq/vitess Mar 15, 2024
…izations (#227)

* Load `--grpc_auth_static_client_creds` file once (#15030)

Signed-off-by: Tim Vaillancourt <[email protected]>

* Filter by keyspace earlier in `tabletgateway`s `WaitForTablets(...)` (#15347)

Signed-off-by: Tim Vaillancourt <[email protected]>

* Limit concurrent creation of healthcheck gRPC connections (#15053)

Signed-off-by: Tim Vaillancourt <[email protected]>

* go mod tidy

Signed-off-by: Tim Vaillancourt <[email protected]>

* Update MySQL apt package and GPG signature (#14785)

Signed-off-by: Matt Lord <[email protected]>

* remove unrelated workflow files from v20

Signed-off-by: Tim Vaillancourt <[email protected]>

---------

Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Co-authored-by: Matt Lord <[email protected]>
timvaillancourt referenced this pull request in slackhq/vitess May 30, 2024
timvaillancourt referenced this pull request in slackhq/vitess Jul 9, 2024
* Make `Durabler` interface methods public (#15548)

Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Co-authored-by: Manan Gupta <[email protected]>

* Load `--grpc_auth_static_client_creds` file once (#15030)

Signed-off-by: Tim Vaillancourt <[email protected]>

* Limit concurrent creation of healthcheck gRPC connections (#15053)

Signed-off-by: Tim Vaillancourt <[email protected]>

* Filter by keyspace earlier in `tabletgateway`s `WaitForTablets(...)` (#15347)

Signed-off-by: Tim Vaillancourt <[email protected]>

* Use slack-15.0 as previous release

Signed-off-by: Tim Vaillancourt <[email protected]>

* empty commit

Signed-off-by: Tim Vaillancourt <[email protected]>

* force ci to run

* Update GH Action runners

Signed-off-by: Tim Vaillancourt <[email protected]>

* test templates

Signed-off-by: Tim Vaillancourt <[email protected]>

* set GH access token in build

Signed-off-by: Tim Vaillancourt <[email protected]>

* Fix reparent old tests

Signed-off-by: Tim Vaillancourt <[email protected]>

* Remove CIs we don't need

Signed-off-by: Tim Vaillancourt <[email protected]>

* Remove CIs we don't need again

Signed-off-by: Tim Vaillancourt <[email protected]>

* Add private repo setup to upgrade_downgrade_test_backups_e2e.yml

Signed-off-by: Tim Vaillancourt <[email protected]>

* Add private repo setup to more CI

Signed-off-by: Tim Vaillancourt <[email protected]>

* remove CI skip logic for upstream stuff

Signed-off-by: Tim Vaillancourt <[email protected]>

* CODEOWNERS

Signed-off-by: Tim Vaillancourt <[email protected]>

* [release-19.0] Add timeout to all the contexts used for RPC calls in vtorc (#15991) (#16103)

Signed-off-by: Manan Gupta <[email protected]>

* `slack-vitess-r15.0.5`: forward-port consul topo limits PR #111 (#297)

* `slack-vitess-r14.0.5`: allow conn overrides in consul topo (#111)

* `slack-vitess-r14.0.5`: allow conn overrides in consul topo

Signed-off-by: Tim Vaillancourt <[email protected]>

* fix e2e test

Signed-off-by: Tim Vaillancourt <[email protected]>

---------

Signed-off-by: Tim Vaillancourt <[email protected]>

* Update flags tests that didn't exist in v14

Signed-off-by: Tim Vaillancourt <[email protected]>

---------

Signed-off-by: Tim Vaillancourt <[email protected]>

* update vtcombo e2e

Signed-off-by: Tim Vaillancourt <[email protected]>

* Fix err with installing percona-xtrabackup-24

Signed-off-by: Tim Vaillancourt <[email protected]>

* `slack-vitess-r15.0.5`: fix races in `Unit Test (Race)` CI, fix "old" reparent CIs (#356)

* update vtcombo e2e test

Signed-off-by: Tim Vaillancourt <[email protected]>

* Fix bad merge conflict fix

Signed-off-by: Tim Vaillancourt <[email protected]>

* go mod tidy

Signed-off-by: Tim Vaillancourt <[email protected]>

* update vtcombo e2e test again

Signed-off-by: Tim Vaillancourt <[email protected]>

* [release-19.0] Upgrade the Golang version to `go1.22.5` (#16322)

Signed-off-by: GitHub <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Co-authored-by: frouioui <[email protected]>
Co-authored-by: Florent Poinsard <[email protected]>

* merge conflict fixes

Signed-off-by: Tim Vaillancourt <[email protected]>

* make vtadmin_web_proto_types

Signed-off-by: Tim Vaillancourt <[email protected]>

---------

Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: GitHub <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Co-authored-by: Manan Gupta <[email protected]>
Co-authored-by: Manan Gupta <[email protected]>
Co-authored-by: vitess-bot <[email protected]>
Co-authored-by: frouioui <[email protected]>
Co-authored-by: Florent Poinsard <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Query Serving Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants