
Enable gRPC Server Side Keepalive settings #14939

Merged
2 commits merged into vitessio:main on Feb 10, 2024

Conversation

GuptaManan100
Member

@GuptaManan100 GuptaManan100 commented Jan 12, 2024

Description

Keepalive configuration in gRPC is documented at https://grpc.io/docs/guides/keepalive/#keepalive-configuration-specification, and it can be specified separately for the server and the client. The doc itself is not super clear on how to do this, but the Go example is pretty good.

Specifically, @deepthi and I had seen the flags in vttablet and vtgate that set the client-side keepalives. These default to 10 seconds for grpc_keepalive_time and grpc_keepalive_timeout. This ensures that the client sends a keepalive ping to the server every 10 seconds, and if it doesn't get an acknowledgement within 10 seconds, it closes the connection.

It looks like the problem is that we don't do the same for server connections! This is how we are configuring the keepalives in the gRPC servers:

ka := keepalive.ServerParameters{
	MaxConnectionAge:      gRPCMaxConnectionAge,
	MaxConnectionAgeGrace: gRPCMaxConnectionAgeGrace,
}
opts = append(opts, grpc.KeepaliveParams(ka))

Notice that no Time or Timeout fields are being specified. The Go server example from gRPC can be found at https://github.com/grpc/grpc-go/blob/master/examples/features/keepalive/server/main.go

This means that if the client dies, the server doesn't stop running until the underlying connection has been terminated, which can take arbitrarily long depending on the connection being used. Moreover, we set the maximum connection age to the maximum possible integer value. A possible scenario for the issue that we are seeing:

  1. A client connects to vtgate and issues a streaming RPC call.
  2. vtgate initiates a streaming RPC against vttablet.
  3. The client terminates abruptly, say due to an OOM kill (or its power source being pulled), such that it never closes the connection.
  4. vtgate is stuck writing to the client, which is now gone. Since vtgate (as a server) is not requesting keepalives, it takes very long (maybe forever?) for it to notice that the client isn't really reading.
  5. vttablet consequently gets blocked too, since vtgate, which is stuck trying to write, isn't reading any packets anymore. But since vttablet is not dead, the keepalives requested by vtgate (client) from vttablet (server) are still being answered, preventing the connection between them from being terminated.

This PR would probably fix #14760. I haven't been able to reproduce the issue in any reliable way, so I can't say for sure. Regardless, it is still a good idea to enable gRPC server-side keepalives.

Related Issue(s)

This PR potentially fixes #14760

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • New or modified tests passed consistently locally and on CI
  • Documentation was added or is not required

Deployment Notes

@GuptaManan100 GuptaManan100 added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: General Changes throughout the code base labels Jan 12, 2024
Contributor

vitess-bot bot commented Jan 12, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Jan 12, 2024
@github-actions github-actions bot added this to the v19.0.0 milestone Jan 12, 2024
@GuptaManan100 GuptaManan100 removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jan 12, 2024

codecov bot commented Jan 12, 2024

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison: base (eddb39e) 47.29% vs. head (b03c474) 47.28%.
The report is 1 commit behind head on main.

Files Patch % Lines
go/vt/servenv/grpc_server.go 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14939      +/-   ##
==========================================
- Coverage   47.29%   47.28%   -0.01%     
==========================================
  Files        1137     1137              
  Lines      238684   238651      -33     
==========================================
- Hits       112895   112858      -37     
- Misses     117168   117180      +12     
+ Partials     8621     8613       -8     


@GuptaManan100
Member Author

I'm not entirely sure how to add unit tests for these changes. We don't have unit tests for verifying other gRPC server properties either.

@ajm188
Contributor

ajm188 commented Jan 12, 2024

I'm not entirely sure how to add unit tests for these changes. We don't have unit tests for verifying other gRPC server properties either.

Best suggestion I have is to take all the bits where we create the opts slice in createGRPCServer to their own grpcOptions() func (similar to how we extracted interceptors() to its own function) and then unit test just that bit

@dbussink
Contributor

Best suggestion I have is to take all the bits where we create the opts slice in createGRPCServer to their own grpcOptions() func (similar to how we extracted interceptors() to its own function) and then unit test just that bit

IMHO this is a case where coverage is a tool to use as a potential signal, not as a hard truth. I think we should avoid adding tests purely for coverage when those tests don't really test anything useful. Adding a bunch of tests with basically no value doesn't help improve code quality.

@frouioui frouioui modified the milestones: v19.0.0, v20.0.0 Feb 6, 2024
Collaborator

@vmg vmg left a comment

Sorry it took me so long to get to this PR!

I think this code is perfectly safe and overall enabling keepalives in the server will improve reliability for long lived connections.

HOWEVER: reviewing the issue in #14760, I also don't think that this change will help or improve the situation.

Here's a hunch I initially wrote:

The fact that the connection there is stuck in a write means that the issue is below the HTTP/2 protocol where these keepalives apply. I believe the stuck connections are being caused by the TCP connection state.

It appears we haven't been able to reproduce the issue reliably, but the best way forward is to keep trying: if we could catch one of these server processes while it is stuck and look at the TCP state of its sockets in the kernel, that would give us a very big clue as to what's going on.

This is actually not correct, but I'm leaving it here for posterity. The server is not stuck in a write; it's actually stuck in flow control (which is not obvious from looking at the trace unless you actually go to the sources).

This means that the TCP connection probably closed cleanly. So what could cause flow control to become stuck? The gRPC API docs give us some good clues:

	// CloseSend closes the send direction of the stream. It closes the stream
	// when non-nil error is met. It is also not safe to call CloseSend
	// concurrently with SendMsg.
	CloseSend() error
	// Context returns the context for this stream.
	//
	// It should not be called until after Header or RecvMsg has returned. Once
	// called, subsequent client-side retries are disabled.
	Context() context.Context
	// SendMsg is generally called by generated code. On error, SendMsg aborts
	// the stream. If the error was generated by the client, the status is
	// returned directly; otherwise, io.EOF is returned and the status of
	// the stream may be discovered using RecvMsg.
	//
	// SendMsg blocks until:
	//   - There is sufficient flow control to schedule m with the transport, or
	//   - The stream is done, or
	//   - The stream breaks.
	//
	// SendMsg does not wait until the message is received by the server. An
	// untimely stream closure may result in lost messages. To ensure delivery,
	// users should ensure the RPC completed successfully using RecvMsg.
	//
	// It is safe to have a goroutine calling SendMsg and another goroutine
	// calling RecvMsg on the same stream at the same time, but it is not safe
	// to call SendMsg on the same stream in different goroutines. It is also
	// not safe to call CloseSend concurrently with SendMsg.
	//
	// It is not safe to modify the message after calling SendMsg. Tracing
	// libraries and stats handlers may use the message lazily.
	SendMsg(m any) error

Look at the documentation of SendMsg. In our case, something is breaking the API contract: the stream is either done or broken, but SendMsg is not returning. I believe it's behaving that way because we left the flow control in an inconsistent state... by breaking the API contract ourselves: "It is not safe to call SendMsg on the same stream in different goroutines. It is also not safe to call CloseSend concurrently with SendMsg."

@GuptaManan100 after merging this PR, please start reviewing these two corner cases. That's where my money is: a concurrent race that is leaving the flow control in GRPC in a bad state. Considering this is happening during shutdown, I think it's likely we could be closing the GRPC connection while sending to it at the same time. :)
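A sketch of the kind of guard the SendMsg contract implies. The sender interface below is a stand-in for the relevant subset of a gRPC client stream, and fakeStream is a trivial in-memory implementation used only to exercise the guard; serializing Send/Close with a mutex is one possible fix for the suspected race, not necessarily what Vitess ends up doing.

```go
package main

import (
	"fmt"
	"sync"
)

// sender stands in for the subset of a gRPC client stream relevant here:
// the two methods the docs say must never run concurrently.
type sender interface {
	SendMsg(m any) error
	CloseSend() error
}

// safeStream serializes SendMsg and CloseSend on the wrapped stream, so a
// shutdown path calling Close can never race an in-flight Send.
type safeStream struct {
	mu     sync.Mutex
	s      sender
	closed bool
}

func (ss *safeStream) Send(m any) error {
	ss.mu.Lock()
	defer ss.mu.Unlock()
	if ss.closed {
		return fmt.Errorf("stream already closed")
	}
	return ss.s.SendMsg(m)
}

func (ss *safeStream) Close() error {
	ss.mu.Lock()
	defer ss.mu.Unlock()
	ss.closed = true
	return ss.s.CloseSend()
}

// fakeStream is a trivial in-memory sender used to exercise the guard.
type fakeStream struct{ sent int }

func (f *fakeStream) SendMsg(m any) error { f.sent++; return nil }
func (f *fakeStream) CloseSend() error    { return nil }

func main() {
	ss := &safeStream{s: &fakeStream{}}
	_ = ss.Send("row")
	_ = ss.Close()
	fmt.Println(ss.Send("row")) // a Send after Close now fails cleanly
}
```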

@GuptaManan100
Copy link
Member Author

@vmg Thank you for looking at this! I'll follow up on the two corner cases next!

@deepthi deepthi merged commit 9df1763 into vitessio:main Feb 10, 2024
102 of 107 checks passed
@deepthi deepthi deleted the stream-execute-stuck branch February 10, 2024 04:33
Labels
Component: General Changes throughout the code base Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug Report: Demote Primary stuck
6 participants