Enable gRPC Server Side Keepalive settings #14939
Conversation
Signed-off-by: Manan Gupta <[email protected]>
Codecov Report

@@            Coverage Diff             @@
##             main   #14939      +/-   ##
==========================================
- Coverage   47.29%   47.28%   -0.01%
==========================================
  Files        1137     1137
  Lines      238684   238651      -33
==========================================
- Hits       112895   112858      -37
- Misses     117168   117180      +12
+ Partials     8621     8613       -8
I'm not entirely sure how to add unit tests for these changes. We don't have unit tests for verifying other gRPC server properties either.
Best suggestion I have is to take all the bits where we create the
Imho this is a case where coverage is a tool to use as a potential signal and not as a hard truth. I think we should avoid adding tests just for coverage where those tests don't really test anything useful. Adding a bunch of tests with basically no value just for coverage doesn't really help improve code quality.
Sorry it took me so long to get to this PR!
I think this code is perfectly safe, and overall enabling keepalives in the server will improve reliability for long-lived connections.
HOWEVER: reviewing the issue in #14760, I also don't think that these changes will help or improve the situation.
Here's a hunch I initially wrote:
The fact that the connection there is stuck in a write means that the issue is below the HTTP/2 protocol where these keepalives apply. I believe the stuck connections are being caused by the TCP connection state.
It appears we haven't been able to reproduce the issue reliably, but the best way forward is to continue trying: if we could catch one of these server processes stuck and look at the TCP state for its sockets in the kernel, I think that would give us a very big clue as to what's going on.
This is actually not correct, but I'm leaving this here for posterity. The server is not stuck in a write; it's actually stuck in flow control (which is not obvious from looking at the trace unless you actually go to the sources).
This means that the TCP connection probably closed cleanly. So what could cause flow control to become stuck? The GRPC API docs give us some good clues:
// CloseSend closes the send direction of the stream. It closes the stream
// when non-nil error is met. It is also not safe to call CloseSend
// concurrently with SendMsg.
CloseSend() error

// Context returns the context for this stream.
//
// It should not be called until after Header or RecvMsg has returned. Once
// called, subsequent client-side retries are disabled.
Context() context.Context

// SendMsg is generally called by generated code. On error, SendMsg aborts
// the stream. If the error was generated by the client, the status is
// returned directly; otherwise, io.EOF is returned and the status of
// the stream may be discovered using RecvMsg.
//
// SendMsg blocks until:
//   - There is sufficient flow control to schedule m with the transport, or
//   - The stream is done, or
//   - The stream breaks.
//
// SendMsg does not wait until the message is received by the server. An
// untimely stream closure may result in lost messages. To ensure delivery,
// users should ensure the RPC completed successfully using RecvMsg.
//
// It is safe to have a goroutine calling SendMsg and another goroutine
// calling RecvMsg on the same stream at the same time, but it is not safe
// to call SendMsg on the same stream in different goroutines. It is also
// not safe to call CloseSend concurrently with SendMsg.
//
// It is not safe to modify the message after calling SendMsg. Tracing
// libraries and stats handlers may use the message lazily.
SendMsg(m any) error
Look at the documentation of SendMsg. In our case, this is breaking the API contract. The stream is either done or broken, but SendMsg is not returning. I believe it's behaving that way because we left the flow control in an inconsistent state... by breaking the API contract ourselves. "It is not safe to call SendMsg on the same stream in different goroutines. It is also not safe to call CloseSend concurrently with SendMsg."
@GuptaManan100 after merging this PR, please start reviewing these two corner cases. That's where my money is: a concurrent race that is leaving the flow control in GRPC in a bad state. Considering this is happening during shutdown, I think it's likely we could be closing the GRPC connection while sending to it at the same time. :)
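To make the suspected corner case concrete, here is a small hypothetical sketch (this is not Vitess code; the function, channel, and variable names are invented for illustration) of the pattern the ClientStream documentation above warns about: a shutdown path calling CloseSend while another goroutine may still be inside SendMsg.

package example

import "google.golang.org/grpc"

// streamUpdates illustrates the suspected hazard only; it is NOT the actual
// Vitess code path under discussion.
func streamUpdates(stream grpc.ClientStream, updates <-chan any, shutdown <-chan struct{}) {
	go func() {
		<-shutdown
		// Unsafe: CloseSend may run while the loop below is inside SendMsg,
		// which the ClientStream documentation explicitly forbids.
		_ = stream.CloseSend()
	}()

	for msg := range updates {
		// If the concurrent CloseSend leaves flow-control accounting in an
		// inconsistent state, this call can block forever even though the
		// stream is done or broken.
		if err := stream.SendMsg(msg); err != nil {
			return
		}
	}
}

If the shutdown path in Vitess does something shaped like this, serializing CloseSend behind the sender goroutine (or signaling the sender to stop and letting it call CloseSend itself) would restore the contract.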
@vmg Thank you for looking at this! I'll follow up on the two corner cases next!!
Description
Keepalive configurations in gRPC are documented here: https://grpc.io/docs/guides/keepalive/#keepalive-configuration-specification, and they say that keepalives can be specified separately for the server and the client. The doc itself is not super clear on how to do this, but the Go example is pretty good.
Specifically, @deepthi and I had seen the flags in vttablet and vtgate that set the client-side keepalives. These default to 10 seconds for grpc_keepalive_time and grpc_keepalive_timeout. What this ensures is that the client sends the server a keepalive ping every 10 seconds, and if it doesn't get a response within 10 seconds, it closes the connection.

It looks like the problem is that we don't do the same for server connections! This is how we are configuring the keepalives in the gRPC servers -
Notice that no time or timeout fields are specified. The Go example from gRPC can be found here: https://github.com/grpc/grpc-go/blob/master/examples/features/keepalive/server/main.go (a minimal sketch along those lines appears at the end of this description).
This makes it so that if the client dies, the server doesn't stop running until the underlying connection has been terminated, which can take arbitrarily long depending on the connection being used. Moreover, we set the maximum connection age to the maximum possible integer value. A possible scenario for the issue that we are seeing could be as follows -
This PR would probably fix #14760. I haven't been able to reproduce the issue in any reliable way, so I can't say for sure. Regardless, it might still be a good idea to enable gRPC server-side keepalives.
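For reference, here is a minimal sketch of what enabling server-side keepalives with grpc-go can look like, modelled on the upstream example linked above. The durations and the listener address are illustrative assumptions, not the exact values this PR wires into Vitess.

package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Illustrative listener address, not a Vitess default.
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		panic(err)
	}

	server := grpc.NewServer(
		// Allow clients to ping as often as every 10s, even without active
		// streams, instead of closing the connection for pinging too often.
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             10 * time.Second,
			PermitWithoutStream: true,
		}),
		// Have the server itself ping idle clients every 10s and close the
		// connection if no ack arrives within 10s; these are the time and
		// timeout fields that the description notes were left unset.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    10 * time.Second,
			Timeout: 10 * time.Second,
		}),
	)

	// Register services here, then serve.
	_ = server.Serve(lis)
}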
Related Issue(s)
This PR potentially fixes #14760
Checklist
Deployment Notes