
Enable gRPC Server Side Keepalive settings #14939

Merged
2 commits merged into vitessio:main on Feb 10, 2024

Conversation

GuptaManan100
Member

@GuptaManan100 GuptaManan100 commented Jan 12, 2024

Description

Keepalive configuration in gRPC is documented at https://grpc.io/docs/guides/keepalive/#keepalive-configuration-specification, and it can be specified separately for the server and the client. The doc itself is not super clear on how to do this, but the Go example is pretty good.

Specifically, @deepthi and I had seen the flags in vttablet and vtgate that set the client-side keepalives. These default to 10 seconds for grpc_keepalive_time and grpc_keepalive_timeout. This ensures that the client sends a keepalive ping to the server every 10 seconds, and if it doesn't get an acknowledgement within 10 seconds, it closes the connection.

It looks like the problem is that we don't do the same for server connections! This is how we are configuring the keepalives in the gRPC servers:

ka := keepalive.ServerParameters{
	MaxConnectionAge:      gRPCMaxConnectionAge,
	MaxConnectionAgeGrace: gRPCMaxConnectionAgeGrace,
}
opts = append(opts, grpc.KeepaliveParams(ka))

Notice that no Time or Timeout fields are being specified. The Go server example from gRPC can be found at https://github.com/grpc/grpc-go/blob/master/examples/features/keepalive/server/main.go

This means that if the client dies, the server doesn't stop running until the underlying connection has been terminated, which can take arbitrarily long depending on the connection being used. Moreover, we set the maximum connection age to the maximum possible integer value. A possible scenario for the issue that we are seeing:

  1. A client connects to vtgate and issues a streaming RPC call.
  2. vtgate initiates a streaming RPC against vttablet.
  3. The client terminates abruptly, say due to an OOM kill (or its power source being pulled), such that it never closes the connection.
  4. vtgate is stuck writing to the client, which is now gone. Since vtgate (as a server) is not requesting keepalives, it takes very long (maybe forever?) for it to notice that the client isn't really reading.
  5. vttablet consequently gets blocked too, since vtgate, which is stuck trying to write, isn't reading any packets anymore. But since vttablet is not dead, the keepalives requested by vtgate (client) from vttablet (server) are still being answered, preventing the connection between them from being terminated.

This PR would probably fix #14760. I haven't been able to reproduce the issue in any reliable way, so I can't say for sure. Regardless, it is still a good idea to enable gRPC server-side keepalives.

Related Issue(s)

This PR potentially fixes #14760

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • New or modified tests passed consistently locally and on CI
  • Documentation was added or is not required

Deployment Notes

@GuptaManan100 GuptaManan100 added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: General Changes throughout the code base labels Jan 12, 2024
Contributor

vitess-bot bot commented Jan 12, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Jan 12, 2024
@github-actions github-actions bot added this to the v19.0.0 milestone Jan 12, 2024
@GuptaManan100 GuptaManan100 removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jan 12, 2024

codecov bot commented Jan 12, 2024

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison: base (eddb39e) 47.29% vs. head (b03c474) 47.28%.
The report is 1 commit behind head on main.

Files Patch % Lines
go/vt/servenv/grpc_server.go 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14939      +/-   ##
==========================================
- Coverage   47.29%   47.28%   -0.01%     
==========================================
  Files        1137     1137              
  Lines      238684   238651      -33     
==========================================
- Hits       112895   112858      -37     
- Misses     117168   117180      +12     
+ Partials     8621     8613       -8     


@GuptaManan100
Member Author

I'm not entirely sure how to add unit tests for these changes. We don't have unit tests for verifying other gRPC server properties either.

@ajm188
Contributor

ajm188 commented Jan 12, 2024

I'm not entirely sure how to add unit tests for these changes. We don't have unit tests for verifying other gRPC server properties either.

Best suggestion I have is to take all the bits where we create the opts slice in createGRPCServer to their own grpcOptions() func (similar to how we extracted interceptors() to its own function) and then unit test just that bit

@dbussink
Contributor

Best suggestion I have is to take all the bits where we create the opts slice in createGRPCServer to their own grpcOptions() func (similar to how we extracted interceptors() to its own function) and then unit test just that bit

IMHO this is a case where coverage is a tool to use as a potential signal, not as a hard truth. I think we should avoid adding tests purely for coverage when those tests don't really test anything useful. Adding a bunch of tests with basically no value doesn't help improve code quality.

@frouioui frouioui modified the milestones: v19.0.0, v20.0.0 Feb 6, 2024
Collaborator

@vmg vmg left a comment

Sorry it took me so long to get to this PR!

I think this code is perfectly safe and overall enabling keepalives in the server will improve reliability for long lived connections.

HOWEVER: reviewing the issue in #14760, I also don't think that this change will help or improve the situation.

Here's a hunch I initially wrote:

The fact that the connection there is stuck in a write means that the issue is below the HTTP/2 protocol where these keepalives apply. I believe the stuck connections are being caused by the TCP connection state.

It appears we haven't been able to reproduce the issue reliably, but the best way forward is to keep trying: if we could catch one of these server processes while it is stuck and look at the TCP state of its sockets in the kernel, that would give us a very big clue as to what's going on.

This is actually not correct, but I'm leaving it here for posterity. The server is not stuck in a write; it's actually stuck in flow control (which is not obvious from looking at the trace unless you actually go to the sources).

This means that the TCP connection probably closed cleanly. So what could cause flow control to become stuck? The gRPC API docs give us some good clues:

	// CloseSend closes the send direction of the stream. It closes the stream
	// when non-nil error is met. It is also not safe to call CloseSend
	// concurrently with SendMsg.
	CloseSend() error
	// Context returns the context for this stream.
	//
	// It should not be called until after Header or RecvMsg has returned. Once
	// called, subsequent client-side retries are disabled.
	Context() context.Context
	// SendMsg is generally called by generated code. On error, SendMsg aborts
	// the stream. If the error was generated by the client, the status is
	// returned directly; otherwise, io.EOF is returned and the status of
	// the stream may be discovered using RecvMsg.
	//
	// SendMsg blocks until:
	//   - There is sufficient flow control to schedule m with the transport, or
	//   - The stream is done, or
	//   - The stream breaks.
	//
	// SendMsg does not wait until the message is received by the server. An
	// untimely stream closure may result in lost messages. To ensure delivery,
	// users should ensure the RPC completed successfully using RecvMsg.
	//
	// It is safe to have a goroutine calling SendMsg and another goroutine
	// calling RecvMsg on the same stream at the same time, but it is not safe
	// to call SendMsg on the same stream in different goroutines. It is also
	// not safe to call CloseSend concurrently with SendMsg.
	//
	// It is not safe to modify the message after calling SendMsg. Tracing
	// libraries and stats handlers may use the message lazily.
	SendMsg(m any) error

Look at the documentation of SendMsg. In our case, something is breaking the API contract: the stream is either done or broken, but SendMsg is not returning. I believe it's behaving that way because we left the flow control in an inconsistent state... by breaking the API contract ourselves: "It is not safe to call SendMsg on the same stream in different goroutines. It is also not safe to call CloseSend concurrently with SendMsg."

@GuptaManan100 after merging this PR, please start reviewing these two corner cases. That's where my money is: a concurrent race that is leaving the flow control in GRPC in a bad state. Considering this is happening during shutdown, I think it's likely we could be closing the GRPC connection while sending to it at the same time. :)
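A sketch of the kind of guard the SendMsg contract implies. The sender interface below is a stand-in for the relevant subset of a gRPC client stream, and fakeStream is a trivial in-memory implementation used only to exercise the guard; serializing Send/Close with a mutex is one possible fix for the suspected race, not necessarily what Vitess ends up doing.

```go
package main

import (
	"fmt"
	"sync"
)

// sender stands in for the subset of a gRPC client stream relevant here:
// the two methods the docs say must never run concurrently.
type sender interface {
	SendMsg(m any) error
	CloseSend() error
}

// safeStream serializes SendMsg and CloseSend on the wrapped stream, so a
// shutdown path calling Close can never race an in-flight Send.
type safeStream struct {
	mu     sync.Mutex
	s      sender
	closed bool
}

func (ss *safeStream) Send(m any) error {
	ss.mu.Lock()
	defer ss.mu.Unlock()
	if ss.closed {
		return fmt.Errorf("stream already closed")
	}
	return ss.s.SendMsg(m)
}

func (ss *safeStream) Close() error {
	ss.mu.Lock()
	defer ss.mu.Unlock()
	ss.closed = true
	return ss.s.CloseSend()
}

// fakeStream is a trivial in-memory sender used to exercise the guard.
type fakeStream struct{ sent int }

func (f *fakeStream) SendMsg(m any) error { f.sent++; return nil }
func (f *fakeStream) CloseSend() error    { return nil }

func main() {
	ss := &safeStream{s: &fakeStream{}}
	_ = ss.Send("row")
	_ = ss.Close()
	fmt.Println(ss.Send("row")) // a Send after Close now fails cleanly
}
```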

@GuptaManan100
Copy link
Member Author

@vmg Thank you for looking at this! I'll follow up on the two corner cases next!

@deepthi deepthi merged commit 9df1763 into vitessio:main Feb 10, 2024
102 of 107 checks passed
@deepthi deepthi deleted the stream-execute-stuck branch February 10, 2024 04:33
Labels
Component: General Changes throughout the code base Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug Report: Demote Primary stuck
6 participants