Streaming pull subscriber does not notify of loss of network #1764
Comments
Actually, more extensive testing shows that after 15 minutes there are 'debug' events generated, but even after waiting more than an hour there is no 'error' or 'close' event, so there is still no clear indication of a problem. Is there a proper way to detect this scenario that I am not using? Edit: I think the 15 minutes is just the default behavior of gRPC and/or pubsub for restarting HTTP/2 streams, so it is not directly related to the loss of network.
Sorry, fumbly fingers on the checkboxes :D
Here is what I've found so far: pubsub uses gRPC, which uses Node's HTTP2 API. Update: the default modern Linux TCP ACK timeout is about ~16 minutes after retries. Only then will write() indicate an issue, unless TCP_USER_TIMEOUT is set on the socket. See here for some details. So when gRPC ultimately calls […]. If I pass […]. However, it seems the most viable (somewhat hack-ish) way to bubble that up to pubsub is code like this:
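The snippet itself isn't shown above, but a rough sketch along those lines — assuming the subscriber's underlying @grpc/grpc-js Channel can be reached through library internals (undocumented and version-dependent) — might look like:

```js
// Sketch only: `channel` is assumed to be the underlying @grpc/grpc-js
// Channel dug out of the Pub/Sub subscriber's internals (undocumented).
const { connectivityState } = require('@grpc/grpc-js');

function watchChannel(channel) {
  const current = channel.getConnectivityState(false);
  // Wait up to 24h for the state to change away from `current`.
  const deadline = Date.now() + 24 * 60 * 60 * 1000;
  channel.watchConnectivityState(current, deadline, (err) => {
    if (err) {
      // Deadline expired with no state change; keep watching.
      return watchChannel(channel);
    }
    const next = channel.getConnectivityState(false);
    if (
      current === connectivityState.READY &&
      next === connectivityState.TRANSIENT_FAILURE
    ) {
      console.warn('gRPC channel went READY -> TRANSIENT_FAILURE; network may be down');
    }
    watchChannel(channel);
  });
}
```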
And then you can use the connectivity state values to know when it transitions from 2 ("READY") to 3 ("TRANSIENT_FAILURE"). But I feel like using this undocumented internal API is a hack. I love the fact that pubsub, gRPC, and Node's HTTP2 APIs all try to seamlessly re-establish connectivity without failing, but when taken to this extreme, in a situation where (let's say) the pubsub Publisher publishes infrequently and unpredictably, it becomes very difficult to detect network issues in a timely way. And if timeliness is important, I don't know of a solution other than this hack. Would the alternative be not using streaming pull? That seems extreme. I have to use pubsub because it's the only mechanism provided to me (by Google in this case) to get my data (from Google). So I really think there should be documented public API support for this. End note: it took me many hours of frustrating debugging to figure this out, due to documentation gaps in all of these APIs.
Another update: if I override the default "deadline" where streams are restarted every 15 minutes to be some very long time (e.g. hours), then I can get watchConnectivityState() to trigger upon TCP socket failure (at ~16 mins on Linux) even with grpc.keepalive_time_ms not set. I think what is happening is that if the deadline is shorter than the TCP ACK timeout, the HTTP2 streams are destroyed and re-created prior to the TCP timeout, and so no error is ever generated in those streams, even though they are being written to by the pubsub keepalive. It is mysterious, though, that Node seemingly ignores these errors from the underlying socket and continues trying to write to it, never propagating them to the Http2Session level, just because the original Http2Stream being written to is no longer alive. Because in the end there is just a single socket that is present the whole time. If the Http2Session were destroyed correctly, then gRPC would have to re-create it, which would certainly fail because at this point no new socket can successfully be opened.
Any update on this? We are seeing this quite regularly and it is causing a lot of issues, mainly because a high volume of messages and the need for manual intervention to re-establish the connection is a bad combination. Currently using version 4.0.1 in a GKE environment.
You can observe this behavior while using Cloud Run. Occasionally Cloud Run will reset the underlying TCP socket, which is similar to what @surlyhacker describes: https://cloud.google.com/run/docs/container-contract#outbound-reset
As several have noticed, these stream timeout issues seem to be at a layer lower than Pub/Sub. We don't have a ton of visibility into it, and I doubt we can do much to mitigate it directly. What I'd like to do is get the Pub/Sub library to detect when it needs to close and restart connections, which seems like a reasonable fix at this layer. Papering over the problem, but... yeah. In the meantime, we have actually set the keepalives by default now.
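For versions where keepalives are not yet on by default, a hedged sketch of enabling them explicitly — assuming that `grpc.*`-prefixed options passed to the client constructor are forwarded to the underlying channel, which is not confirmed in this thread — could be:

```js
// Sketch only: assumes the Pub/Sub client forwards `grpc.*`-prefixed
// options to the underlying gRPC channel (behavior may vary by version).
const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub({
  // Send a keepalive ping after 5 minutes of inactivity...
  'grpc.keepalive_time_ms': 300000,
  // ...and treat the connection as dead if no response arrives within 20s.
  'grpc.keepalive_timeout_ms': 20000,
});
```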
By the by, I don't know if any of this great low-level analysis would be interesting to @murgatroid99 - passing it along in case: #1764 (comment)
Thanks for working on this @feywind
I maintain the gRPC library, so I am speaking from that perspective. I am unsure why you would need to do any additional work to "bubble up" connection failures detected by the keepalive functionality. When the gRPC client sees that a keepalive ping does not get a response within the configured timeout (default 20s), it closes that connection and ends all open streams on that connection with an error. The pubsub library should see that error without you taking any explicit action. As a general rule, you as a user should essentially never need to use `watchConnectivityState` directly.
Ah, okay. Then maybe what's needed here is to figure out why those errors might not be bubbling up. (Probably because we are retrying internally, but maybe it would be useful for some users to receive that as a sign of possible trouble.)
In our workload, we use Cloud Run to listen to Pub/Sub messages. As mentioned in this comment, Cloud Run occasionally resets the underlying TCP socket, which causes the client to behave as others have mentioned in this issue.

Our workaround is to use one client per subscription and to set a timer when opening a connection. If a subscriber doesn't receive any message within that period, we unsubscribe and create a new PubSub client. We opened a support ticket and learned that when we create a new PubSub client, a new TCP connection is established, and they recommend reinitializing the gRPC client (PubSub client) when detecting a network reset. However, Cloud Run does not send any indication of when the container does that, so we decided to implement a timer. Since then we haven't had any connectivity issues.

I also spent some time looking into the message-stream module. As the client object is persistent throughout the lifetime of the message-stream instance, it will continue to use the same TCP connection even when a stream is ended and reset. @feywind I believe the issue would be resolved if we had a way to reinitialize the PubSub client when resetting the stream, or when we know the connection is lost.
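A minimal sketch of that timer-based workaround (the subscription name, idle window, and helper names are illustrative, not from the comment above):

```js
// Sketch of the timer-based workaround described above; the 10-minute
// idle window and names are illustrative.
const { PubSub } = require('@google-cloud/pubsub');

const IDLE_LIMIT_MS = 10 * 60 * 1000;

function listen(subscriptionName, onMessage) {
  const pubsub = new PubSub();
  const subscription = pubsub.subscription(subscriptionName);
  let idleTimer;

  const resetTimer = () => {
    clearTimeout(idleTimer);
    idleTimer = setTimeout(async () => {
      // No messages within the window: assume the connection is gone,
      // tear everything down and start over with a fresh client.
      await subscription.close();
      await pubsub.close();
      listen(subscriptionName, onMessage);
    }, IDLE_LIMIT_MS);
  };

  subscription.on('message', (message) => {
    resetTimer();
    onMessage(message);
    message.ack();
  });
  subscription.on('error', (err) => console.error('subscriber error:', err));
  resetTimer();
}
```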
@nguyen-vo Do you have keepalives enabled, as mentioned in the other comments? That should address the issue you are describing.
I assume you meant the gRPC keepalive option. We are still using v3.60 and this option is enabled, though we did still get the issue sometimes, hence we implemented the reset timer.
Quick update on this: there is actually a way to get some debug info out of the streams to see when they connect and disconnect:
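A minimal sketch of hooking those 'debug' events (the subscription name is illustrative, and the payload shape varies by library version):

```js
// Sketch: the 'debug' event is emitted by the subscription; the payload
// differs between versions (an Error in older ones, a DebugMessage-like
// object more recently), so just log whatever arrives.
const { PubSub } = require('@google-cloud/pubsub');

const subscription = new PubSub().subscription('my-subscription');

subscription.on('debug', (msg) => {
  console.log('subscriber debug event:', msg);
});
subscription.on('message', (message) => message.ack());
```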
The error has this form, which is admittedly not super useful for programmatically catching disconnects. Would everyone prefer passing on some kind of explicit disconnect notification? It's a little weird because the subscriber "stream" actually contains multiple streams, so we'd need to pass on some info about the overall status for it to be useful, I think. Edit: I just re-read and saw that you mentioned 'debug' in the original post. 🤦🏻♀️ Sorry. I'll look at this a bit more.
@google-cloud/pubsub version: 3.7.1
Steps to reproduce