pubsub: process hanging forever when large number of subscription combined with NumGoroutines #2593
Comments
This seems to be an issue with
In the meantime, to avoid this, please revert back to
Thank you, Alex, for getting back to us quickly.
Looks like the PR with a fix was merged; I'm just waiting for a new release. 🙂
Thanks for your patience! This is now released in
I think this is still a problem in

I have an app that pulls 27 subscriptions. Here's our unack message count graph with the default setting of

Here's the ack requests graph, which follows the same pattern:

As seen, there's a cyclical pattern of processing all the messages, then failing to process for roughly a couple of hours, then processing all the messages again. The time to process an individual message doesn't change, and neither does the publish rate. It doesn't completely stop, it just slows to a crawl (going from 30K rpm to ~500 rpm). Changing

I'm not entirely sure how to debug this further. I did notice the same
For what it's worth, I see a saw-blade-like pattern like @maroux notes with several of my Go pubsub client programs. I'm not up to date with what the "highest performance" set of options for the client is, but I definitely see the jagged unack graph when I look at subscriptions.
We observed a very similar problem. In our setup we have 22 subscriptions (exactly-once delivery enabled) on a single pod with 10 goroutines (the default setting), and we saw the zigzag pattern in the graphs as well. The subscription received messages but didn't pass them to the actual subscription handler in the application. It sometimes took over 2h until a message was actually processed. Then suddenly all stuck messages were processed at once, only to get stuck again shortly after.

All of these problems vanished after we decreased the number of goroutines from 10 to 1. We played around with the settings to determine the break point of the system: it turns out that our setup works with 4 goroutines but breaks with 5. Now the messages are processed pretty much instantly. We are of course hoping for a fix, or at least an explanation of the bug, but we can live with the current setup for now.
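(For reference, a minimal sketch of the configuration change described above, assuming the standard cloud.google.com/go/pubsub client; the subscription ID is a placeholder and the value 4 is simply the break point found in this particular setup.)

```go
package example

import "cloud.google.com/go/pubsub"

// tuneSubscription applies the workaround described above: it lowers the number
// of StreamingPull goroutines per subscription before Receive is called.
func tuneSubscription(client *pubsub.Client) *pubsub.Subscription {
	sub := client.Subscription("example-subscription") // placeholder ID
	sub.ReceiveSettings.NumGoroutines = 4              // default is 10; 5 or more stalled in this setup
	return sub
}
```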
@userwerk I am running into the same issue, and it seemingly happened when we upgraded our client lib version from

Not super certain how this could be causing the issue, tbh; it might be a red herring. I'm going to try your approach of decreasing our goroutines and see if that works.

EDIT:
Hey @akallu and @userwerk, sorry for not commenting on this earlier. Sometimes closed issues fall off my radar, since they depend on transient notifications that I don't see right away.

This issue is something that we're aware of, and decreasing

I have a suspicion that the upgrade might be a red herring. I can't see how that change is related to more messages expiring. I would look into it further, but decreasing streams seems to have already unblocked this specific case for you.

We're looking for a fix, but some other insights I want to share are that
Hope this helps, and I'll watch this issue a bit more closely for now. If you're experiencing behavior that isn't related to streams, please open another issue. I'll also try to get plugged into the internal issue if that helps.
@hongalex Thanks for taking the time to respond, much appreciated! Your comment makes sense for the most part; my only remaining question is about what the level of concurrency is. When you say it's tied to

EDIT: Never mind, I read the issue you linked and understand now. Thanks!
Client
PubSub v1.3.1
cloud.google.com/go v0.57.0
google.golang.org/grpc -> v1.29.1
Environment
GKE n1-standard-4
Go Environment
Ubuntu 16.04.4 LTS + go1.14.2 linux/amd64
Code
e.g.
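Roughly the following (a minimal sketch of the setup described in this issue; the project ID and subscription names are placeholders, and the handler is simplified):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"sync"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()

	// One client for all subscriptions. "my-project" is a placeholder.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	var wg sync.WaitGroup
	for i := 0; i < 17; i++ { // 17 subscriptions, as in our setup
		sub := client.Subscription(fmt.Sprintf("sub-%d", i)) // placeholder IDs
		sub.ReceiveSettings.NumGoroutines = 8                // the setting that triggers the hang

		wg.Add(1)
		go func(sub *pubsub.Subscription) {
			defer wg.Done()
			// Each Receive opens NumGoroutines StreamingPull streams.
			if err := sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
				// ... handle the message ...
				m.Ack()
			}); err != nil {
				log.Printf("Receive(%s): %v", sub.ID(), err)
			}
		}(sub)
	}
	wg.Wait()
}
```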
Expected behavior
Pubsub subscriptions receive messages correctly
Actual behavior
When ReceiveSettings.NumGoroutines >= 7 (7 was our previous setting and worked fine before the upgrade), the process gets stuck right after starting.
Additional context
We recently upgraded google_cloud_go to 0.57.0, which comes with pubsub v1.3.1, and started seeing this issue.
Our process receives from 17 GCP Pub/Sub subscriptions using a single pubsub client.
By digging further (e.g. when setting NumGoroutines=8), we found that the process immediately hits grpc.transport.http2Client.streamQuota (the maximum number of concurrent gRPC streams),
which defaults to 100; the GCP Pub/Sub servers seem to use 100 as well (captured from the handleSettings() handler).
When this limit is hit, new gRPC streams can't be created and have to wait for old ones to close. The issue is that the process hangs here forever: all streams are waiting and never proceed.
It seems the number of gRPC streams = ReceiveSettings.NumGoroutines * num_of_subscriptions.
The old version of pubsub (which comes with google_cloud_go 0.36.0) does not have this behavior: when we set NumGoroutines=8, pubsub/grpc only creates about 40 streams.
My questions:
Is gRPC streams = ReceiveSettings.NumGoroutines * num_of_subscriptions the expected behavior?
Is our way of setting NumGoroutines correct? With 17 subscriptions, 17 * 8 = 136 streams > the 100-stream quota (at least by default), which exceeds the stream limit and causes the degraded performance or hanging issue.
What's the recommended way of setting this number when receiving from a relatively large number of subscriptions?
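For illustration, a minimal sketch of one way to keep NumGoroutines * num_of_subscriptions at or under the default 100-stream quota, based on the formula observed above; the helper is hypothetical and not an official recommendation:

```go
package example

import "cloud.google.com/go/pubsub"

// capNumGoroutines divides a stream quota evenly across subscriptions so that
// len(subs) * NumGoroutines stays at or under the quota
// (e.g. 100 / 17 = 5 goroutines per subscription).
func capNumGoroutines(subs []*pubsub.Subscription, streamQuota int) {
	perSub := streamQuota / len(subs)
	if perSub < 1 {
		perSub = 1
	}
	for _, sub := range subs {
		sub.ReceiveSettings.NumGoroutines = perSub
	}
}
```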
Thanks for reading.