
pubsub: process hanging forever when large number of subscription combined with NumGoroutines #2593

Closed
Wangxishu opened this issue Jul 13, 2020 · 10 comments · Fixed by #2624
Labels: api: pubsub · priority: p1 · type: bug · 🚨 This issue needs some love.

Comments

@Wangxishu

Client

PubSub v1.3.1
cloud.google.com/go v0.57.0
google.golang.org/grpc -> v1.29.1

Environment

GKE n1-standard-4

Go Environment

Ubuntu 16.04.4 LTS + go1.14.2 linux/amd64

Code

e.g.

sub := client.Subscription(subName)
ok, err := sub.Exists(ctx) // restored: ok and err are checked below
if err != nil {
  client.Close()
  return nil, err
}
if !ok {
  client.Close()
  return nil, fmt.Errorf("pubsub subscription does not exist: %s", path)
}
// maxOutstandingMessages could be 15, 100, or 1000 for different subscriptions
sub.ReceiveSettings.MaxOutstandingMessages = conf.maxOutstandingMessages
// issue appears when numPullerGoroutines >= 7
sub.ReceiveSettings.NumGoroutines = conf.numPullerGoroutines
sub.ReceiveSettings.Synchronous = false
sub.Receive(ctx, func(_ context.Context, m *gpubsub.Message) {
  if err := handler(hctx, m); err == nil {
     m.Ack()
  } else {
     m.Nack()
  }
})

Expected behavior

Pubsub subscriptions receive messages correctly

Actual behavior

When ReceiveSettings.NumGoroutines >= 7 (our previous setting, which used to work fine), the process gets stuck right after starting.

Additional context
We recently upgraded google_cloud_go to 0.57.0, which comes with pubsub v1.3.1, and started seeing this issue.
Our process receives from 17 Pub/Sub subscriptions (one pubsub client) on GCP Pub/Sub.

By digging further (e.g. when setting NumGoroutines=8), we found that the process immediately hits grpc.transport.http2Client.streamQuota (the maximum number of concurrent gRPC streams),
which defaults to 100; the GCP Pub/Sub servers seem to advertise 100 as well (captured from the handleSettings() handler).
Once this limit is hit, new gRPC streams can't be created and must wait for old ones to close. The problem is that the process hangs here forever: all streams are waiting and never make progress.

It seems the number of gRPC streams = ReceiveSettings.NumGoroutines * num_of_subscriptions.

The old version of pubsub (which came with google_cloud_go 0.36.0) does not have this behavior: with NumGoroutines=8, pubsub/grpc only creates about 40 streams.
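
For illustration, a minimal sketch of the arithmetic implied above, assuming streams = NumGoroutines * number of subscriptions and the default HTTP/2 stream quota of 100 (the numbers 17 and 8 are from our setup):

package main

import "fmt"

// streamsNeeded reflects the observed behavior: each subscription appears
// to open roughly NumGoroutines pull streams.
func streamsNeeded(numSubscriptions, numGoroutines int) int {
	return numSubscriptions * numGoroutines
}

func main() {
	const streamQuota = 100 // default concurrent-stream limit observed above
	if n := streamsNeeded(17, 8); n > streamQuota {
		fmt.Printf("%d streams requested > quota of %d: pulls can block\n", n, streamQuota)
	}
}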

My questions:

  1. Is gRPC streams = ReceiveSettings.NumGoroutines * num_of_subscriptions the expected behavior?

  2. Is our way of setting NumGoroutines correct? With 17 subscriptions, 17 * 8 = 136 streams exceeds the default streamQuota of 100, which seems to be what causes the degraded performance or hanging.
    What's the recommended way of setting this value when we have a relatively large number of subscriptions?

Thanks for reading.

@hongalex
Member

This seems to be an issue with cloud.google.com/go/pubsub v1.3.0, where we updated the way our gRPC connections work. I can reproduce this on all versions v1.3.0 or higher (or on any modules that require google.golang.org/api v0.16.0 or higher).

In the meantime, to avoid this, please revert to cloud.google.com/go/pubsub v1.2.0 as a temporary workaround. I'll be looking into which dependency changes might have caused this and expect to have an update soon.
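
If it helps, pinning the module in a Go-modules project should look roughly like this (standard go tooling):

go get cloud.google.com/go/pubsub@v1.2.0
go mod tidy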

@Wangxishu
Author

Thank you, Alex, for getting back to us quickly.

@iagomelanias

Looks like the PR with a fix was merged; I'm just waiting for a new release. 🙂

@hongalex
Member

hongalex commented Jul 22, 2020

Thanks for your patience! This is now released in pubsub/v1.6.0.

@maroux

maroux commented Jun 24, 2021

I think this is still a problem in cloud.google.com/go/pubsub v1.11.0 / google.golang.org/grpc v1.38.0.

I have an app that pulls from 27 subscriptions. Here's our unacked message count graph with the default setting of NumGoroutines (=10):

[screenshot: unacked message count graph]

Here's the ack requests graph, which follows the same pattern:

[screenshot: ack requests graph]

As seen, there's a cyclical pattern: all messages get processed, then processing stalls for roughly a couple of hours, then everything is processed again. The time to process an individual message doesn't change, and neither does the publish rate. It doesn't completely stop, it just slows to a crawl (from ~30K rpm to ~500 rpm).

Setting NumGoroutines=1 on every subscription makes everything process just fine, with the ack requests graph settling at a roughly constant value (after 11:30):

[screenshot: ack requests graph after setting NumGoroutines=1]

I'm not entirely sure how to debug this further. I did notice the same streamQuota (100).
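
A minimal sketch of the "NumGoroutines=1 on every subscription" setup, assuming a single process receiving from many subscriptions; the project ID, subscription IDs, and handler are placeholders rather than the real setup described above:

package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
	"golang.org/x/sync/errgroup"
)

// receiveAll starts one Receive loop per subscription, each with a single pull stream.
func receiveAll(ctx context.Context, client *pubsub.Client, subIDs []string) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, id := range subIDs {
		id := id // shadow for pre-Go 1.22 loop-variable semantics
		sub := client.Subscription(id)
		sub.ReceiveSettings.NumGoroutines = 1 // one pull stream per subscription
		g.Go(func() error {
			return sub.Receive(ctx, func(_ context.Context, m *pubsub.Message) {
				log.Printf("got message %s from %s", m.ID, id) // placeholder handler
				m.Ack()
			})
		})
	}
	return g.Wait()
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	if err := receiveAll(ctx, client, []string{"sub-a", "sub-b"}); err != nil {
		log.Fatal(err)
	}
}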

@willie

willie commented Oct 20, 2021

For what it's worth, I see a sawtooth pattern like the one @maroux notes with several of my Go pubsub client programs. I'm not up to date on what the "highest performance" set of options for the client is, but I definitely see the jagged unacked graph when I look at subscriptions.

@userwerk

userwerk commented Apr 12, 2024

We observed a very similar problem. In our setup we have 22 subscriptions (exactly-once delivery enabled) on a single pod with 10 goroutines (the default setting), and we saw the zigzag pattern in the graphs as well. The subscriber basically received messages but didn't pass them to the actual subscription handler in the application. It sometimes took over 2 hours until a message was actually processed. Then suddenly all stuck messages were processed at once, only to get stuck again shortly after.
We also received a lot of "automatic lease modack retry failed for following IDs" errors with status 3 - PERMANENT_FAILURE_INVALID_ACK_ID, as well as "rpc error: code = Unauthenticated desc = Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential." errors.

All of these problems vanished after we decreased the number of goroutines from 10 to 1. We played around with the settings to determine the breaking point of the system; it turns out that our setup works with 4 goroutines but breaks with 5. Now the messages are processed pretty much instantly.

We are of course hoping for a fix or at least an explanation of the bug, but we can live with the current setup for now.

@akallu

akallu commented May 31, 2024

@userwerk I am running into the same issue, and it seemingly happened when we upgraded our client lib version from 1.36.1 to 1.36.2. The only change in 1.36.2 (changelog here) was the PR here.

Not super certain how this could be causing this issue, to be honest; it might be a red herring. I'm going to try your approach of decreasing our goroutines and see if that works.

EDIT:
Decreasing goroutine count to 1 worked! Thank you so much for your comment, it saved me a lot of headache.
Will try pushing it up to 4 like you, but I'm not sure this is going to be an acceptable solution for my use-case, as it's for a pretty heavy workload. Will raise this issue further with Google. Thanks again!

@hongalex
Member

Hey @akallu and @userwerk, sorry for not commenting on this earlier. Sometimes closed issues fall off my radar, since they depend on transient notifications that I don't see right away.

This issue is something we're aware of, and the fact that decreasing NumGoroutines to 1 helps basically confirms it's a stream-related issue. The root cause is described here; specifically, it is related to messages expiring and clogging up the client library when they aren't processed in time.

I have a suspicion that the upgrade might be a red herring; I can't see how that change is related to more messages expiring. I would look into it further, but decreasing streams seems to have already unblocked this specific case for you.

We're looking into a fix, but some other insights I want to share are:

  1. The default NumGoroutines=10 is actually too high for a lot of use cases. It opens up 10 streams, but each stream can deliver 10 MB/s of messages. If you aren't hitting this throughput, I recommend decreasing to just 1.
  2. You could consider more horizontal sharding (increasing the number of subscriber clients, each with NumGoroutines=1).
  3. NumGoroutines isn't actually indicative of concurrency. The number of callbacks / messages being processed at once is tied to MaxOutstandingMessages (see the sketch after this list). We're planning to rework this soon, since this value is also currently tied to stream flow control, which is ultimately what is causing this poor behavior in the first place.
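
To illustrate points 1 and 3, a minimal sketch (illustrative values only, assuming the usual cloud.google.com/go/pubsub imports and an existing subscription handle sub):

sub.ReceiveSettings.NumGoroutines = 1            // one pull stream; each stream can carry ~10 MB/s
sub.ReceiveSettings.MaxOutstandingMessages = 100 // bounds how many callbacks run at once
if err := sub.Receive(ctx, func(_ context.Context, m *pubsub.Message) {
	// process the message; concurrency is governed by MaxOutstandingMessages, not NumGoroutines
	m.Ack()
}); err != nil {
	log.Printf("Receive: %v", err)
}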

Hope this helps and I'll watch this issue a bit more closely for now. If you're experiencing behavior that isn't related to streams, please open another issue. I'll also try to get plugged into the internal issue if that helps.

@akallu

akallu commented May 31, 2024

@hongalex Thanks for taking the time to respond, much appreciated!

Your comment makes sense for the most part; my only remaining question is about the level of concurrency. When you say it's tied to MaxOutstandingMessages, does that mean it's exactly that value, or some multiplier on top of it?

EDIT: Nevermind, read the issue you linked and I understand now. Thanks!
