
pubsub: process hanging forever when large number of subscription combined with NumGoroutines #2593

Closed
Wangxishu opened this issue Jul 13, 2020 · 10 comments · Fixed by #2624
Labels: api: pubsub · priority: p1 · type: bug · 🚨 This issue needs some love.

Comments

@Wangxishu

Client

PubSub v1.3.1
cloud.google.com/go v0.57.0
google.golang.org/grpc -> v1.29.1

Environment

GKE n1-standard-4

Go Environment

Ubuntu 16.04.4 LTS + go1.14.2 linux/amd64

Code

e.g.

sub := client.Subscription(subName)
ok, err := sub.Exists(ctx) // restored: ok and err are checked below
if err != nil {
  client.Close()
  return nil, err
}
if !ok {
  client.Close()
  return nil, fmt.Errorf("pubsub subscription does not exist: %s", path)
}
// maxOutstandingMessages could be 15, 100, or 1000 for different subscriptions
sub.ReceiveSettings.MaxOutstandingMessages = conf.maxOutstandingMessages
// issue appears when numPullerGoroutines >= 7
sub.ReceiveSettings.NumGoroutines = conf.numPullerGoroutines
sub.ReceiveSettings.Synchronous = false
sub.Receive(ctx, func(_ context.Context, m *gpubsub.Message) {
  if err := handler(hctx, m); err == nil {
     m.Ack()
  } else {
     m.Nack()
  }
})

Expected behavior

Pubsub subscriptions receive messages correctly

Actual behavior

When ReceiveSettings.NumGoroutines >= 7 (our previous setting, which used to work fine), the process gets stuck right after starting.

Additional context
We recently upgraded google_cloud_go to 0.57.0, which comes with pubsub v1.3.1, and started seeing this issue.
Our process receives from 17 Pub/Sub subscriptions (one pubsub client) on GCP Pub/Sub.

By digging further (e.g. when setting NumGoroutines=8), we found that the process immediately hits grpc.transport.http2Client.streamQuota (the maximum number of concurrent gRPC streams),
which defaults to 100; the GCP Pub/Sub servers seem to advertise 100 as well (captured from the handleSettings() handler).
Once this limit is hit, new gRPC streams can't be created and must wait for old ones to close. The problem is that the process hangs here forever: all streams are waiting and never make progress.

It seems the number of gRPC streams = ReceiveSettings.NumGoroutines * num_of_subscriptions.

The old version of pubsub (which came with google_cloud_go 0.36.0) does not have this behavior: with NumGoroutines=8, pubsub/grpc only creates about 40 streams.
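
For illustration, a minimal sketch of the arithmetic implied above, assuming streams = NumGoroutines * number of subscriptions and the default HTTP/2 stream quota of 100 (the numbers 17 and 8 are from our setup):

package main

import "fmt"

// streamsNeeded reflects the observed behavior: each subscription appears
// to open roughly NumGoroutines pull streams.
func streamsNeeded(numSubscriptions, numGoroutines int) int {
	return numSubscriptions * numGoroutines
}

func main() {
	const streamQuota = 100 // default concurrent-stream limit observed above
	if n := streamsNeeded(17, 8); n > streamQuota {
		fmt.Printf("%d streams requested > quota of %d: pulls can block\n", n, streamQuota)
	}
}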

My questions:

  1. Is gRPC streams = ReceiveSettings.NumGoroutines * num_of_subscriptions the expected behavior?

  2. Is our way of setting NumGoroutines correct? With 17 subscriptions, 17 * 8 = 136 streams exceeds the default streamQuota of 100, which seems to be what causes the degraded performance or hanging.
    What's the recommended way of setting this value when we have a relatively large number of subscriptions?

Thanks for reading.

@hongalex
Member

This seems to be an issue with cloud.google.com/go/pubsub v1.3.0, where we updated the way our gRPC connections work. I can reproduce this on all versions v1.3.0 or higher (or on any modules that require google.golang.org/api v0.16.0 or higher).

In the meantime, to avoid this, please revert to cloud.google.com/go/pubsub v1.2.0 as a temporary workaround. I'll be looking into which dependency changes might have caused this and expect to have an update soon.
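
If it helps, pinning the module in a Go-modules project should look roughly like this (standard go tooling):

go get cloud.google.com/go/pubsub@v1.2.0
go mod tidy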

@Wangxishu
Author

Thank you, Alex, for getting back to us quickly.

@iagomelanias

Looks like the PR with a fix was merged; I'm just waiting for a new release. 🙂

@hongalex
Member

hongalex commented Jul 22, 2020

Thanks for your patience! This is now released in pubsub/v1.6.0.

@maroux

maroux commented Jun 24, 2021

I think this is still a problem in cloud.google.com/go/pubsub v1.11.0 / google.golang.org/grpc v1.38.0.

I have an app that pulls from 27 subscriptions. Here's our unacked message count graph with the default setting of NumGoroutines (=10):

[screenshot: unacked message count graph]

Here's the ack requests graph, which follows the same pattern:

[screenshot: ack requests graph]

As seen, there's a cyclical pattern: all messages get processed, then processing stalls for roughly a couple of hours, then everything is processed again. The time to process an individual message doesn't change, and neither does the publish rate. It doesn't completely stop, it just slows to a crawl (from ~30K rpm to ~500 rpm).

Setting NumGoroutines=1 on every subscription makes everything process just fine, with the ack requests graph settling at a roughly constant value (after 11:30):

[screenshot: ack requests graph after setting NumGoroutines=1]

I'm not entirely sure how to debug this further. I did notice the same streamQuota (100).
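
A minimal sketch of the "NumGoroutines=1 on every subscription" setup, assuming a single process receiving from many subscriptions; the project ID, subscription IDs, and handler are placeholders rather than the real setup described above:

package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
	"golang.org/x/sync/errgroup"
)

// receiveAll starts one Receive loop per subscription, each with a single pull stream.
func receiveAll(ctx context.Context, client *pubsub.Client, subIDs []string) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, id := range subIDs {
		id := id // shadow for pre-Go 1.22 loop-variable semantics
		sub := client.Subscription(id)
		sub.ReceiveSettings.NumGoroutines = 1 // one pull stream per subscription
		g.Go(func() error {
			return sub.Receive(ctx, func(_ context.Context, m *pubsub.Message) {
				log.Printf("got message %s from %s", m.ID, id) // placeholder handler
				m.Ack()
			})
		})
	}
	return g.Wait()
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	if err := receiveAll(ctx, client, []string{"sub-a", "sub-b"}); err != nil {
		log.Fatal(err)
	}
}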

@willie

willie commented Oct 20, 2021

For what it's worth, I see a sawtooth pattern like the one @maroux notes with several of my Go pubsub client programs. I'm not up to date on what the "highest performance" set of options for the client is, but I definitely see the jagged unacked graph when I look at subscriptions.

@userwerk

userwerk commented Apr 12, 2024

We observed a very similar problem. In our setup we have 22 subscriptions (exactly-once delivery enabled) on a single pod with 10 goroutines (the default setting), and we saw the zigzag pattern in the graphs as well. The subscriber basically received messages but didn't pass them to the actual subscription handler in the application. It sometimes took over 2 hours until a message was actually processed. Then suddenly all stuck messages were processed at once, only to get stuck again shortly after.
We also received a lot of "automatic lease modack retry failed for following IDs" errors with status 3 - PERMANENT_FAILURE_INVALID_ACK_ID, as well as "rpc error: code = Unauthenticated desc = Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential." errors.

All of these problems vanished after we decreased the number of goroutines from 10 to 1. We played around with the settings to determine the breaking point of the system; it turns out that our setup works with 4 goroutines but breaks with 5. Now the messages are processed pretty much instantly.

We are of course hoping for a fix or at least an explanation of the bug, but we can live with the current setup for now.

@akallu

akallu commented May 31, 2024

@userwerk I am running into the same issue, and it seemingly happened when we upgraded our client lib version from 1.36.1 to 1.36.2. The only change in 1.36.2 (changelog here) was the PR here.

Not super certain how this could be causing this issue, to be honest; it might be a red herring. I'm going to try your approach of decreasing our goroutines and see if that works.

EDIT:
Decreasing goroutine count to 1 worked! Thank you so much for your comment, it saved me a lot of headache.
Will try pushing it up to 4 like you, but I'm not sure this is going to be an acceptable solution for my use-case, as it's for a pretty heavy workload. Will raise this issue further with Google. Thanks again!

@hongalex
Member

Hey @akallu and @userwerk, sorry for not commenting on this earlier. Sometimes closed issues fall off my radar, since they depend on transient notifications that I don't see right away.

This issue is something we're aware of, and the fact that decreasing NumGoroutines to 1 helps basically confirms it's a stream-related issue. The root cause is described here; specifically, it is related to messages expiring and clogging up the client library when they aren't processed in time.

I have a suspicion that the upgrade might be a red herring; I can't see how that change is related to more messages expiring. I would look into it further, but decreasing streams seems to have already unblocked this specific case for you.

We're looking into a fix, but some other insights I want to share are:

  1. The default NumGoroutines=10 is actually too high for a lot of use cases. It opens up 10 streams, but each stream can deliver 10 MB/s of messages. If you aren't hitting this throughput, I recommend decreasing to just 1.
  2. You could consider more horizontal sharding (increasing the number of subscriber clients, each with NumGoroutines=1).
  3. NumGoroutines isn't actually indicative of concurrency. The number of callbacks / messages being processed at once is tied to MaxOutstandingMessages (see the sketch after this list). We're planning to rework this soon, since this value is also currently tied to stream flow control, which is ultimately what is causing this poor behavior in the first place.
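
To illustrate points 1 and 3, a minimal sketch (illustrative values only, assuming the usual cloud.google.com/go/pubsub imports and an existing subscription handle sub):

sub.ReceiveSettings.NumGoroutines = 1            // one pull stream; each stream can carry ~10 MB/s
sub.ReceiveSettings.MaxOutstandingMessages = 100 // bounds how many callbacks run at once
if err := sub.Receive(ctx, func(_ context.Context, m *pubsub.Message) {
	// process the message; concurrency is governed by MaxOutstandingMessages, not NumGoroutines
	m.Ack()
}); err != nil {
	log.Printf("Receive: %v", err)
}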

Hope this helps and I'll watch this issue a bit more closely for now. If you're experiencing behavior that isn't related to streams, please open another issue. I'll also try to get plugged into the internal issue if that helps.

@akallu

akallu commented May 31, 2024

@hongalex Thanks for taking the time to respond, much appreciated!

Your comment makes sense for the most part; my only remaining question is about the level of concurrency. When you say it's tied to MaxOutstandingMessages, does that mean it's exactly that value, or some multiplier on top of it?

EDIT: Nevermind, read the issue you linked and I understand now. Thanks!
