pubsub: Significant redelivery while existing messages are still being "processed" #10437
Comments
Thanks as always for the detailed logs.
I think this might be the other way around, where more pulled messages = more modacks issued at the time they are pulled in (to acknowledge receipt of those messages).
There's no guarantee that increasing the number of streams means you'll hit another server. On top of that, this level of processing shouldn't be OOMing the servers, so that doesn't seem right either. I'll check to see if there are any changes around then that might've resulted in this. Just to confirm, this region was processing the data at the same speed (around 400k outstanding messages per client) without issue prior to 6/20?
So it seems like the issue is messages expiring. Specifically, messages are being pulled into the client library but not processed, because the concurrency control limit, MaxOutstandingMessages, has been reached. This is supported by the fact that your end-to-end latency is very high (>1 hour) around the time of these expirations. I think server-side flow control is working most of the time, but when the stream is torn down and recreated, the server is erroneously delivering more messages. My current hypothesis goes something like:
These steps happen randomly, and eventually you build up a large number of outstanding messages in your client that can't be processed in time. I'll try to craft a version of the library that includes a separate setting for #callbacks, as promised in #9727, which should let you set num_callbacks to > 400k so your application won't get bogged down by messages that are received but not yet processed in these off cases. The long-term solution is to fix the behavior in step 4, but that requires a lot more planning.
I was thinking something like this was happening but I shrugged it off initially because I assumed the new server would not have any messages to deliver because all of the existing messages in the subscription were already leased by a client (and being modack'd). Or does the stream being killed/restarted invalidate all of the leases being held by the outstanding messages delivered on that stream?
I'm also a bit confused about why it works for several hours of the day and then suddenly decides to redeliver a bunch of messages, despite the stream (seemingly) being restarted constantly throughout the day. Could it be that it's reconnecting to the same forwarder most of the time, but when it connects to a different one (in step 4) that causes the issue? Or… I'm trying to see if it's somehow related to
I was assuming you have a constant flow of messages coming into your subscription. Is your publish throughput less sustained and done more in bursts?
Yeah, that's what I'm thinking is happening. I'm not sure if increasing MaxOutstandingMessages will help here. I think it might be best to try the new version of the library that tweaks callbacks independently (maybe setting that to 500k) and see if the issue improves. I'll update here once that's ready.
@jameshartig When you get the chance, can you pull in this version of the library: … This introduces a new setting to … For your case, I recommend keeping
Apologies for the delay.
Yes, we do have a constant stream. I see what you meant now, but I still wouldn't expect to see a huge spike in redelivered messages like we do.
I'll work on getting that out today and report back.
Separately, I'm not sure how best to set the 2 fields in conjunction for a variety of applications.
We try to hide most of this complexity from our developers, except when necessary (which is maybe only the 3rd case above), so I'm trying to figure out how to support 1 & 2 automatically. Currently our framework exposes a knob that is now going to be
Yes, that's currently the behavior in the main branch. MaxOutstandingMessages is the same as the number of callbacks to start, and configures server side flow control (per stream). You might run into the issue where the streams "forget" about the current flow control when reconnecting, but this is a current limitation with the server that we're actively looking to address.
That sounds like a decent tradeoff. Most of the time, setting MaxOutstandingMessages = 10 means that when there is no redelivery (and streams = 1), client side concurrency will never kick in, and there will only be 10 messages processed simultaneously.
I think the upper bound of RAM should determine MaxCallbacks, since most of the time, the client will only be server side flow controlled, or in other words, only have
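For readers following along, here is a minimal sketch (assuming the released Go client; the project/subscription IDs and limit values are placeholders, not the configuration used in this issue) of how these knobs are coupled today: MaxOutstandingMessages drives both per-stream server-side flow control and the cap on concurrently running callbacks, which is why a separate callbacks setting is being discussed.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("my-subscription") // placeholder subscription ID

	// In the released client, MaxOutstandingMessages both sets per-stream
	// server-side flow control and caps the number of concurrently running
	// callbacks; there is no independent callbacks setting yet.
	sub.ReceiveSettings.MaxOutstandingMessages = 400000
	sub.ReceiveSettings.MaxOutstandingBytes = 100 * 1024 * 1024 // ~100MiB, illustrative
	sub.ReceiveSettings.NumGoroutines = 1                       // one StreamingPull stream

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		m.Ack() // placeholder handler
	})
	if err != nil {
		log.Fatal(err)
	}
}
```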
Alternatively, what if we set
Thanks for trying that fix. In my dashboards, I can see that the ack latencies for your messages (the time between when the server delivers a message and when it is acked) are very high, consistently around 2 hours, with a spike in expired messages around the time of your last screenshots. In addition, the messages held in memory are being lease extended. If your original statement about processing times being about 10s at p99 still stands, this points to the library pulling in messages as they become available, with the limiting factor continuing to be the number of simultaneous callbacks in memory.
I was actually going to suggest this. Given that your ack latencies are high to the server, it's possible your client isn't actually able to process 400k messages simultaneously.
Pub/Sub delivers the oldest messages on a best-effort basis, yes.
Yeah that's probably what it is actually. There's a pretty clear correlation between nack and ack expirations (since they are the same).
I could try this, though I'm a bit skeptical this is happening since the expiration rate doesn't seem to be that high relative to your processing rate. I think the expirations are closely correlated with nacks. Just to confirm, from this comment, the problem is that your in-memory processing rate is lower than before the change, right? Did you keep the settings since then, or did you switch to different settings?
Yeah, that makes sense. To confirm: for the last several hours there haven't been many expirations, but the number of messages we are holding onto keeps going up. Assuming that the expired graph represents modack expirations, then I agree the modack theory probably isn't what's happening.
I'm hoping that the sum of the messages we keep in memory (i.e., waiting for the 2h sleep) across all of the instances is approximately the number of unacked messages in the subscription. Ideally each instance would get a distinct subset of the messages. The problem I outlined in that comment is that after the change we almost immediately got way more messages sent to our handler than we should have. The publish rate for this subscription is 50-70 messages a second, so we expect to have approximately 60*7200=432k messages unacked at any given time, but we should be able to process significantly more than the publish rate (except for the 2 hour sleep). Also, we are only experiencing this issue in europe-west1.
We initially rolled back after that earlier comment but have since rolled the changes you requested back out. We haven't seen much improvement with or without the
We just put
There definitely are a good number of timers in… I think the question that's bugging me now is why you are pulling in more messages when server-side flow control should have stopped delivering messages after the first 150k. Could you try setting… If that doesn't work, I would next suggest opening a ticket so the backend team can look at this too.
Right, I think this tracks with my current understanding of the situation. When streams are reopened, there's a chance that the server will deliver more messages because it currently has no way of knowing how many messages are already held in memory by each client. This is something we're working on fixing. Re: bytes, this might not be necessary anymore. From your recent "Client Outstanding Messages" graph, it does seem like each client holds around 150k messages, with some temporary increases beyond that around stream opens. This doesn't align with your previous comment where you were holding 900k messages in memory. Perhaps I'm misunderstanding something, or do you think the opencensus metrics are incorrect?
The above screenshots were after a deploy, during which it does seem to work, but then the next day...
Yeah, in that case, could you try setting MaxOutstandingBytes to 150,000 * the average message size in bytes? If this fixes the issue, it would point to a problem with the logic of how we do server-side flow control.
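A rough sketch of that suggestion, not the actual configuration used here: the ~130-byte average message size is only an estimate derived from the 50-60MB / 400-450k figures earlier in this issue, and configureFlowControl is a hypothetical helper name.

```go
package flowcontrolsketch

import "cloud.google.com/go/pubsub"

// configureFlowControl is a hypothetical helper illustrating the suggestion
// above: size MaxOutstandingBytes so the byte-based limit binds at roughly
// the same point as the message-based limit.
func configureFlowControl(sub *pubsub.Subscription) {
	const (
		targetMessages  = 150000 // per-client message limit discussed above
		avgMessageBytes = 130    // rough estimate from the 50-60MB / 400-450k figures
	)
	sub.ReceiveSettings.MaxOutstandingMessages = targetMessages
	// 150,000 * 130 ≈ 19.5MB.
	sub.ReceiveSettings.MaxOutstandingBytes = targetMessages * avgMessageBytes
}
```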
We recently eclipsed 450k messages in the subscription backlog, which could be significant since 150k * 3 VMs = 450k, so I just deployed
It seems suspicious that the message expiration tends to happen when your backlog increases. I'm curious to see if the behavior is better when the total number of messages > backlog (with your recent change). Also wanted to double check: your application only nacks messages if 1) the message is currently held in memory or 2) you are in the middle of shutting down. Is that right? I just wanted to make sure the expirations are resulting in nacks (with redeliveries) rather than the other way around (nacks getting recorded as expirations).
Exactly. I'll report back with the 200k later today but so far things seem to be working well.
That used to be the case. Now we nack messages in 3 situations:
We are debating disabling the rate limit that we recently added but need to first be confident in the rate of messages. This is part of the issue of us configuring
I believe the issue is that when I configured
So I think I need to undo setting… Also, another OpenCensus metric that would be helpful is how many bytes and messages are blocked in
Sorry, just catching up on the information you've provided (thanks for doing so).
Yeah that's a good point. It seems like server side flow control is being exceeded anyway, which is frustrating.
I added to the… If you need something which is a LastValue view, similar to OutstandingMessages, I could probably add it in as a temporary fix, though I'm not sure if it would be added as a permanent feature in the library.
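For reference, a minimal sketch of how the OpenCensus views are registered and read, using only the views already exported by the released client; the experimental count view discussed above isn't included since it only exists on that branch, and the one-minute polling loop is just illustrative (normally an exporter such as Prometheus or Stackdriver would consume the views).

```go
package main

import (
	"log"
	"time"

	"cloud.google.com/go/pubsub"
	"go.opencensus.io/stats/view"
)

func main() {
	// Register the subscriber views exported by the client library, including
	// the LastValue-style OutstandingMessagesView discussed in this thread.
	if err := view.Register(
		pubsub.PullCountView,
		pubsub.OutstandingMessagesView,
		pubsub.OutstandingBytesView,
		pubsub.StreamOpenCountView,
	); err != nil {
		log.Fatal(err)
	}

	// For illustration, poll the outstanding-messages view directly instead
	// of wiring up an exporter.
	for range time.Tick(time.Minute) {
		rows, err := view.RetrieveData(pubsub.OutstandingMessagesView.Name)
		if err != nil {
			log.Printf("retrieve: %v", err)
			continue
		}
		for _, row := range rows {
			log.Printf("outstanding messages: %v (tags: %v)", row.Data, row.Tags)
		}
	}
}
```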
So this does seem to me to mean (as you noted) that messages aren't being processed quickly after the ~2 hour sleep, and might be timing out even after 45 minutes. This might be something that can be verified using the OpenTelemetry tracing view (since user callback time is tracked there). If it would be helpful, I can create another branch that merges the otel branch + callbacks + opencensus metrics. Side note: unfortunately the otel tracing is delayed a bit longer since we're waiting to bump the min Go version from go1.20 -> go1.21 (which is the min version that otel now requires).
I believe there should be more information we can gather, though unfortunately I don't have access to this, and you would need to open a ticket to look into it more.
Unfortunately, there's a printf in that branch. What's the difference between PullCount and FlowControlledMessages?
I'm interested to hear why a LastValue view wouldn't be added but a Count view would be? Or were you saying that probably neither will make it into the library permanently?
I thought so too but I've added a lot of metrics around that and I can't seem to find any messages that are taking longer than a few seconds. But it could be that messages are being stuck in flow control and are never making it to my handler. I'll be able to confirm that once I disable the
I was about to suggest that. I do think that would help. Thanks!
That's a shame, but at least for us we run 1.22 internally, so this is fine.
Removed the printf, but yes, PullCount is essentially the same as FlowControlledMessages currently (minus when they were recorded). My original idea was to use FlowControlledMessages as a proxy for when messages were added to flow control, and then
Originally I was thinking neither would be added since we deprecated OpenCensus a while back, but given that OpenTelemetry metrics is still being worked on, I could see adding these metrics if they prove to be useful. I'll focus on getting those changes out as promised and will update later today.
Created… Something I realized when I was adding this is that the number of messages stuck in the flow controller when streams=1 should be limited to the batch of messages returned by
I deployed that to production. My plan is to filter traces to
Now that we're back live with
Yeah that's how it should behave. If
That's very curious, considering there shouldn't be anything new in that branch besides adding metrics. If it continues to behave well, it would point to the issue being a temporary server-side issue, but I'm not sure. On the chance that the issue resurfaces, I'm also wondering if you can scale up the number of subscriber clients. I think you currently have 3, and I'm wondering if creating more could alleviate the issue too.
Separately, I'm going to be deploying a change that sets
Yeah the ticket is a good idea since then someone can drill down into the message ID level and see what's happening there.
Yeah, I can see this being a problem. As I understand it, if an… Related,
That seems like a good idea. I'm going to chase down a CPU profile to see if I can understand whether that's causing the missed modacks. The
The ticket is 52368550. They provided a list of the calls related to messageID 11704992227715245, which is from the first trace above (abdbc06f25197b8b0015c9853788bd76). The server agrees with the client that there was a modack at 4:26, so I've asked for clarification on why it was redelivered 3 minutes later.
I added back the timer batching to reduce the number of overall timers the Go runtime has to manage, and that seems to have helped eliminate the CPU saturation spikes, but it hasn't helped reduce the redeliveries. I just deployed a change swapping out the otel span batch processor with our own, to help reduce dropped spans and get a clearer picture of the issue when it happens.
Just following up here since it's been a while since I responded. I'm following the ticket closely, and after the investigations are complete, they'll update on the ticket. If the suggestion is that the modacks are not being issued in time, I'll implement the change described in my previous comment. Thanks for clarifying that the CPU usage doesn't necessarily correlate with redeliveries.
I heard that there have been some changes on the server side to improve things. What is your recommendation as far as the max-callbacks branch/code? We've been running this whole time with
We plan to support configuring callbacks as a feature, but I'm working on making sure this is designed consistently with the Pub/Sub libraries in other languages. If the server-side changes fix the problem of overdelivering and you're ok with setting
We are okay with setting MaxGoroutines to 1 since we don't need more throughput. We also plan on testing exactly-once delivery to see if that helps with the redeliveries as well. I believe this can be closed unless you want to leave it open to track the NumCallbacks feature.
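Since exactly-once delivery came up, here is a brief sketch (assuming a subscription with EnableExactlyOnceDelivery turned on; the project and subscription IDs are placeholders) of what the subscriber side could look like, using AckWithResult to confirm whether each ack was actually accepted:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Assumes the subscription was created with
	// pubsub.SubscriptionConfig{EnableExactlyOnceDelivery: true}.
	sub := client.Subscription("my-subscription") // placeholder subscription ID
	sub.ReceiveSettings.NumGoroutines = 1

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// With exactly-once delivery, AckWithResult reports whether the ack
		// was actually accepted before the deadline expired.
		res := m.AckWithResult()
		if status, err := res.Get(ctx); err != nil || status != pubsub.AcknowledgeStatusSuccess {
			log.Printf("ack not confirmed: status=%v err=%v", status, err)
		}
	})
	if err != nil {
		log.Fatal(err)
	}
}
```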
Yeah, I'll close this for now. We're tracking callbacks in another location (across languages, so not specific to this GitHub issue yet). Thanks for your patience as the issue was resolved.
Client
PubSub v1.39.0
Environment
CentOS on GCE
Go Environment
N/A
Code
I'll try to include the relevant code below as it pertains to this issue.
e.g.
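The original snippet didn't survive formatting here, so below is only a hedged approximation of the pattern described in this issue (hold each message in memory until it is ~2 hours old, then process and ack), not the actual application code. The project and subscription IDs are placeholders, and the MaxExtension value is an assumption needed to keep leases alive past the default 60 minutes; the other settings echo values mentioned elsewhere in this issue.

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("my-subscription") // placeholder subscription ID
	sub.ReceiveSettings.MaxOutstandingMessages = 400000
	sub.ReceiveSettings.NumGoroutines = 1
	// Assumption: MaxExtension must exceed the ~2h hold time so the client
	// keeps extending leases (the default is 60 minutes).
	sub.ReceiveSettings.MaxExtension = 3 * time.Hour
	// Extend leases in large steps since messages are intentionally held
	// for a long time before being processed.
	sub.ReceiveSettings.MinExtensionPeriod = 10 * time.Minute

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// Hold the message until it is ~2 hours old, then "process" it.
		if wait := 2*time.Hour - time.Since(m.PublishTime); wait > 0 {
			select {
			case <-time.After(wait):
			case <-ctx.Done():
				m.Nack() // shutting down; let it be redelivered later
				return
			}
		}
		process(m) // the real work takes <10s at the 90th percentile
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}

// process stands in for the actual work done on each message.
func process(m *pubsub.Message) {
	_ = m.Data
}
```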
Expected behavior
I expect there to be minimal redeliveries while a message is being handled and is still being modacked by the client.
Actual behavior
We see messages being redelivered in frequent spikes. Since we only keep track of "already processing" messages on a per-VM basis, we can't accurately track when we get redeliveries to different VMs.
Screenshots
There are around 400k-450k unacked messages at any given time.
That said, the messages are each pretty small and only total 50MB-60MB:
We let messages get as old as ~2 hours before we start to actually "process" them:
We are generally "processing" the message in less than 10 seconds at the 90th percentile:
This is a chart of all of the times we had to cancel an existing "pending" message because we received another one.
You'll notice it spikes as high as 300k which means we are getting redeliveries of most of the outstanding messages.
This chart shows the total messages being held in memory per datacenter:
You'll notice that it's pretty normal until 6/22/2024 at 20:00 UTC, when it started to increase dramatically. It returned to normal on 6/24 but became erratic again on 6/24 at 20:00 UTC. Another thing to notice is that it spikes above 400k, which means we have significantly more messages being held in memory than are even outstanding in the subscription.
Even worse is if you look at the messages being held by the pub/sub client itself (this comes from the OutstandingMessagesView; green is the sum across 2-3 servers and yellow is the average). When things were normal, the client was holding onto 400k-500k messages, but when things go wrong the client is holding onto 1 million+ messages (presumably in flow control), which seems insane since there shouldn't be more than 400k messages in the subscription.
We set MinExtensionPeriod to 10 minutes since we expect to hold onto the messages for a long time and don't need to modack all of the time. This seems to be working, since we see a spike in modacks every 5 minutes. What's odd, though, is that around when we submit the modacks we see a spike in newly pulled messages and a spike in duplicate messages.
Additional context
NumGoroutines was previously set to 2, but after the issues on 6/22 we lowered it to 1 without much of a difference. WithGRPCConnectionPool was previously set to 2 (the default), but we raised it to 4 at the same time as lowering NumGoroutines and, again, there wasn't much difference. We also tried lowering the number of VMs supporting this workload from 3 -> 2, but that seemed to make things worse.
The Acknowledgement deadline on the subscription is set to the default (10 seconds). We haven't changed this but could if you think that would help.
We were previously "abusing" the backoff delay and kept nack'ing messages until the ~2 hours was up, but after talking to a Pub/Sub PM we were told that the backoff delay isn't a guarantee and that only a certain number of messages will respect it. Holding the messages in memory in our handler was suggested as a better approach. Things were working pretty normally until recently, so I'm not sure if there was some change on the Pub/Sub side. We didn't make any changes to our code on 6/22. It's also a bit interesting that things were okay on 6/24 until 20:00 UTC, again like on 6/22.
I'm wondering if there's some memory/message limit on the forwarders that we might be hitting, which causes them to crash and then another forwarder redelivers the messages. My assumption is based on us not having issues when the total number of undelivered messages stays under ~420k, because on 6/22, 6/23, and 6/24 we hit new peaks of ~430k, ~451k, and ~445k outstanding messages, respectively. If that's the case, would increasing the number of goroutines actually help, so that messages could potentially be spread across more forwarders? We could also try to decrease the size of the messages, but they're already pretty small.
Finally, related to the above, I limited this to europe-west1 because we have the most traffic there and the other regions seem relatively unaffected.
@hongalex I apologize in advance for the complicated issues.