[BUG] Elevated Latency in pipeline SET and MGET calls with version 9.5.2 or higher #3282
Comments
@tim-zhu-reddit thank you for reporting this. Have any changes to the redis server itself happened while this latency increase was observed? Was the change in latency solely based on the change of go-redis version?
Also, if you are correct that the bug is present in |
@ndyakov thank you for taking a look!
@ndyakov: we started to notice the issue when we upgraded the client lib from v9.2.0 to v9.7.0. After noticing the issue, we did a "bisect" on the releases between them and found that v9.5.1 is the last good one and v9.5.2 is the first bad one. If you think #2961 is the only one that could be relevant in the 9.5.2 release, then we can try testing 9.7.0 with a revert of that one applied (assuming it's still a clean revert) to see if we can reproduce the bug.
Update: We tested 9.7.0 with #2961 reverted (see https://github.com/fishy/go-redis/commits/38b8f52b4d563cb0c07557c254e4a9b16d25674e/), and verified that version does not have the same issue. So the bug should be somewhere inside #2961.
@fishy: that is interesting. Will look into it, see if I can prepare a benchmark test that will catch the degradation of performance, and see how to proceed. I expect this to take some time; in the meantime I hope the approach with the reverted PR works for you. By the way, do you have timeouts on the context that you are passing, or are you canceling the context by any chance?
Understood. @fishy are you observing something additional in the node logs or go-redis logs? I assume we should add more logging around this piece of code with the |
@ndyakov I looked into the server-side metrics for two tests we ran with the new version yesterday.
Below is a screenshot of the read and write request rate and latency by redis node. In our cluster options (https://pkg.go.dev/github.com/go-redis/redis#ClusterOptions), we set ReadOnly = true and RouteRandomly = true. Could #2961 cause a behaviour change in routing read requests?
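For reference, a minimal sketch of a cluster client with those two routing flags set (the seed address is a placeholder, not the reporter's actual configuration):

```go
package main

import "github.com/redis/go-redis/v9"

func main() {
	// Placeholder seed address; ReadOnly/RouteRandomly mirror the settings described above.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:         []string{"redis-cluster.internal:6379"},
		ReadOnly:      true, // allow keyspace reads on replica nodes
		RouteRandomly: true, // route read commands to a random master or replica
	})
	defer rdb.Close()
}
```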
Hm, what you are observing is quite strange. It is interesting to see a more equal distribution but higher latency. I am still focused on another feature, but I would like to dig deeper into this issue since we have pinpointed the exact patch that triggers it. It looks like the package is marking the nodes as unhealthy and rerouting requests to other nodes. What is quite strange to me is that during normal periods you don't seem to have an equal distribution of read requests. Maybe there is something in the application code that is reusing a sticky connection? (cc @tim-zhu-reddit, @fishy)
@ndyakov Server-side metrics look similar to the #3190 PR. It may be the case that an error (maybe a context error) happens in the pipelineReadCmds method and isBadConn() marks the node as failed. Maybe @fishy or @tim-zhu-reddit can share more information about the client configuration options and the timeouts set.
@EXPEbdodla thank you for helping, here is our config |
@ndyakov, good question. I double-checked our code, and we don't have any logic around sticky connections. since we set |
I reviewed the code in #2961, and it only handles ERR. I noticed that you confirmed the issue was alleviated after reverting the #2961 code, which suggests that in your example there were a large number of request errors (possibly Redis errors or network errors).
Alternatively, you could use the process of elimination by reverting the code from #2961 line by line, which could help pinpoint the location of the issue.
@monkey92t reverting the last line of change in #2961 fixed the issue for us. We also tested the latest master and confirmed that it is still problematic. So it is very likely that line caused our issue.
Thank you for your help. Now we can almost confirm the existence of a Redis error reply, and it might even be triggered by ReadOnly. We need to conduct further troubleshooting:
From the symptoms, it seems to be a READONLY error. You can confirm this in the isBadConn function. However, what's puzzling is that in cluster mode, write commands should not be sent to read-only nodes. If possible, could you provide a reproducible example?
Unfortunately we cannot even reproduce it in our staging environment, so we have to use production to verify whether a given go-redis version has the bug (when it does, the consequence is fortunately mild). Tim will update on the other questions.
I tested it out with both builds:
- Problematic version (readonly turned off)
- No-issue version (one-line reverted)
Our pipeline consists of a batch of 25 SET commands with NX and TTL.
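A minimal sketch of that pipeline shape; the key names, value, and TTL below are made up, and the real batch size is 25:

```go
package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// writeBatch issues one SET ... NX EX per name in a single pipeline.
func writeBatch(ctx context.Context, rdb *redis.ClusterClient, names []string) error {
	pipe := rdb.Pipeline()
	for _, name := range names {
		// Only set the key if it does not exist yet, with an expiration.
		pipe.SetNX(ctx, name, 1, 24*time.Hour)
	}
	_, err := pipe.Exec(ctx) // the batch is flushed in one round trip per node
	return err
}

func main() {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{Addrs: []string{"redis-cluster.internal:6379"}})
	_ = writeBatch(context.Background(), rdb, []string{"name-a", "name-b"})
}
```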
Based on my understanding, I'm quite puzzled. From your test, it seems that even when you set opt.ReadOnly to false, the issue persists. However, from your comments, it appears that the line of code you reverted is:

```go
if c.opt.ReadOnly && isBadConn(err, false, node.Client.getAddr()) {
	node.MarkAsFailing()
}
```

There's an obvious conflict here. According to the code, the new addition in the PR only takes effect when c.opt.ReadOnly is true. You executed a large number of SET commands (write operations) through a pipeline, which should be routed to the master node. go-redis assumes that the first node returned for a slot by CLUSTER SLOTS is the master.
@monkey92t I did provide the
agree, potentially turning |
So, @tim-zhu-reddit is there a log of the errors that are marking some of the nodes as failing? I will look more into this in the following week. Could it be that you are receiving a lot of moved errors? The timeouts that you have set seem high, so I wouldn't worry about that, but you can see if setting the second argument of |
I reviewed the output of your cluster nodes and cluster slots commands, and I didn't find any obvious issues. I think, to resolve this problem, we should conduct some additional tests to pinpoint where the error is occurring. Previously we were merely speculating that it might be a ReadOnly or MOVED error. We should do some debugging and log output (error.go, lines 90 to 121 at 1c9309f).
In that code, when a context timeout or cancellation, a ReadOnly error, a MOVED error, or any other unknown error occurs (this is important: did an unknown error arise?), it returns true. Can we add logging, error counts, or some debugging work at these points?
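As an illustration of that suggestion, a sketch of what branch-level logging could look like in a locally patched error.go; the helper and its placement are made up, and the branch names simply mirror the cases listed above:

```go
package redis // added alongside error.go in a locally patched checkout

import "log"

// logBadConn tags which isBadConn branch an error hit before it returns true.
// Purely diagnostic; remove after the investigation.
func logBadConn(branch, addr string, err error) {
	log.Printf("isBadConn hit: branch=%s addr=%s err=%v", branch, addr, err)
}

// Intended usage inside isBadConn (illustrative):
//
//	case context.Canceled, context.DeadlineExceeded:
//		logBadConn("context", addr, err)
//		return true
//	...
//	default:
//		logBadConn("default", addr, err)
//		return true
```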
Thanks for the advice, will add logging and test it out on Monday.
@monkey92t, added logging to each of the cases in isBadConn.
I didn't see any read-only or moved error logs.
The vast majority of the errors went into the default case (https://github.com/redis/go-redis/blob/1c9309fdc214c77d11953a15113ef6cbcfe49501/error.go#L109-L111), and the error is redis: nil.
There are also some that were not redis errors and reached https://github.com/redis/go-redis/blob/1c9309fdc214c77d11953a15113ef6cbcfe49501/error.go#L113-L114; the error was like read tcp 10.99.153.22:55070->10.24.94.166:6379: i/o timeout. I think they would count as timeout errors and return false.
If I'm not mistaken (responding from my phone, so can't check the code right now), the timeouts were not allowed in the said check that was added.
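To make that concrete, here is a simplified stand-in (not the library's actual code) for how a check behaves when the allow-timeout flag is false: an i/o timeout is then treated as a bad connection.

```go
package main

import (
	"fmt"
	"net"
)

// treatTimeoutAsBad mimics the shape of the check being discussed: when
// allowTimeout is false, an i/o timeout is classified as a bad connection.
func treatTimeoutAsBad(err error, allowTimeout bool) bool {
	if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
		return !allowTimeout // the timeout is only forgiven when allowTimeout is true
	}
	return err != nil
}

func main() {
	var timeoutErr net.Error = &net.DNSError{Err: "i/o timeout", IsTimeout: true}
	fmt.Println(treatTimeoutAsBad(timeoutErr, false)) // true  -> node would be marked failing
	fmt.Println(treatTimeoutAsBad(timeoutErr, true))  // false -> timeout tolerated
}
```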
Based on the tests we conducted above, I feel very confused. From the discussion, it can be seen that when the following code is reverted, the issue disappears:

```go
if c.opt.ReadOnly && isBadConn(err, false, node.Client.getAddr()) {
	node.MarkAsFailing()
}
```

However, when you set ReadOnly, RouteRandomly, and RouteByLatency to false, the problem still persists. From this phenomenon, it shouldn't be the case that simply reverting the above code would resolve it, because it first checks whether c.opt.ReadOnly is true. Are we missing something here? Under normal circumstances, if ReadOnly, RouteRandomly, and RouteByLatency are set to false, the effect of reverting this line of code should be the same.

I've reread all the comments, and I think we should assume that multiple failures together caused this issue; as mentioned above, if it were just a single failure, it would be contradictory. From the beginning, your operations on the Redis cluster included commands like set, ttl, and mget, with ReadOnly and RouteRandomly enabled. This would cause write commands (set/ttl) to be sent to the master, while read commands (mget) would be sent to both the master and the slaves. Given your QPS of 350K, when one of the errors newly marked by #2961 occurs (such as the read ... i/o timeout error you described in the comments), node.MarkAsFailing() is triggered and that slave becomes unavailable for 10 seconds. During this period, set, ttl, and mget commands all default to the master, causing a surge in traffic to the master node and subsequently triggering more failures.

You can perform the following actions: log the locations where node.MarkAsFailing() is triggered as introduced in #2961, along with the error types and associated node information, and compile a statistical summary in the following format:

```
node(10.24.94.166:6379) - node.Client.getConn(err info...) - count 10
node(10.24.70.53:6379) - WithWriter(err info...) - count 21
node(10.24.92.75:6379) - Read(err info...) - count 28
```
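A sketch of how such a tally could be collected in a locally patched client; the type and its placement are hypothetical, and only the output format follows the suggestion above:

```go
package redis // dropped into a locally patched checkout next to cluster.go

import (
	"fmt"
	"sort"
	"sync"
)

// failingTally counts node.MarkAsFailing() triggers per node address, call
// site (e.g. "getConn", "WithWriter", "Read"), and error text.
type failingTally struct {
	mu     sync.Mutex
	counts map[string]int
}

var markFailingTally = failingTally{counts: map[string]int{}}

// Record would be called right before each MarkAsFailing() introduced in #2961.
func (t *failingTally) Record(addr, site string, err error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.counts[fmt.Sprintf("node(%s) - %s(%v)", addr, site, err)]++
}

// Summary renders lines like: node(10.24.94.166:6379) - Read(i/o timeout) - count 28
func (t *failingTally) Summary() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := make([]string, 0, len(t.counts))
	for k, n := range t.counts {
		out = append(out, fmt.Sprintf("%s - count %d", k, n))
	}
	sort.Strings(out)
	return out
}
```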
@monkey92t The test I did yesterday was with |
@monkey92t I'm not very familiar with redis internals, so what I'm saying here might be totally wrong. But from my understanding, let's say we have N redis nodes (1 master, N-1 slaves). What happened here is basically that as long as we get N-1 timeouts from the slaves within 10s, all the slaves are marked as bad and all requests go to the master, which will likely overload the master and cause higher latency. (And before we hit N-1 timeouts, any single timeout removes a slave from future requests for 10s, which likely overloads the other slaves and makes them more likely to time out.) And here, with 350k qps (which is 3.5 million requests per 10s) and a usually very small N, this is almost bound to happen.
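A back-of-the-envelope version of that argument; the per-request timeout probability and the replica count are assumed numbers, used only to show the scale:

```go
package main

import "fmt"

func main() {
	const (
		qps           = 350_000 // reported MGET peak QPS
		windowSeconds = 10      // how long MarkAsFailing sidelines a node
		pTimeout      = 1e-5    // assumed probability that any one request times out
		replicas      = 2       // assumed number of slaves (N-1)
	)
	expectedTimeouts := float64(qps*windowSeconds) * pTimeout
	fmt.Printf("expected timeouts per %ds window: %.0f (only %d needed to sideline every replica)\n",
		windowSeconds, expectedTimeouts, replicas)
	// With 3.5M requests per window, even a tiny timeout rate yields dozens of
	// timeouts, so all replicas are very likely to be marked failing at once.
}
```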
@tim-zhu-reddit if we look again at the check above: the second argument of isBadConn (the flag for allowing timeouts) is passed as false there, so i/o timeouts end up marking the node as failing. Can you try increasing the timeouts in the client options and see if the traffic shifts again to the master?
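For reference, a sketch of the timeout knobs being referred to (the values are placeholders, not recommendations):

```go
package main

import (
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:        []string{"redis-cluster.internal:6379"}, // placeholder
		DialTimeout:  5 * time.Second,
		ReadTimeout:  3 * time.Second, // raising these makes transient slowness less
		WriteTimeout: 3 * time.Second, // likely to surface as i/o timeout errors
	})
	defer rdb.Close()
}
```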
I am also suspecting something like this happening.
@fishy , @tim-zhu-reddit please try with bigger timeouts and report back so we can determine if we should either:
And mark this as resolved :) |
I would propose that we make the behavior introduced by #2961 optional. Basically, for setups with enough redundancy, #2961 is mostly good, as it helps them avoid "bad" nodes for a bit. But for high-throughput use cases like ours, any transient network error (a timeout or otherwise) removes one slave for 10s, which cascades into more timeouts (as every remaining node handles more requests) and quickly leads to all slaves being marked as bad and the master node being overloaded. @ndyakov what do you think?
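To illustrate the proposal, one possible shape is an opt-in flag on the options struct; the field below is hypothetical and does not exist in go-redis:

```go
package redis

// Hypothetical sketch only: this field does not exist in go-redis today; it
// just illustrates the "make #2961 optional" idea above.
type clusterOptionsSketch struct {
	ReadOnly      bool // existing flag, shown for context
	RouteRandomly bool // existing flag, shown for context

	// MarkFailingOnBadConn (hypothetical): when false, skip the
	// node.MarkAsFailing() call added in #2961, i.e. keep the pre-9.5.2 behavior.
	MarkFailingOnBadConn bool
}
```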
@fishy , @tim-zhu-reddit , @monkey92t I don't think our analysis is correct. Let's take a step back.
In any case, I do think we should gain as much information as possible and document the findings. There may be something in the way the client initiates connections that results in this increase in latency.
Expected Behavior
We use redis pipelining (https://pkg.go.dev/github.com/go-redis/redis#Pipeline.Exec) to execute a series of SET commands.
We expected the Pipeline.Exec() call to take less than 200ms at p99 and 2ms at p50.
We use MGET (https://pkg.go.dev/github.com/go-redis/redis#Ring.MGet) to get keys from redis, and expect latency of 1ms at p50 and 100ms at p99 at peak times.
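For reference, a minimal sketch of the read path described above, assuming the cluster client discussed later in the thread (addresses and key names are placeholders):

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClusterClient(&redis.ClusterOptions{Addrs: []string{"redis-cluster.internal:6379"}})
	// MGET returns one value (or nil) per requested key in a single command.
	// Keys are assumed to hash to the same slot (e.g. via hash tags), since
	// Redis Cluster rejects cross-slot multi-key commands.
	vals, err := rdb.MGet(ctx, "name-a", "name-b", "name-c").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(vals)
}
```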
Current Behavior
With version 9.5.2 (https://github.com/redis/go-redis/releases/tag/v9.5.2) and higher, p99 latency spiked above 4s for pipeline SET (causing timeouts) and above 2s for MGET, while p50 latency remained steady under 2ms and 1ms respectively.
We first noticed the issue after upgrading the version from 9.2.0 to 9.7.0. We bisect-tested different versions and determined 9.5.2 was the first bad version with increased latency.
Possible Solution
Steps to Reproduce
Context (Environment)
We use redis as a simple dedupe layer, acting like an LRU cache. When we see a new name, we write it to Redis, then check Redis to see if we have seen this name.
Peak-hour QPS is 350k for MGET and 25k for pipeline requests (max 25 SET commands in each pipeline request).
This latency issue seems to impact high-QPS services like this one; we didn't hear of other internal teams using the same go-redis version reporting a similar issue.
Detailed Description
Possible Implementation