[tmail.linagora.com] Redis outage makes JMAP unusable #1247
Comments
Casual suspects:
|
(and 100% agree with the issue) |
After trying with integration tests, a local distributed tmailapp, and the staging env, I see that Redis being down significantly increases the response time of JMAP requests. However, I did not see any hanging issue. Here is the perf test result after shutting down Redis on staging (5000 users, 1 TMail JMAP instance): When I checked the logs, I saw a lot of warnings coming from this code in TMailEventDispatcher:
To improve response time when Redis is down, my idea is to remove the Redis call from the chain:
|
Being slow is ok. Being down is not.
That's the issue, not being slow. As far as I can tell, we failed to reproduce it. Is there a delta between the testing env and prod? Is there a delta between the workload during the tests and the actual workload? @quantranhong1999 which environment was this noticed in? |
tmail.linagora.com. Rspamd VMs (with Redis deployed) were down for a few minutes, which left users almost unable to use TMail Web (unresponsive, with JMAP requests waiting for a very long time). |
The tmail-apisix-plugin-runner also checks revoked tokens in Redis. One more point: in which phase does the timeout occur? Does the client send an HTTP JMAP request and get a timeout? |
You open the TMail web, and you are stuck at the first JMAP well-known request (which does not require any Redis on the TMail side). It could be related to the Apisix plugin. |
I have conducted a performance test to check whether the root cause is the Apisix plugin (RevokedTokenPlugin):
Test result (duration: 7 min): We can see that the average latency is 45s and the max latency is 99s. Therefore, I agree that the root cause is the Apisix plugin. Solution: |
Well, this asymmetry between response times and timeouts of the underlying system is classic for poll-based blocking-IO systems. A circuit breaker is just an additional layer for hiding blocking IOs in a thread pool. Sure, it would work, but it won't solve the underlying fundamental flaws in the plugin and would needlessly downgrade the service. How about making the filter asynchronous / reactive? My little finger tells me it should be easy: the API is callback compatible. My bet is |
We can't, as the tmail plugin depends on the core apisix plugin.
And this is blocking web service |
God damn gosh it you boys! It's not because there is no future in an interface that you can't make it reactive. Callback-based interfaces are naturally compatible with future / reactive code. |
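A minimal sketch of the point above, assuming a callback-style filter API. `FilterChain`, `Request`, `Response`, and `RevokedTokenFilter` are hypothetical stand-ins written for illustration and are not the real APISIX runner types; the fail-open policy on timeout is also an assumption. The filter returns immediately and resumes the chain from the future's callback, so the calling thread is never blocked on the Redis lookup.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical callback-style chain: the filter decides when to call proceed().
interface FilterChain {
    void proceed(Request request, Response response);
}

final class Request {
    final String token;

    Request(String token) {
        this.token = token;
    }
}

final class Response {
    volatile int status = 200;
}

final class RevokedTokenFilter {
    // Hypothetical async lookup, e.g. backed by a non-blocking Redis client.
    private final Function<String, CompletableFuture<Boolean>> isRevoked;
    private final long timeoutMillis;

    RevokedTokenFilter(Function<String, CompletableFuture<Boolean>> isRevoked, long timeoutMillis) {
        this.isRevoked = isRevoked;
        this.timeoutMillis = timeoutMillis;
    }

    // Returns right away; the chain continues once the lookup completes,
    // fails, or times out. No thread waits on the lookup.
    void filter(Request request, Response response, FilterChain chain) {
        isRevoked.apply(request.token)
            .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
            // Assumption: fail open (treat the token as not revoked) on error or timeout.
            .exceptionally(error -> false)
            .thenAccept(revoked -> {
                if (revoked) {
                    response.status = 401;
                }
                chain.proceed(request, response);
            });
    }
}
```

Whether to fail open or fail closed when the lookup is unavailable is a security decision that this sketch does not settle.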
Done in 30 mins. BTW, if you go that way, the Netty API is not reactive either, if you see what I mean! |
OMG, we were blocking on the event loop of the UNIX socket of the APISIX runner. |
Rectification:
Should be: Well, this asymmetry between response times and timeouts of the underlying system is classic for blocking-on-the-IO-threads systems! |
Ok, with this new image I did:
I could not...
I propose to close this issue, keep tracking upstream PR, and open follow up tickets for remaining envs. |
We enabled eventBus.redis.failure.ignore=true, but when Redis is down, TMail still seems to time out entirely. Even a JMAP well-known request (which requires no Redis usage) times out too. My feeling is that a Redis timeout sometimes blocks our schedulers afterward.
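To illustrate the hypothesis above (not to diagnose the actual TMail code), here is a small sketch of how a blocking call on a shared scheduler can delay work that needs no Redis at all. `SchedulerStarvationDemo` and `slowRedisCall` are hypothetical names, and the single-thread executor is an assumed stand-in for a small shared scheduler or event loop.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class SchedulerStarvationDemo {
    // Hypothetical stand-in for a Redis call that hangs until its timeout.
    static void slowRedisCall(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns how long (in ms) a Redis-free task waited behind a blocking
    // task when both share a one-thread scheduler.
    static long measureUnrelatedLatencyMillis(long redisTimeoutMillis) {
        ExecutorService scheduler = Executors.newSingleThreadExecutor();
        try {
            long start = System.nanoTime();
            // The blocking call occupies the only scheduler thread.
            scheduler.submit(() -> slowRedisCall(redisTimeoutMillis));
            // This task needs no Redis, yet it is queued behind the blocked one.
            Future<Long> unrelated = scheduler.submit(() -> System.nanoTime() - start);
            return TimeUnit.NANOSECONDS.toMillis(unrelated.get());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

If something like this were happening, the Redis-free task would wait roughly as long as the Redis timeout, which would match a well-known request timing out while Redis is down.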
Note that I can still call the webadmin /healthcheck normally.
TODO: investigate the issue.
Suggestion: Let Redis be down for a good enough time, and fire many JMAP requests to TMail in the meantime to try to reproduce the total timeout issue.
Related materials that we may revise:
cc @Arsnael @vttranlina