
[tmail.linagora.com] Redis outage makes JMAP unusable #1247

Closed
quantranhong1999 opened this issue Oct 16, 2024 · 15 comments
Assignees
Labels
bug Something isn't working

Comments

@quantranhong1999
Member

quantranhong1999 commented Oct 16, 2024

We enabled eventBus.redis.failure.ignore=true, but when Redis is down, TMail still seems to time out completely.

Even a JMAP well-known request (which requires no Redis usage) times out too. My feeling is that the Redis timeouts somehow end up blocking our schedulers afterwards.

Note that webadmin /healthcheck still works normally.

TODO: investigate the issue.
Suggestion: keep Redis down for long enough and fire many JMAP requests at TMail in the meantime, to try to reproduce the total timeout issue.

Related materials that we may revise:

cc @Arsnael @vttranlina

@quantranhong1999 quantranhong1999 added the bug Something isn't working label Oct 16, 2024
@chibenwa
Member

Usual suspects:

  • blocking code on .parallel() or in the drivers, rendering those threads unusable
  • Suggestion: do a thread dump (commands noted below)
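
For reference, a thread dump of the TMail JVM can be captured with the standard JDK tools, e.g. jcmd <pid> Thread.print or jstack <pid>, where <pid> is the TMail process id.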

@chibenwa
Member

(and 100% agree with the issue)

@hungphan227 hungphan227 self-assigned this Dec 9, 2024
@hungphan227
Contributor

hungphan227 commented Dec 12, 2024

After trying with integration tests, a local distributed TMail app and the staging env, I can see that Redis being down significantly degrades the response time of JMAP requests. However, I did not see any hanging issue.

Here is the perf test result after shutting down Redis on staging (5000 users, 1 TMail JMAP instance):

[Screenshot from 2024-12-12: performance test results]

When I checked the logs, I saw a lot of warnings coming from this code in TMailEventDispatcher:

// For each routing key, publish the event to the matching Redis pub/sub channels,
// bounded by the configured timeout; on Redis errors (when failureIgnore is enabled),
// log a warning and continue.
return Flux.fromIterable(routingKeys)
      .flatMap(routingKey -> getTargetChannels(routingKey)
          .flatMap(channel -> redisPublisher.publish(channel, KeyChannelMessage.from(eventBusId, routingKey, eventAsJson).serialize()))
          .timeout(redisEventBusConfiguration.durationTimeout())
          .onErrorResume(REDIS_ERROR_PREDICATE.and(e -> redisEventBusConfiguration.failureIgnore()), e -> {
              LOGGER.warn("Error while dispatching event to remote listeners", e);
              return Flux.empty();
          })
          .then())
      .then();

To improve the response time when Redis is down, my idea is to remove the Redis call from the reactive chain (fire-and-forget):

return Flux.fromIterable(routingKeys)
    .flatMap(routingKey -> getTargetChannels(routingKey)
        .flatMap(channel -> {
            // Subscribe to the Redis publish on its own, so the caller no longer
            // waits for Redis (or for its timeout) to complete.
            redisPublisher.publish(channel, KeyChannelMessage.from(eventBusId, routingKey, eventAsJson).serialize())
                .timeout(redisEventBusConfiguration.durationTimeout())
                .onErrorResume(REDIS_ERROR_PREDICATE.and(e -> redisEventBusConfiguration.failureIgnore()), e -> {
                    LOGGER.warn("Error while dispatching event to remote listeners", e);
                    return Mono.empty();
                }).subscribe();
            return Mono.empty();
        })
        .then())
    .then();

@chibenwa
Member

Being slow is ok.

Being down is not.

Even a JMAP well-known request (which requires no Redis usage) times out too. My feeling is that the Redis timeouts somehow end up blocking our schedulers afterwards.

That's the issue, not being slow.

As far as I can tell, we failed to reproduce it. Is there a delta between the testing env and prod? Is there a delta between the workload during the tests and the actual workload?

@quantranhong1999 which environment was this noticed in?

@quantranhong1999
Member Author

@quantranhong1999 which environment was this noticed in?

tmail.linagora.com.

The Rspamd VMs (where Redis is deployed) were down for a few minutes, which left users almost unable to use TMail Web: JMAP requests hung for a very long time.
(TMail did not respond to those JMAP requests at all; it was not merely slow.)

@vttranlina
Member

The tmail-apisix-plugin-runner also checks revoked tokens in Redis.
This plugin already supports query timeouts and error ignore options, but it hasn't been tested with Gatling for performance. Hung tested the environment with basic authentication, while we use OIDC with APISIX on tmail.linagora.com.

One more point: tmail-apisix-plugin-runner uses Spring in a "non-reactive" (blocking) way.

but when Redis is down, TMail still seems to time out completely.

In which phase does the timeout occur? Does the client's JMAP HTTP request time out?

@quantranhong1999
Member Author

In which phase does the timeout occur? Does the client's JMAP HTTP request time out?

You open the TMail web, and you are stuck at the first JMAP well-known request (which does not require any Redis on the TMail side).

It could be related to the Apisix plugin.

@hungphan227
Contributor

I have conducted a performance test to check whether the root cause is the Apisix plugin (RevokedTokenPlugin):

  • Local env: tmail-backend-memory, apisix, redis,...
  • Shut down Redis, then use 150 threads to send JMAP requests to Apisix (the Redis timeout in RevokedTokenPlugin is 5s). Each thread sends a new request as soon as it receives the previous response from Apisix.

Test result (duration: 7 min):
[Image: latency results of the performance test]

We can see that the average latency is 45s and the max latency is 99s. Therefore, I agree that the root cause is the Apisix plugin.

Solution:
Use a circuit breaker library (https://resilience4j.readme.io/docs/circuitbreaker): once the number of failed Redis requests exceeds a configured threshold, any further calls to Redis are skipped, and the library retries after a configured duration. A minimal sketch follows below.
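
A minimal sketch of that idea, assuming a resilience4j CircuitBreaker around the plugin's Redis lookup (RevokedTokenStore and its isRevoked method are hypothetical stand-ins, not the plugin's actual code):

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;

public class GuardedRevokedTokenCheck {
    // Hypothetical stand-in for the plugin's Redis-backed revoked-token lookup.
    interface RevokedTokenStore {
        boolean isRevoked(String token);
    }

    private final RevokedTokenStore store;
    private final CircuitBreaker breaker = CircuitBreaker.of("revokedTokenRedis",
        CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                        // open once 50% of calls fail...
            .slidingWindowSize(10)                           // ...over the last 10 recorded calls
            .waitDurationInOpenState(Duration.ofSeconds(30)) // probe Redis again after 30s
            .build());

    public GuardedRevokedTokenCheck(RevokedTokenStore store) {
        this.store = store;
    }

    public boolean isRevoked(String token) {
        try {
            // While the circuit is closed, this is a regular Redis lookup.
            return breaker.executeSupplier(() -> store.isRevoked(token));
        } catch (RuntimeException e) {
            // Circuit open or Redis error: behave like failure ignore and let the token through.
            return false;
        }
    }
}

With the circuit open, requests skip the 5s Redis timeout entirely instead of each paying it.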

@chibenwa
Member

Well, this asymmetry between the response times and the timeouts of the underlying system is classic of poll-based blocking-IO systems.

A circuit breaker is just an additional layer for hiding blocking IOs in a thread pool.

Sure, it would work, but it won't solve the underlying fundamental flaws in the plugin and would needlessly degrade the service.

How about making the filter asynchronous / reactive?

My little finger tells me it should be easy: the API is callback compatible.

My bet is a 500ms timeout + asynchronous and the pain is gone!

@vttranlina
Member

How about making the filter asynchronous / reactive?

We can't, as the tmail plugin depends on the core apisix plugin.

		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-web</artifactId>
		</dependency>
		<dependency>
			<groupId>org.apache.apisix</groupId>
			<artifactId>apisix-runner-starter</artifactId>
			<version>0.4.0</version>
		</dependency>

And this is a blocking web service.

@chibenwa
Member

God damn gosh it, you boys.

Just because an interface exposes no Future does not mean you can't make it reactive.

Callback-based interfaces are naturally compatible with future / reactive code.
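
As an illustration of the point (a sketch only, not the actual change in #1412; the Continuation interface and the revoked_token key below are hypothetical simplifications, not the plugin-runner filter API), a callback-style filter can delegate to a non-blocking Lettuce lookup and resume the chain from the subscription callback:

import io.lettuce.core.api.reactive.RedisReactiveCommands;
import reactor.core.publisher.Mono;

import java.time.Duration;

public class ReactiveRevokedTokenFilter {
    // Simplified, hypothetical callback contract standing in for the real filter chain.
    interface Continuation {
        void proceed();
        void reject(int statusCode);
    }

    private final RedisReactiveCommands<String, String> redis;

    public ReactiveRevokedTokenFilter(RedisReactiveCommands<String, String> redis) {
        this.redis = redis;
    }

    public void filter(String token, Continuation chain) {
        redis.exists("revoked_token:" + token)         // non-blocking lookup: no thread is parked
            .map(count -> count > 0)
            .timeout(Duration.ofMillis(500))           // the suggested 500ms budget
            .onErrorResume(e -> Mono.just(false))      // Redis down: ignore the failure
            .subscribe(revoked -> {
                if (revoked) {
                    chain.reject(401);                 // revoked token: block the request
                } else {
                    chain.proceed();                   // otherwise resume the filter chain
                }
            });
    }
}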

@chibenwa
Member

Done in 30 mins

#1412

BTW, if you go that way, the Netty API is not reactive either, if you see what I mean!

@chibenwa
Member

@chibenwa
Member

Rectification:

Well, this asymmetry between the response times and the timeouts of the underlying system is classic of poll-based blocking-IO systems.

Should be:

Well, this asymmetry between the response times and the timeouts of the underlying system is classic of blocking-on-the-IO-threads systems!

@chibenwa
Member

chibenwa commented Jan 3, 2025

Ok, with this new image I did:

I could not...

I propose to close this issue, keep tracking the upstream PR, and open follow-up tickets for the remaining environments.

@chibenwa chibenwa closed this as completed Jan 3, 2025