
[tmail.linagora.com] Redis outage makes JMAP unusable #1247

Closed
quantranhong1999 opened this issue Oct 16, 2024 · 15 comments
Assignees
Labels
bug Something isn't working

Comments

@quantranhong1999
Member

quantranhong1999 commented Oct 16, 2024

We enabled eventBus.redis.failure.ignore=true, but when Redis is down, TMail still seems to time out completely.

Even a JMAP well-known request (which requires no Redis usage) times out too. My feeling is that the Redis timeouts somehow end up blocking our schedulers afterwards.

Note that webadmin /healthcheck still works normally.

TODO: investigate the issue.
Suggestion: keep Redis down for long enough and fire many JMAP requests at TMail in the meantime, to try to reproduce the total timeout issue.

Related materials that we may revise:

cc @Arsnael @vttranlina

@quantranhong1999 quantranhong1999 added the bug Something isn't working label Oct 16, 2024
@chibenwa
Member

Usual suspects:

  • blocking code on .parallel() or in the drivers, rendering those threads unusable
  • Suggestion: do a thread dump (commands noted below)
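
For reference, a thread dump of the TMail JVM can be captured with the standard JDK tools, e.g. jcmd <pid> Thread.print or jstack <pid>, where <pid> is the TMail process id.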

@chibenwa
Member

(and 100% agree with the issue)

@hungphan227 hungphan227 self-assigned this Dec 9, 2024
@hungphan227
Contributor

hungphan227 commented Dec 12, 2024

After trying with integration tests, a local distributed TMail app and the staging env, I can see that Redis being down significantly degrades the response time of JMAP requests. However, I did not see any hanging issue.

Here is the perf test result after shutting down Redis on staging (5000 users, 1 TMail JMAP instance):

[Screenshot from 2024-12-12: performance test results]

When I checked the logs, I saw a lot of warnings coming from this code in TMailEventDispatcher:

// For each routing key, publish the event to the matching Redis pub/sub channels,
// bounded by the configured timeout; on Redis errors (when failureIgnore is enabled),
// log a warning and continue.
return Flux.fromIterable(routingKeys)
      .flatMap(routingKey -> getTargetChannels(routingKey)
          .flatMap(channel -> redisPublisher.publish(channel, KeyChannelMessage.from(eventBusId, routingKey, eventAsJson).serialize()))
          .timeout(redisEventBusConfiguration.durationTimeout())
          .onErrorResume(REDIS_ERROR_PREDICATE.and(e -> redisEventBusConfiguration.failureIgnore()), e -> {
              LOGGER.warn("Error while dispatching event to remote listeners", e);
              return Flux.empty();
          })
          .then())
      .then();

To improve the response time when Redis is down, my idea is to remove the Redis call from the reactive chain (fire-and-forget):

return Flux.fromIterable(routingKeys)
    .flatMap(routingKey -> getTargetChannels(routingKey)
        .flatMap(channel -> {
            // Subscribe to the Redis publish on its own, so the caller no longer
            // waits for Redis (or for its timeout) to complete.
            redisPublisher.publish(channel, KeyChannelMessage.from(eventBusId, routingKey, eventAsJson).serialize())
                .timeout(redisEventBusConfiguration.durationTimeout())
                .onErrorResume(REDIS_ERROR_PREDICATE.and(e -> redisEventBusConfiguration.failureIgnore()), e -> {
                    LOGGER.warn("Error while dispatching event to remote listeners", e);
                    return Mono.empty();
                }).subscribe();
            return Mono.empty();
        })
        .then())
    .then();

@chibenwa
Member

Being slow is ok.

Being down is not.

Even a JMAP well-known request (which requires no Redis usage) times out too. My feeling is that the Redis timeouts somehow end up blocking our schedulers afterwards.

That's the issue, not being slow.

As far as I can tell, we failed to reproduce it. Is there a delta between the testing env and prod? Is there a delta between the workload during the tests and the actual workload?

@quantranhong1999 which environment was this noticed in?

@quantranhong1999
Member Author

@quantranhong1999 which environment was this noticed in?

tmail.linagora.com.

The Rspamd VMs (where Redis is deployed) were down for a few minutes, which left users almost unable to use TMail Web: JMAP requests hung for a very long time.
(TMail did not respond to those JMAP requests at all; it was not merely slow.)

@vttranlina
Member

The tmail-apisix-plugin-runner also checks revoked tokens in Redis.
This plugin already supports query timeouts and error ignore options, but it hasn't been tested with Gatling for performance. Hung tested the environment with basic authentication, while we use OIDC with APISIX on tmail.linagora.com.

One more point: tmail-apisix-plugin-runner uses Spring in a "non-reactive" (blocking) way.

but when Redis is down, TMail still seems to time out completely.

In which phase does the timeout occur? Does the client's JMAP HTTP request time out?

@quantranhong1999
Member Author

In which phase does the timeout occur? Does the client's JMAP HTTP request time out?

You open the TMail web, and you are stuck at the first JMAP well-known request (which does not require any Redis on the TMail side).

It could be related to the Apisix plugin.

@hungphan227
Contributor

I have conducted a performance test to check whether the root cause is the Apisix plugin (RevokedTokenPlugin):

  • Local env: tmail-backend-memory, apisix, redis,...
  • Shut down Redis, then use 150 threads to send JMAP requests to Apisix (the Redis timeout in RevokedTokenPlugin is 5s). Each thread sends a new request as soon as it receives the previous response from Apisix.

Test result (duration: 7 min):
[Image: latency results of the performance test]

We can see that the average latency is 45s and the max latency is 99s. Therefore, I agree that the root cause is the Apisix plugin.

Solution:
Use a circuit breaker library (https://resilience4j.readme.io/docs/circuitbreaker): once the number of failed Redis requests exceeds a configured threshold, any further calls to Redis are skipped, and the library retries after a configured duration. A minimal sketch follows below.
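
A minimal sketch of that idea, assuming a resilience4j CircuitBreaker around the plugin's Redis lookup (RevokedTokenStore and its isRevoked method are hypothetical stand-ins, not the plugin's actual code):

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;

public class GuardedRevokedTokenCheck {
    // Hypothetical stand-in for the plugin's Redis-backed revoked-token lookup.
    interface RevokedTokenStore {
        boolean isRevoked(String token);
    }

    private final RevokedTokenStore store;
    private final CircuitBreaker breaker = CircuitBreaker.of("revokedTokenRedis",
        CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                        // open once 50% of calls fail...
            .slidingWindowSize(10)                           // ...over the last 10 recorded calls
            .waitDurationInOpenState(Duration.ofSeconds(30)) // probe Redis again after 30s
            .build());

    public GuardedRevokedTokenCheck(RevokedTokenStore store) {
        this.store = store;
    }

    public boolean isRevoked(String token) {
        try {
            // While the circuit is closed, this is a regular Redis lookup.
            return breaker.executeSupplier(() -> store.isRevoked(token));
        } catch (RuntimeException e) {
            // Circuit open or Redis error: behave like failure ignore and let the token through.
            return false;
        }
    }
}

With the circuit open, requests skip the 5s Redis timeout entirely instead of each paying it.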

@chibenwa
Member

Well, this asymmetry between the response times and the timeouts of the underlying system is classic of poll-based blocking-IO systems.

A circuit breaker is just an additional layer for hiding blocking IOs in a thread pool.

Sure, it would work, but it won't solve the underlying fundamental flaws in the plugin and would needlessly degrade the service.

How about making the filter asynchronous / reactive?

My little finger tells me it should be easy: the API is callback compatible.

My bet is a 500ms timeout + asynchronous and the pain is gone!

@vttranlina
Member

How about making the filter asynchronous / reactive?

We can't, as the tmail plugin depends on the core apisix plugin.

		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-web</artifactId>
		</dependency>
		<dependency>
			<groupId>org.apache.apisix</groupId>
			<artifactId>apisix-runner-starter</artifactId>
			<version>0.4.0</version>
		</dependency>

And this is a blocking web service.

@chibenwa
Member

God damn gosh it, you boys.

Just because an interface exposes no Future does not mean you can't make it reactive.

Callback-based interfaces are naturally compatible with future / reactive code.
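
As an illustration of the point (a sketch only, not the actual change in #1412; the Continuation interface and the revoked_token key below are hypothetical simplifications, not the plugin-runner filter API), a callback-style filter can delegate to a non-blocking Lettuce lookup and resume the chain from the subscription callback:

import io.lettuce.core.api.reactive.RedisReactiveCommands;
import reactor.core.publisher.Mono;

import java.time.Duration;

public class ReactiveRevokedTokenFilter {
    // Simplified, hypothetical callback contract standing in for the real filter chain.
    interface Continuation {
        void proceed();
        void reject(int statusCode);
    }

    private final RedisReactiveCommands<String, String> redis;

    public ReactiveRevokedTokenFilter(RedisReactiveCommands<String, String> redis) {
        this.redis = redis;
    }

    public void filter(String token, Continuation chain) {
        redis.exists("revoked_token:" + token)         // non-blocking lookup: no thread is parked
            .map(count -> count > 0)
            .timeout(Duration.ofMillis(500))           // the suggested 500ms budget
            .onErrorResume(e -> Mono.just(false))      // Redis down: ignore the failure
            .subscribe(revoked -> {
                if (revoked) {
                    chain.reject(401);                 // revoked token: block the request
                } else {
                    chain.proceed();                   // otherwise resume the filter chain
                }
            });
    }
}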

@chibenwa
Member

Done in 30 mins

#1412

BTW, if you go that way, the Netty API is not reactive either, if you see what I mean!

@chibenwa
Member

@chibenwa
Member

Rectification:

Well, this asymmetry between the response times and the timeouts of the underlying system is classic of poll-based blocking-IO systems.

Should be:

Well, this asymmetry between the response times and the timeouts of the underlying system is classic of blocking-on-the-IO-threads systems!

@chibenwa
Member

chibenwa commented Jan 3, 2025

Ok, with this new image I did:

I could not...

I propose to close this issue, keep tracking the upstream PR, and open follow-up tickets for the remaining environments.

@chibenwa chibenwa closed this as completed Jan 3, 2025