Brooklin takes a long time to recover from errors in the destination cluster. Recovery sometimes involves multiple rebalance cycles, and replication halts completely during that time.
Operating System 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20) x86_64 Linux
Brooklin version 1.0.2
Java version openjdk version "1.8.0_212"
Kafka version 2.1.0
ZooKeeper version 3.4.10
Steps to reproduce
I have run some tests, documented at https://github.com/AppsFlyer/kafka-mirror-tester/blob/master/results-brooklin.md. As described there, several scenarios can trigger this.
One such scenario is to restart a broker at the destination cluster.
This results in errors for as long as the broker is down, which is understandable. But even long after the broker is back up, Brooklin continues to report errors, up to a complete halt of replication to the entire cluster (not only to the broker that failed).
Expected behaviour
We expect Brooklin to recover gracefully and not halt replication during the rebalance cycle.
We expect to see just a single, hopefully short, rebalance. Instead we see multiple cycles that sometimes take quite long (10-15 minutes).
Actual behaviour
Brooklin takes a long time (10-15 minutes, sometimes more) to recover. During that time replication alternates: it runs for a while, halts completely, resumes, and then halts again.
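To make the halt cycles easier to compare between runs, they can be counted from a throughput time series. The following is only a minimal sketch: it assumes you already export (timestamp, replicated messages per second) samples from your own monitoring, and the `find_halts` helper and the sample numbers are hypothetical illustrations, not part of Brooklin or of the linked test results.

```python
from typing import List, Tuple


def find_halts(
    samples: List[Tuple[float, float]], threshold: float = 0.0
) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps of intervals where the rate is <= threshold.

    `samples` is a time-ordered list of (timestamp_seconds, messages_per_second).
    A halt starts at the first sample at or below the threshold and ends at the
    first later sample above it (or at the last sample if replication never resumes).
    """
    halts: List[Tuple[float, float]] = []
    start = None
    for ts, rate in samples:
        if rate <= threshold:
            if start is None:
                start = ts  # replication just dropped to the halt level
        elif start is not None:
            halts.append((start, ts))  # replication resumed
            start = None
    if start is not None:
        halts.append((start, samples[-1][0]))  # still halted at end of data
    return halts


# Made-up numbers for illustration only (not measured results).
example = [(0, 100.0), (60, 0.0), (120, 0.0), (180, 90.0), (240, 0.0), (300, 80.0)]
halts = find_halts(example)
print(len(halts), "halt interval(s):", halts)
```

With a healthy recovery we would expect at most one short interval here; several intervals spread over many minutes would match the behaviour described above.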