Guidelines For Development of Kafka Producers and Consumers
Make sure you track the 'lag' between the Kafka retention window (the messages still retained in a topic) and the current offset of the slowest consumer. If the slowest consumer falls too far behind, it will miss messages, and it may start behaving strangely when it requests offsets for which no message exists any longer.
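A minimal sketch of that check, assuming the three offsets for a partition have already been obtained from whatever client or monitoring tool is in use. The names `PartitionOffsets`, `consumer_lag`, `retention_headroom` and `is_at_risk` are hypothetical and not part of any Kafka API:

```python
from dataclasses import dataclass


@dataclass
class PartitionOffsets:
    # All three fields are illustrative; fill them from your own client or tooling.
    earliest: int   # oldest offset the broker still retains
    latest: int     # log end offset (offset the next message will get)
    consumer: int   # next offset the slowest consumer will read


def consumer_lag(p: PartitionOffsets) -> int:
    """Messages written to the partition but not yet read by the consumer."""
    return p.latest - p.consumer


def retention_headroom(p: PartitionOffsets) -> int:
    """Retained messages behind the consumer's position.

    A negative value means retention has already removed messages
    that the consumer had not read yet.
    """
    return p.consumer - p.earliest


def is_at_risk(p: PartitionOffsets, min_headroom: int) -> bool:
    """True when the consumer is close to, or already past, the retention boundary."""
    return retention_headroom(p) < min_headroom


offsets = PartitionOffsets(earliest=1000, latest=5000, consumer=1200)
print(consumer_lag(offsets), retention_headroom(offsets), is_at_risk(offsets, 500))
```

The alert is based on headroom rather than on lag alone, because a large lag is harmless as long as the retention boundary is still far behind the consumer. The threshold of 500 is an arbitrary example value.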
(From Kafka Mailing List)
Case 1: A node fails that was not the leader for any partition. There is no impact on consumers or producers.
Case 2: A leader node fails, but another in-sync node can become the leader. All publishing to and consumption from the partition whose leader failed will stop momentarily until a new leader is elected. Producers should implement retry logic for this case (and in fact for all kinds of errors from Kafka). Consumers can, depending on the use case, either move on to other partitions after a reasonable number of retries (if they fetch from partitions in round-robin fashion) or keep retrying until a leader is available.
Case 3: A leader node goes down and no other in-sync nodes are available. Publishing to and consumption from the partition will halt and will not resume until the faulty leader node recovers. Producers should fail the publish request after a reasonable number of retries and provide a callback so that the client of the producer can take corrective action. Consumers again have the choice of moving on to other partitions after a reasonable number of retries (if they fetch from partitions in round-robin fashion) or retrying until a leader is available. In the latter case, the entire consumer process will halt until the faulty node recovers.
In other words: when a leader dies, the preference is to pick a new leader from the ISR (the set of in-sync replicas). If none is available, the leader is picked from any other available replica. If no replicas are alive at all, the partition goes offline and all production and consumption halts until at least one replica is brought back online.
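The three cases can be sketched in plain Python, without a real Kafka client. `pick_leader` models the leader preference described above, and `publish_with_retry` models a producer that retries a bounded number of times and then hands the failure to a callback. All names here (`LeaderNotAvailable`, `pick_leader`, `publish_with_retry`, `send`, `on_failure`) are hypothetical and chosen for illustration. Whether a real cluster falls back to a replica outside the ISR depends on broker configuration (`unclean.leader.election.enable`).

```python
import time
from typing import Callable, Optional, Sequence, Set


class LeaderNotAvailable(Exception):
    """Hypothetical error standing in for 'the partition has no leader right now'."""


def pick_leader(replicas: Sequence[int], isr: Set[int], alive: Set[int]) -> Optional[int]:
    """Return the broker id of the next leader, or None if the partition is offline."""
    # Preferred: a live replica that is in the in-sync replica set (ISR).
    for broker in replicas:
        if broker in isr and broker in alive:
            return broker
    # Fallback: any live replica, even if it was not in sync.
    for broker in replicas:
        if broker in alive:
            return broker
    # No replica is alive: production and consumption halt.
    return None


def publish_with_retry(
    send: Callable[[bytes], None],
    message: bytes,
    max_attempts: int = 5,
    backoff_seconds: float = 0.1,
    on_failure: Optional[Callable[[bytes, Optional[Exception]], None]] = None,
) -> bool:
    """Call send() up to max_attempts times; invoke on_failure if every attempt fails."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            send(message)
            return True
        except LeaderNotAvailable as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff gives the cluster time to elect a new leader.
                time.sleep(backoff_seconds * (2 ** attempt))
    if on_failure is not None:
        # Let the caller take corrective action (log, buffer, alert, ...).
        on_failure(message, last_error)
    return False
```

A consumer that fetches round-robin could use the same bounded-retry shape and skip to the next partition when it returns `False`. A consumer that must not skip data would loop without a retry limit instead.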
Open Question:
Does the Storm/Kafka Bolt take care of these issues?