While instance A is consuming and producing records from a topic full of records, instance B is started, and the following sequence takes place:
time 0:
Instance B requests to join the group; at this point several timeout clocks start.
Instance A stops consuming and logs the message Request joining group due to: group is already rebalancing every 3 seconds.
time 0 + max.poll.interval.ms:
Instance B logs the message JoinGroup failed: The coordinator is not aware of this member
Instance A logs the message: consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms
Instance B continues requesting to join the group and then logs this sequence of messages:
Successfully joined group with generation Generation{...}
Successfully synced group in generation Generation{...}
Notifying assignor about the new Assignment(partitions=[...])
Assigned 4 total (4 new) partition(s) [...]
Setting offset for partition [...]
Instance B starts consuming
time 0 + request.timeout.ms:
Instance A logs the message Resetting the last seen epoch of partition
time 0 + commitLockAcquisitionTimeout:
Instance A finally logs a stack trace with the following messages and is permanently stuck:
User provided listener io.confluent.parallelconsumer.ParallelEoSStreamProcessor failed on invocation of onPartitionsRevoked for partitions
Caused by: java.util.concurrent.TimeoutException: Timeout getting commit lock (which was set to PT5M). Slow processing or too many records being ack'd? Try increasing the commit lock timeout (commitLockAcquisitionTimeout), or reduce your record processing time.
Error from poll control thread, will attempt controlled shutdown, then rethrow. Error: There is a newer producer with the same transactionalId which fences the current one
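Since the configuration from the original report is not shown here, below is an illustrative sketch (not the reporter's actual setup) of where the timeouts named in the timeline live: max.poll.interval.ms and request.timeout.ms are plain Kafka client settings, while the commit lock timeout (PT5M in the stack trace) is internal to parallel-consumer and is only referenced in a comment, since whether it is tunable through ParallelConsumerOptions depends on the library version. Broker address, group id and transactional id are placeholders.

```java
// Illustrative sketch only - not the reporter's configuration.
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelConsumerOptions.CommitMode;
import io.confluent.parallelconsumer.ParallelEoSStreamProcessor;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TimeoutConfigSketch {

    public static ParallelEoSStreamProcessor<String, String> build() {
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
        consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // parallel-consumer manages offsets
        // time 0 + max.poll.interval.ms: "consumer poll timeout has expired" on instance A
        consumerProps.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");
        // time 0 + request.timeout.ms: "Resetting the last seen epoch of partition" on instance A
        consumerProps.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // A transactional id is what makes the "newer producer with the same transactionalId
        // which fences the current one" error possible after the rebalance.
        producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-app-tx");       // placeholder

        ParallelConsumerOptions<String, String> options = ParallelConsumerOptions.<String, String>builder()
                .consumer(new KafkaConsumer<>(consumerProps, new StringDeserializer(), new StringDeserializer()))
                .producer(new KafkaProducer<>(producerProps, new StringSerializer(), new StringSerializer()))
                .commitMode(CommitMode.PERIODIC_TRANSACTIONAL_PRODUCER)
                // time 0 + commitLockAcquisitionTimeout: the TimeoutException mentions PT5M; whether this
                // is tunable via the options builder depends on the library version (assumption).
                //.commitLockAcquisitionTimeout(java.time.Duration.ofMinutes(5))
                .build();

        return new ParallelEoSStreamProcessor<>(options);
    }
}
```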
Is the issue consistently reproducible?
Is starting a 2nd instance while 1 instance is active enough to observe it?
Would it be possible for you to create a reproducible example? Either as a simple app or an integration test - that would help a lot in identifying the potential bug.
Yep - I have found a race condition - it is between commitOffsets during normal work and commitOffsets due to partition revocation.
We had already added additional checks / guards there - but apparently there is still a synchronisation issue somewhere leading to a deadlock.
Will investigate further to flush it out.
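To make the contention concrete, here is a toy, library-agnostic sketch of the pattern being described: two code paths competing for one commit lock, with the onPartitionsRevoked-time commit timing out the way the stack trace above shows. This is not the parallel-consumer source, just an illustration of the failure mode.

```java
// Toy illustration of commit-lock contention; NOT the parallel-consumer implementation.
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class CommitLockContentionSketch {

    private final ReentrantLock commitLock = new ReentrantLock();

    /** Regular commit path, running periodically on the control thread. */
    void commitOffsetsNormalWork() throws InterruptedException {
        commitLock.lock();
        try {
            // Simulates a commit that is stuck waiting for in-flight records to be acked,
            // which in turn cannot finish while the rebalance is in progress.
            TimeUnit.MINUTES.sleep(10);
        } finally {
            commitLock.unlock();
        }
    }

    /** Commit path triggered from ConsumerRebalanceListener#onPartitionsRevoked. */
    void commitOffsetsOnPartitionsRevoked() throws InterruptedException {
        // Mirrors the 5-minute commit lock acquisition timeout seen in the stack trace.
        if (!commitLock.tryLock(5, TimeUnit.MINUTES)) {
            throw new IllegalStateException(
                    "Timeout getting commit lock - the revocation-time commit lost the race");
        }
        try {
            // Commit offsets for the partitions being revoked.
        } finally {
            commitLock.unlock();
        }
    }
}
```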
Glad you found a clue! It does not appear every time - maybe 50% of occurrences. We put in place a health endpoint tied to a Kubernetes liveness probe to work around the issue (it sometimes needs 2 or 3 restarts to stabilize). ParallelEoSStreamProcessor.getFailureCause() helped a lot for this! Failure message: Error from poll control thread: There is a newer producer with the same transactionalId which fences the current one.
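For anyone wanting to copy the workaround, here is a minimal sketch of such a liveness endpoint using the JDK's built-in HTTP server; the port and path are arbitrary, and it assumes getFailureCause() returns null while the processor is healthy.

```java
// Minimal liveness-endpoint sketch for the workaround described above.
import com.sun.net.httpserver.HttpServer;
import io.confluent.parallelconsumer.ParallelEoSStreamProcessor;

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class LivenessEndpointSketch {

    public static void start(ParallelEoSStreamProcessor<String, String> processor) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0); // port is arbitrary
        server.createContext("/health/live", exchange -> {
            var failure = processor.getFailureCause(); // null while the processor is healthy (assumption)
            byte[] body = (failure == null ? "UP" : "DOWN: " + failure)
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(failure == null ? 200 : 503, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start(); // point the Kubernetes livenessProbe at GET /health/live
    }
}
```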