Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transactional Producer instance gets timeout getting commit lock while second instance starts #803

Open
PatrickChauveau opened this issue Jun 12, 2024 · 3 comments
Labels
verified bug Something isn't working

Comments

@PatrickChauveau
Copy link

PatrickChauveau commented Jun 12, 2024

Hi,

While instance A is consuming/producing records from topic full of records, instance B is started, the following sequence takes place:

  1. time 0:
  • Instance B requests joining group, at this time several timeout clock start.
  • Instance A stops consuming and produces message Request joining group due to: group is already rebalancing every 3sec.
  1. time 0 + max.poll.interval.ms:
  • Instance B produces message JoinGroup failed: The coordinator is not aware of this member
  • Instance A produces message: consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms
  • Instance B continues requesting joining group and then produces this sequence of messages:

Successfully joined group with generation Generation{...}
Successfully synced group in generation Generation{...}
Notifying assignor about the new Assignment(partitions=[...])
Assigned 4 total (4 new) partition(s) [...]
Setting offset for partition [...]

  • Instance B starts consuming
  1. time 0 + request.timeout.ms:
  • Instance A outputs message Resetting the last seen epoch of partition
  1. time 0 + commitLockAcquisitionTimeout:
  • Instance A finally produces stack trace with following messages and is definitely stuck
User provided listener io.confluent.parallelconsumer.ParallelEoSStreamProcessor failed on invocation of onPartitionsRevoked for partitions
Caused by: java.util.concurrent.TimeoutException: Timeout getting commit lock (which was set to PT5M). Slow processing or too many records being ack'd? Try increasing the commit lock timeout (commitLockAcquisitionTimeout), or reduce your record processing time.
Error from poll control thread, will attempt controlled shutdown, then rethrow. Error: There is a newer producer with the same transactionalId which fences the current one

The used configuration is:

  • PC version : 0.5.2.8
  • number of partitions : 4
  • max.poll.records: 150
  • max.poll.interval.ms: 180000
  • request.timeout.ms: 210000
  • linger.ms = 65
  • delivery.timeout.ms: 240000
  • allowEagerProcessingDuringTransactionCommit=false,
  • commitLockAcquisitionTimeout=PT5M,
  • produceLockAcquisitionTimeout=PT1M,
  • commitInterval=PT0.1S,
  • ordering=UNORDERED,
  • commitMode=PERIODIC_TRANSACTIONAL_PRODUCER,
  • maxConcurrency=64,
  • invalidOffsetMetadataPolicy=FAIL,
  • defaultMessageRetryDelay=PT1S,
  • retryDelayProvider=null,
  • sendTimeout=PT10S,
  • offsetCommitTimeout=PT10S,
  • batchSize=1,
  • thresholdForTimeSpendInQueueWarning=PT10S,
  • maxFailureHistory=10,
  • shutdownTimeout=PT10S,
  • drainTimeout=PT30S,
  • messageBufferSize=50,
  • initialLoadFactor=2,
  • maximumLoadFactor=100
@rkolesnev
Copy link
Contributor

Hi @PatrickChauveau - thanks for report.

Is the issue consistently reproducible?
Just starting 2nd instance when 1 instance is active is enough to observe it?

Would it be possible for you to create a reproducible example ? Either as a simple app or integration test - as that would help a lot to identify potential bug.

@rkolesnev
Copy link
Contributor

rkolesnev commented Jun 14, 2024

Yep - i have found a race condition - it is between commitOffsets on normal work and commitOffsets due to partition revocation.
We already had added additional checks / guards there - but apparently - still have synchronisation issue somewhere leading to deadlock.
Will investigate further to flush it out.

@PatrickChauveau
Copy link
Author

Glad you found a clue! It does not appear every time, might be 50% occurrences. We set in place health endpoint associated with kubernetes liveness probe to thwart the issue (it needs sometimes 2 or 3 restarts to stabilize). ParallelEosStreamProcessor.getFailureCause() helped a lot for this! Failure message: Error from poll control thread: There is a newer producer with the same transactionalId which fences the current one.

@rkolesnev rkolesnev added the verified bug Something isn't working label Jun 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
verified bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants