Add discard buffer to prevent unsynchronized access when RingBuffer full #1410
+130
−19
A service that we maintain occasionally hangs when the rate of incoming logs exceeds the throughput of writing logs to disk. We use a discard policy with the threshold at ERROR level and the default SynchronizeEnqueueWhenQueueFull setting. We believe the hang is caused by a corrupted LMAX Disruptor RingBuffer. We made some initial attempts to produce a minimal reproducer (using jcstress) but have not succeeded so far.

While reviewing the log4j code we noticed that there are code paths that bypass SynchronizeEnqueueWhenQueueFull and may be responsible for the RingBuffer corruption. I prepared a small change that should prevent any unsynchronized access and wanted to get initial feedback from the log4j maintainers. This change also 'prioritizes' non-discarded logs, which in theory should reduce logging back-pressure when the buffer is almost full. I would also like to use this pull request to pick your brain on what else we can do to troubleshoot the issue further.
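To make the discussion concrete, here is a minimal sketch of the routing decision described above: every producer takes the same lock, and once the queue is full, events at or below the discard threshold are dropped while more severe events are reported as having to wait. This is a toy model built on `java.util` only. It is not the code in this pull request, nor Log4j's or the LMAX Disruptor's implementation, and the names `QueueFullSketch`, `Route`, and `publish` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of a bounded log queue with a discard threshold (hypothetical, not Log4j code). */
final class QueueFullSketch {
    /** Ordered from most severe to least severe, so a larger ordinal means less severe. */
    enum Level { FATAL, ERROR, WARN, INFO, DEBUG }

    /** Outcome of a publish attempt. */
    enum Route { ENQUEUED, DISCARDED, BLOCKED }

    private final Queue<String> queue = new ArrayDeque<>();
    private final int capacity;
    private final Level discardThreshold;
    private final Object lock = new Object(); // single lock shared by all producers
    private long discardCount;

    QueueFullSketch(int capacity, Level discardThreshold) {
        this.capacity = capacity;
        this.discardThreshold = discardThreshold;
    }

    /** Every path, including the discard path, runs under the same lock. */
    Route publish(Level level, String message) {
        synchronized (lock) {
            if (queue.size() < capacity) {
                queue.add(message);
                return Route.ENQUEUED;
            }
            // Queue is full: drop events at or below the threshold severity.
            if (level.compareTo(discardThreshold) >= 0) {
                discardCount++;
                return Route.DISCARDED;
            }
            // A real async logger would wait for a free slot here; the sketch
            // only reports the condition so that it stays deterministic.
            return Route.BLOCKED;
        }
    }

    /** Consumer side: removes and returns the oldest message, or null if empty. */
    String poll() {
        synchronized (lock) {
            return queue.poll();
        }
    }

    long discardCount() {
        synchronized (lock) {
            return discardCount;
        }
    }
}
```

With a capacity of 1 and a threshold of `ERROR`, a first `INFO` event is enqueued, a following `ERROR` event is discarded, and a `FATAL` event is reported as `BLOCKED` until the consumer calls `poll()`.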
Writer threads hang with the following stack trace:
Meanwhile, the event processor thread is runnable but makes no progress, i.e. no logs are produced.
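One way to capture this kind of state from inside the process, rather than from an external thread dump, is the JDK's `ThreadMXBean`. The sketch below counts live threads by state for a given name prefix. The class name `ThreadStateSummary` and the idea of filtering by name prefix are assumptions made for illustration; they are not part of this change.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper: summarizes thread states for threads matching a name prefix. */
final class ThreadStateSummary {
    /** Counts live threads whose name starts with namePrefix, grouped by thread state. */
    static Map<Thread.State, Integer> countByState(String namePrefix) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip locked-monitor and locked-synchronizer details.
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info != null && info.getThreadName().startsWith(namePrefix)) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A `RUNNABLE` state alone does not show that a thread is making progress, so sampling twice and comparing the results (or the stack traces from `ThreadInfo.getStackTrace()`) is likely to be more informative than a single snapshot.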
Benchmark results
I was a bit surprised: compared with previous results of these benchmarks, I didn't see a significant drop in performance for ENQUEUE_UNSYNCHRONIZED. Benchmarks were run on an m5.4xl instance (16 vCPU).
Checklist
2.x branch if you are targeting Log4j 2; use main otherwise: DONE
./mvnw verify succeeds (if it fails due to code formatting issues reported by Spotless, simply run spotless:apply and retry): DONE
src/changelog/.2.x.x directory: TBD
DONE