Add discard buffer to prevent unsychronized access when RingBuffer full #1410

xendo · 2023-04-06T11:33:48Z

A service that we maintain occasionally hangs when the rate of incoming logs exceeds the throughput of writing logs to disk. We use a discard policy at ERROR level and the default SynchronizeEnqueueWhenQueueFull setting. We believe the hang is caused by a corrupted LMAX Disruptor. We made some initial attempts to produce a minimal reproducer (using jcstress) but have not succeeded so far. This change also 'prioritizes' non-discarded logs, which in theory should reduce logging back-pressure when the buffer is almost full.
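For reference, the setup described above corresponds roughly to the following JVM system properties. This is a sketch, not the service's actual configuration: the property names are the ones documented for Log4j 2.x async loggers and should be checked against the version in use, and the class name `AsyncLoggingSetupSketch` is made up for illustration.

```java
// Sketch only: the described setup expressed as JVM system properties.
// Property names should be verified against the Log4j 2 version in use.
final class AsyncLoggingSetupSketch {

    static void apply() {
        // Make all loggers asynchronous (AsyncLogger on top of the LMAX Disruptor).
        System.setProperty("log4j2.contextSelector",
                "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");
        // Use the discarding queue-full policy ...
        System.setProperty("log4j2.asyncQueueFullPolicy", "Discard");
        // ... with ERROR as the discard threshold.
        System.setProperty("log4j2.discardThreshold", "ERROR");
        // This is the default; it is set explicitly here only to make it visible.
        System.setProperty("log4j2.asyncLoggerSynchronizeEnqueueWhenQueueFull", "true");
    }

    public static void main(String[] args) {
        apply();
        System.out.println("queue-full policy: "
                + System.getProperty("log4j2.asyncQueueFullPolicy"));
    }
}
```

These properties only take effect if they are set before Log4j initializes, for example as `-D` options on the command line.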

While reviewing the log4j code, we noticed that there are code paths that avoid SynchronizeEnqueueWhenQueueFull and may be responsible for RingBuffer corruption. I prepared a small change that should prevent any unsynchronized access and wanted to get initial feedback from the log4j maintainers. I would also like to use this pull request to pick your brains on what else we can do to troubleshoot the issue further.
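One way to read the discard-buffer idea is sketched below. It is a simplified illustration and not the code in this pull request: `ArrayBlockingQueue` stands in for the Disruptor ring buffer, and `DiscardBufferSketch`, `Outcome`, and the inclusive treatment of the threshold level are assumptions made for the example. Discardable events are dropped once the remaining capacity falls into a reserved region, so non-discarded events rarely meet a completely full buffer.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical sketch of a "discard buffer": the last few slots are reserved
// for events that must not be discarded.
final class DiscardBufferSketch {
    // Ordered from most to least severe, so a larger ordinal means less severe.
    enum Level { FATAL, ERROR, WARN, INFO, DEBUG, TRACE }
    enum Outcome { ENQUEUED, DISCARDED, QUEUE_FULL }

    private final ArrayBlockingQueue<String> queue; // stand-in for the ring buffer
    private final int discardBuffer;                // slots reserved for non-discardable events
    private final Level discardThreshold;

    DiscardBufferSketch(int capacity, int discardBuffer, Level discardThreshold) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.discardBuffer = discardBuffer;
        this.discardThreshold = discardThreshold;
    }

    Outcome publish(Level level, String message) {
        // Assumption of this sketch: the threshold level itself counts as discardable.
        boolean discardable = level.compareTo(discardThreshold) >= 0;
        // The capacity check is only approximate under concurrency, which is
        // acceptable here: the reserve just has to be roughly respected.
        if (discardable && queue.remainingCapacity() <= discardBuffer) {
            return Outcome.DISCARDED;
        }
        // Non-blocking attempt; the caller decides what to do on QUEUE_FULL.
        return queue.offer(message) ? Outcome.ENQUEUED : Outcome.QUEUE_FULL;
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        DiscardBufferSketch buffer = new DiscardBufferSketch(4, 2, Level.ERROR);
        for (String msg : new String[] {"a", "b", "c"}) {
            System.out.println("INFO " + msg + " -> " + buffer.publish(Level.INFO, msg));
        }
        for (String msg : new String[] {"d", "e", "f"}) {
            System.out.println("FATAL " + msg + " -> " + buffer.publish(Level.FATAL, msg));
        }
    }
}
```

With a capacity of 4 and a reserve of 2, discardable events stop being accepted once two free slots remain, while non-discardable events can still use the reserved slots.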

Writer threads hang with the following stack trace:

   java.lang.Thread.State: TIMED_WAITING (parking)                                                                                                                                                                                                                        
        at sun.misc.Unsafe.park(Native Method)                                                                                                                                                                                                                            
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)                                                                                                                                                                                         
        at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)                                                                                                                                                                                
        at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)                                                                                                                                                                                
        at com.lmax.disruptor.RingBuffer.publishEvent(RingBuffer.java:465)                                                                                                                                                                                                
        at com.lmax.disruptor.dsl.Disruptor.publishEvent(Disruptor.java:326)                                                                                                                                                                                              
        at org.apache.logging.log4j.core.async.AsyncLoggerDisruptor.enqueueLogMessageWhenQueueFull(AsyncLoggerDisruptor.java:236)                                                                                                                                         
        - locked <0x000000052818a2c0> (a java.lang.Object)                                                                                                                                                                                                                
        at org.apache.logging.log4j.core.async.AsyncLogger.handleRingBufferFull(AsyncLogger.java:246)                                                                                                                                                                     
        at org.apache.logging.log4j.core.async.AsyncLogger.publish(AsyncLogger.java:230)                                                                                                                                                                                  
        at org.apache.logging.log4j.core.async.AsyncLogger.logWithThreadLocalTranslator(AsyncLogger.java:225)                                                                                                                                                             
        at org.apache.logging.log4j.core.async.AsyncLogger.access$000(AsyncLogger.java:67)                                                                                                                                                                                
        at org.apache.logging.log4j.core.async.AsyncLogger$1.log(AsyncLogger.java:152)                                                                                                                                                                                    
        at org.apache.logging.log4j.core.async.AsyncLogger.log(AsyncLogger.java:136)                                                                                                                                                                                      
        at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2205)                                                                                                                                                                            
        at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2159)                                                                                                                                                                 
        at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2142)                                                                                                                                                                         
        at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2022)                                                                                                                                                                               
        at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1875)                                                                                                                                                                             
        at org.apache.logging.slf4j.Log4jLogger.error(Log4jLogger.java:299)

Meanwhile, the event processor thread is RUNNABLE but is not making any progress, i.e. no logs are produced:

   java.lang.Thread.State: RUNNABLE                                                                                                                                                                                                                                       
        at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)                                                                                                                                                                             
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)                                                                                                                                                                                       
        at java.lang.Thread.run(Thread.java:750)

Benchmark results

Somewhat to my surprise, compared to previous results of these benchmarks I didn't see a significant drop in performance for ENQUEUE_UNSYNCHRONIZED. Benchmarks were run on an m5.4xl instance (16 vCPU).

Benchmark                                                      (asyncLoggerType)                           (queueFullPolicy)   Mode  Cnt        Score         Error  Units
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads      ASYNC_CONTEXT                                     ENQUEUE  thrpt    3  1473717.926 ±  910941.806  ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads      ASYNC_CONTEXT                      ENQUEUE_UNSYNCHRONIZED  thrpt    3  1265437.867 ±  184317.376  ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads      ASYNC_CONTEXT                                 SYNCHRONOUS  thrpt    3  1726704.651 ±  579862.302  ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads      ASYNC_CONTEXT                 ENQUEUE_WITH_DISCARD_BUFFER  thrpt    3  1483961.304 ±  536377.123  ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads      ASYNC_CONTEXT  ENQUEUE_UNSYNCHRONIZED_WITH_DISCARD_BUFFER  thrpt    3  1270870.061 ±  749036.480  ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads       ASYNC_CONFIG                                     ENQUEUE  thrpt    3  1327449.173 ±  295188.828  ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads       ASYNC_CONFIG                      ENQUEUE_UNSYNCHRONIZED  thrpt    3  1113508.048 ±  127087.930  ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads       ASYNC_CONFIG                                 SYNCHRONOUS  thrpt    3  1762200.783 ±  491492.622  ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads       ASYNC_CONFIG                 ENQUEUE_WITH_DISCARD_BUFFER  thrpt    3  1332089.400 ± 1275770.505  ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads       ASYNC_CONFIG  ENQUEUE_UNSYNCHRONIZED_WITH_DISCARD_BUFFER  thrpt    3  1126223.983 ±  464606.799  ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread           ASYNC_CONTEXT                                     ENQUEUE  thrpt    3  1466317.665 ±  522556.954  ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread           ASYNC_CONTEXT                      ENQUEUE_UNSYNCHRONIZED  thrpt    3  1477131.094 ±  371409.396  ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread           ASYNC_CONTEXT                                 SYNCHRONOUS  thrpt    3  1429398.368 ±  426180.449  ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread           ASYNC_CONTEXT                 ENQUEUE_WITH_DISCARD_BUFFER  thrpt    3  1459565.150 ± 1185274.825  ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread           ASYNC_CONTEXT  ENQUEUE_UNSYNCHRONIZED_WITH_DISCARD_BUFFER  thrpt    3  1502960.667 ±  380188.217  ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread            ASYNC_CONFIG                                     ENQUEUE  thrpt    3  1579091.646 ±  623213.747  ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread            ASYNC_CONFIG                      ENQUEUE_UNSYNCHRONIZED  thrpt    3  1480321.360 ±  524851.514  ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread            ASYNC_CONFIG                                 SYNCHRONOUS  thrpt    3  1656552.128 ±  543691.433  ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread            ASYNC_CONFIG                 ENQUEUE_WITH_DISCARD_BUFFER  thrpt    3  1580294.283 ±  436382.888  ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread            ASYNC_CONFIG  ENQUEUE_UNSYNCHRONIZED_WITH_DISCARD_BUFFER  thrpt    3  1550912.637 ±  148770.858  ops/s
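
A rough way to read the table: with only 3 iterations per row the error margins are wide, and the Score ± Error ranges of ENQUEUE and ENQUEUE_UNSYNCHRONIZED overlap in the concurrentLoggingThreads rows. The snippet below only does that arithmetic on numbers copied from the table. Overlapping ranges are not a formal significance test, and the class name `BenchmarkOverlapSketch` is made up for illustration.

```java
// Checks whether two "score ± error" ranges from the JMH output overlap.
final class BenchmarkOverlapSketch {

    static boolean overlaps(double scoreA, double errA, double scoreB, double errB) {
        // Two closed intervals overlap when each one starts before the other ends.
        return scoreA - errA <= scoreB + errB && scoreB - errB <= scoreA + errA;
    }

    public static void main(String[] args) {
        // concurrentLoggingThreads, ASYNC_CONTEXT: ENQUEUE vs ENQUEUE_UNSYNCHRONIZED
        System.out.println(overlaps(1473717.926, 910941.806, 1265437.867, 184317.376));
        // concurrentLoggingThreads, ASYNC_CONFIG: ENQUEUE vs ENQUEUE_UNSYNCHRONIZED
        System.out.println(overlaps(1327449.173, 295188.828, 1113508.048, 127087.930));
    }
}
```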

Checklist

Base your changes on 2.x branch if you are targeting Log4j 2; use main otherwise
DONE
./mvnw verify succeeds (if it fails due to code formatting issues reported by Spotless, simply run spotless:apply and retry)
DONE
Changes contain an entry file in the src/changelog/.2.x.x directory
TBD
Tests for the changes are provided
DONE
Commits are signed (optional, but highly recommended)


carterkozak · 2023-04-06T13:10:37Z

> we noticed that there are code paths that avoid SynchronizeEnqueueWhenQueueFull and may be responsible for RingBuffer corruption

The ringbuffer we use is configured for multi-producer-single-consumer, where concurrent requests to add data should be safe. We added the synchronization for a slightly different reason: when the buffer is full, each time an event is processed, the background thread will notify all waiting threads. In a large web-server, this can be a lot of threads, and the system degrades into a state where most CPU time is spent notifying threads. The synchronization allows us to queue waiting threads efficiently when the buffer is full.
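
The control flow described here, which also matches the `- locked <...> (a java.lang.Object)` frame in the stack trace above, can be sketched as follows. This is a simplified illustration rather than the log4j source: `ArrayBlockingQueue` stands in for the ring buffer (and does not itself suffer from the notify-all problem), and `SynchronizedEnqueueSketch`, `queueFullLock`, and `runDemo` are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: producers that find the queue full line up on a shared
// lock, so at most one of them is blocked on the queue itself at any time.
final class SynchronizedEnqueueSketch {
    private final BlockingQueue<String> queue;
    private final Object queueFullLock = new Object();

    SynchronizedEnqueueSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    void publish(String event) throws InterruptedException {
        if (queue.offer(event)) {
            return; // fast path: space was available, no lock taken
        }
        synchronized (queueFullLock) {
            queue.put(event); // slow path: a single waiter blocks until space frees up
        }
    }

    String take() throws InterruptedException {
        return queue.take();
    }

    // Runs several producers against a tiny queue and returns the number of events consumed.
    static int runDemo(int producers, int eventsPerProducer) {
        SynchronizedEnqueueSketch buffer = new SynchronizedEnqueueSketch(2);
        Thread[] threads = new Thread[producers];
        for (int p = 0; p < producers; p++) {
            final int id = p;
            threads[p] = new Thread(() -> {
                try {
                    for (int i = 0; i < eventsPerProducer; i++) {
                        buffer.publish("p" + id + "-" + i);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[p].start();
        }
        int consumed = 0;
        try {
            for (int i = 0; i < producers * eventsPerProducer; i++) {
                buffer.take();
                consumed++;
            }
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return consumed;
    }

    public static void main(String[] args) {
        System.out.println("consumed " + runDemo(4, 100) + " events");
    }
}
```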

What is the rate at which the corrupted state reproduces in your environment? I ran into a similar issue at one point, which reproduced once every few months. Using the SYNCHRONOUS queue-full policy (writing log events from the current thread rather than blocking while waiting to enqueue) for non-discarded events seemed to resolve it (if it does occur, data would be logged synchronously, but my metrics indicate that the queue is never entirely filled either).

xendo · 2023-04-07T11:08:38Z

We can reproduce this pretty reliably. There may be a secret ingredient in our setup (we use custom LogEventPatternConverters), but as I said before, we have not been able to produce a minimal reproducer that I could share. Our initial testing confirmed that the additional discard buffer proposed here mitigates the issue.

We looked into the SYNCHRONOUS policy, but the problem is that it can make performance worse exactly when traffic is highest. The service we own is latency-sensitive, and the tradeoff we want to make is to discard logs if we can't keep up. As far as I understand, the DISCARD and SYNCHRONOUS policies are mutually exclusive. A SYNCHRONOUS_DISCARD policy may be another way to solve this, although I'm not 100% sure it would work, since it still calls tryPublish.
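
A combined policy of the kind mentioned above could look roughly like the sketch below. It is hypothetical and not an existing log4j policy: `ArrayBlockingQueue` stands in for the ring buffer, `offer` stands in for the non-blocking tryPublish attempt, and `SynchronousDiscardSketch`, `Route`, and the inclusive threshold are assumptions made for the example. The producer never blocks: when the queue is full, an event is either dropped or written in the calling thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical sketch of a "discard, otherwise log synchronously" queue-full policy.
final class SynchronousDiscardSketch {
    // Ordered from most to least severe, so a larger ordinal means less severe.
    enum Level { FATAL, ERROR, WARN, INFO, DEBUG, TRACE }
    enum Route { ENQUEUE, SYNCHRONOUS, DISCARD }

    private final ArrayBlockingQueue<String> queue; // stand-in for the ring buffer
    private final Level discardThreshold;
    private final List<String> writtenInline = new ArrayList<>(); // stand-in for the appender

    SynchronousDiscardSketch(int capacity, Level discardThreshold) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.discardThreshold = discardThreshold;
    }

    Route publish(Level level, String message) {
        if (queue.offer(message)) {
            return Route.ENQUEUE; // non-blocking attempt succeeded
        }
        // Queue is full. Assumption: the threshold level itself is discardable.
        if (level.compareTo(discardThreshold) >= 0) {
            return Route.DISCARD;
        }
        // Important event: write it in the calling thread instead of waiting.
        synchronized (writtenInline) {
            writtenInline.add(message);
        }
        return Route.SYNCHRONOUS;
    }

    List<String> writtenInline() {
        synchronized (writtenInline) {
            return new ArrayList<>(writtenInline);
        }
    }

    public static void main(String[] args) {
        SynchronousDiscardSketch logger = new SynchronousDiscardSketch(1, Level.ERROR);
        System.out.println(logger.publish(Level.INFO, "a"));
        System.out.println(logger.publish(Level.INFO, "b"));
        System.out.println(logger.publish(Level.FATAL, "c"));
    }
}
```

The tradeoff is the one raised in the comment: synchronous writes protect important events but move the I/O cost onto the latency-sensitive caller.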

> When the buffer is full, each time an event is processed, the background thread will notify all waiting threads.

Yes, that was also something I noticed while running the benchmarks, although the impact was not nearly as big as it was when you initially worked on this.

> The ringbuffer we use is configured for multi-producer-single-consumer, where concurrent requests to add data should be safe.

From what I can tell, it is not always safe. Ideally that is what should be fixed but, to be honest, I don't know how.

Jerzy Zagorski added 3 commits April 5, 2023 17:04

Add discard buffer to prevent unsychronized access when RinbBuffer al…

afd852e

…most full

Apply spotless

89b26f7

Add tests for DiscardBuffer

a5c0b1c
