When serving large payloads under heavy load, the "OutOfMemoryError: direct memory" error is raised #5443
Comments
This stack trace is from the client side, correct? As it seems your client verticle is implemented with the Vert.x Web Client, it is expected that the client buffers the whole server response. Consider instead the response decoding options described at https://vertx.io/docs/vertx-web-client/java/#_decoding_responses |
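For reference, a minimal sketch of the streaming decode approach that link describes, using BodyCodec.pipe so the client writes the body straight into a WriteStream instead of buffering it; the host, port, path and target file are illustrative and not taken from the reproducer.

import io.vertx.core.Vertx;
import io.vertx.core.file.OpenOptions;
import io.vertx.ext.web.client.WebClient;
import io.vertx.ext.web.codec.BodyCodec;

public class StreamingClientSketch {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();
    WebClient client = WebClient.create(vertx);
    // Open a file (any WriteStream<Buffer> works) and pipe the response body into it,
    // so the client never holds the whole payload in memory at once.
    vertx.fileSystem().open("/tmp/payload.bin", new OpenOptions().setWrite(true))
        .compose(file -> client
            .get(8080, "localhost", "/payload")
            .as(BodyCodec.pipe(file))
            .send())
        .onSuccess(resp -> System.out.println("status: " + resp.statusCode()))
        .onFailure(Throwable::printStackTrace);
  }
}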
Hey, thanks for the quick response. The stack trace is actually from the server component. The server essentially stalls at some point under heavy load, just reports these exceptions, and drops the clients' connections. |
Then you now have some information about how many requests the server can handle with this implementation before it fails with an OOME. Vert.x won't refuse connections based on how much memory is available. You have two options: |
My intent here is not for Vert.x to refuse connections based on memory usage, but rather not to allocate too much direct memory when sending data back to the clients. So something like: "oh, I tried allocating the direct memory for the buffer that I'm trying to send back as a response to the client and got an OutOfMemoryError - I think I need to signal that upstream so that it slows down the rate at which it produces the data".

Below is the backpressure implementation that I have in the semaphore branch of the reproducer project. There is a Java Semaphore there whose number of permits is equivalent to 1 GiB expressed in bytes. Before sending the response back to the client (in this case I use response.end(Buffer), as response.send(Buffer) did not work for some reason), I try acquiring as many permits as there are bytes in the payload; if I'm successful, I call response.end(Buffer) and release the permits in the AsyncResult handler. If I can't acquire the permits, I schedule a one-off timer to retry with an exponential delay. It works - I don't get any OOME - but the mean response time obviously increases.

The thing is that this solution doesn't feel like (perhaps it's just my subjective feeling) a Vert.x-native solution. I also tried implementing a custom ReadStream that would wrap the byte[] that I'm sending as the response body and deliver it in chunks, but as the chunk size grows, I start observing the OOME errors again. If there are no other hints on how I could tackle this problem, I'll stick to the Semaphore-based implementation, but honestly, I would appreciate at least some hints regarding the direction I could follow to solve the problem in a Vert.x-native way. Thanks.

private void sendResponse(final RoutingContext context, final byte[] payload) {
    final var response = context.response();
    if (context.get("id") == null) {
        context.put("id", UUID.randomUUID().toString());
    }
    final String id = context.get("id");
    if (context.get("attempt") == null) {
        context.put("attempt", 0);
    }
    final int attempt = context.get("attempt");
    context.put("attempt", attempt + 1);
    // One permit per byte; the semaphore caps the payload bytes in flight at 1 GiB.
    if (semaphore.tryAcquire(payload.length)) {
        System.out.println(id + " - permits acquired, available permits: " + semaphore.availablePermits());
        response.putHeader(HttpHeaders.CONTENT_TYPE, HttpHeaderValues.APPLICATION_OCTET_STREAM);
        response.putHeader(HttpHeaders.CONTENT_LENGTH, String.valueOf(payload.length));
        // Flag to track if permits have been released
        final var permitsReleased = new AtomicBoolean(false);
        // Wrap the byte[] without copying; permits are released once the write completes.
        response.end(Buffer.buffer(Unpooled.wrappedBuffer(payload)), endResult -> {
            if (permitsReleased.compareAndSet(false, true)) {
                semaphore.release(payload.length);
                final var message = "%s - response sent, permits released, attempt: %d, available permits: %d"
                    .formatted(id, attempt, semaphore.availablePermits());
                System.out.println(message);
            }
        });
    } else {
        System.out.println(id + " - semaphore not acquired, available permits: " + semaphore.availablePermits());
        // exponential backoff
        final var baseDelay = 10;
        final var maxDelay = 1000;
        final var delay = Math.min(baseDelay * (1L << attempt), maxDelay);
        context.vertx().setTimer(delay, timerId -> sendResponse(context, payload));
    }
} |
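For what it's worth, here is a minimal sketch of the custom ReadStream idea described above: it wraps a byte[] and emits fixed-size chunks on demand, so a pipe (for example via response.send(stream)) can pause it whenever the response's write queue is full. The class name and chunk handling are illustrative, not the author's actual implementation.

import io.vertx.core.Handler;
import io.vertx.core.buffer.Buffer;
import io.vertx.core.streams.ReadStream;

public class ByteArrayReadStream implements ReadStream<Buffer> {

  private final byte[] data;
  private final int chunkSize;
  private int pos;
  private long demand = Long.MAX_VALUE; // Long.MAX_VALUE means "flowing", 0 means paused
  private boolean emitting;
  private Handler<Buffer> dataHandler;
  private Handler<Void> endHandler;

  public ByteArrayReadStream(byte[] data, int chunkSize) {
    this.data = data;
    this.chunkSize = chunkSize;
  }

  @Override public ReadStream<Buffer> exceptionHandler(Handler<Throwable> handler) { return this; }
  @Override public ReadStream<Buffer> handler(Handler<Buffer> handler) { dataHandler = handler; emit(); return this; }
  @Override public ReadStream<Buffer> endHandler(Handler<Void> handler) { endHandler = handler; emit(); return this; }
  @Override public ReadStream<Buffer> pause() { demand = 0; return this; }
  @Override public ReadStream<Buffer> resume() { return fetch(Long.MAX_VALUE); }

  @Override public ReadStream<Buffer> fetch(long amount) {
    demand = Long.MAX_VALUE - demand > amount ? demand + amount : Long.MAX_VALUE;
    emit();
    return this;
  }

  private void emit() {
    if (emitting) {
      return; // re-entrant call from a handler; the outer loop picks up the new demand
    }
    emitting = true;
    try {
      while (demand > 0 && dataHandler != null && pos < data.length) {
        if (demand != Long.MAX_VALUE) {
          demand--;
        }
        int len = Math.min(chunkSize, data.length - pos);
        Buffer chunk = Buffer.buffer(len).appendBytes(data, pos, len);
        pos += len;
        dataHandler.handle(chunk);
      }
      if (pos >= data.length && endHandler != null) {
        Handler<Void> h = endHandler;
        endHandler = null; // fire at most once
        h.handle(null);
      }
    } finally {
      emitting = false;
    }
  }
}

With something like this, the handler could call response.send(new ByteArrayReadStream(payload, 8 * 1024)) (or stream.pipeTo(response)) and let the pipe drive the pauses instead of a Semaphore.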
We just started seeing this as well after upgrading to Vert.x 4.5. We had been running on 3.9 for a couple of years, and when we pushed our 4.5 upgrade to prod we immediately started having this same problem. |
@franz1981 does it ring a bell? |
Have you checked if, on the server side, you actually release the received buffers?
For the former, just enable leak detection and see what happens; for the second one, I have no idea what the server does (I suppose it uses Vert.x) and probably @vietj can help more, knowing how it is configured. The latter can still be the reason, since the full request is not sent chunked (I think it isn't, but I could be wrong) - and even without SSL, if you send tons of big payloads to the server, you risk overloading it. |
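For the leak detection suggestion, a minimal sketch (the class name is illustrative): Netty's detector can be turned up programmatically before the Vert.x instance is created, or equivalently with -Dio.netty.leakDetection.level=paranoid on the command line.

import io.netty.util.ResourceLeakDetector;
import io.vertx.core.Vertx;

public class LeakDetectionSketch {
  public static void main(String[] args) {
    // PARANOID samples every buffer allocation (slow - test runs only); ADVANCED is cheaper.
    ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
    Vertx vertx = Vertx.vertx();
    // ... deploy the server verticle as usual; leaked buffers show up in the logs as "LEAK: ..." reports.
  }
}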
Just FYI, we are running our app in Kubernetes using an image built with Paketo. With this setup the memory available for off-heap is limited to 10 MB, as outlined here: https://paketo.io/docs/reference/java-reference/ We decided to set -XX:MaxDirectMemorySize to ensure the process had adequate memory for direct allocations. We are testing now to see if this solves our problem. I am curious, though, whether anyone on the Vert.x team knows if anything has changed between 3.9 and 4.5 that would make Netty allocate off-heap vs. on-heap memory. |
Also we are not using vertx with SSL enabled. Here is the stack from our error.

java.lang.OutOfMemoryError: Cannot reserve 2097152 bytes of direct buffer memory (allocated: 8521715, limit: 10485760)
    at java.base/java.nio.Bits.reserveMemory(Unknown Source) ~[na:na]
    at java.base/java.nio.DirectByteBuffer.<init>(Unknown Source) ~[na:na]
    at java.base/java.nio.ByteBuffer.allocateDirect(Unknown Source) ~[na:na]
    at io.netty.buffer.UnpooledDirectByteBuf.allocateDirect(UnpooledDirectByteBuf.java:104) ~[netty-buffer-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.buffer.UnpooledDirectByteBuf.capacity(UnpooledDirectByteBuf.java:156) ~[netty-buffer-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.buffer.AbstractByteBuf.ensureWritable0(AbstractByteBuf.java:305) ~[netty-buffer-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:280) ~[netty-buffer-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1103) ~[netty-buffer-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:105) ~[netty-codec-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:288) ~[netty-codec-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357) ~[netty-transport-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868) ~[netty-transport-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[netty-transport-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[netty-transport-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) ~[netty-transport-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) ~[netty-transport-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[netty-transport-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.116.Final.jar:4.1.116.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.116.Final.jar:4.1.116.Final]
    at java.base/java.lang.Thread.run(Unknown Source) ~[na:na] |
I have myself pushed #5316 and #5292, but they are disabled by default - and the former should actually do the opposite.
I'm a bit confused here, since the stack in #5443 (comment) is showing something different. If I trust this last stack you sent, something weird is going on:
If it goes unpooled, that is clearly where the problem is.
This doesn't seem to have worked, as the OOM message reports (basically going OOM beyond 10 MB). |
Returning to the original stack trace from @jigga, it looks like it is using TCNATIVE, and I'm not aware of changes in Netty nor Vert.x which could affect this SSL engine.
In short, I would like to understand whether something we do in the server is not behaving correctly, or whether it is something different we do in the client which causes the server to be overloaded (as it seems), regardless of its version. |
@franz1981 I am not the same person that originally created this issue; I just found it while researching, since I am seeing the same problem. I do not have SSL enabled, but according to their examples they do. As for the setting change to increase direct memory, we have just rolled that out. The stack I provided is from before we made the change to the -XX:MaxDirectMemorySize setting; I am sorry I was not clearer about this. As for Netty disabling off-heap buffers, maybe that calculation is not working correctly when running inside a container. So far it does look like increasing -XX:MaxDirectMemorySize has solved our problem, but we probably won't know for sure until Monday. |
That's the part I've mentioned: https://github.com/netty/netty/blob/d0406d77f77c9b5da6279afdb4c801244fef12d9/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java#L112 Netty changed its chunk size fairly recently (see netty/netty@f650303), which has probably caused the number of direct arenas to become 0 instead of 1 (or more?), and this is likely what has made it happen. |
@franz1981 Thank you for your help. It looks like we were using netty 4.1.113.Final with vertx 3.9.16 before our upgrade to vertx 4.5. It looks like the change you mentioned happened long before that. I am going to dig into the code some more to see if I can find anything else that may have changed. |
@NavidMitchell you can enable debug logging and report the number of direct arenas Netty decides to allocate; that would reveal whether this is what has happened. Even if you increase the max direct memory, you need to make sure you are not using unpooled direct memory since, with the standard glibc malloc, the application can die after running for a long time due to memory fragmentation. |
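A small sketch of how the arena count could be checked directly (the class name is illustrative; it assumes Netty 4.1 on the classpath). If defaultNumDirectArena() comes out as 0, pooled direct allocation is effectively disabled, which matches the UnpooledDirectByteBuf frames in the stack trace above.

import io.netty.buffer.PooledByteBufAllocator;
import io.netty.util.internal.PlatformDependent;

public class ArenaCheckSketch {
  public static void main(String[] args) {
    // These are the defaults Netty computes from the available direct memory and CPU count;
    // they can be overridden with -Dio.netty.allocator.numDirectArenas=<n>.
    System.out.println("numDirectArenas = " + PooledByteBufAllocator.defaultNumDirectArena());
    System.out.println("numHeapArenas   = " + PooledByteBufAllocator.defaultNumHeapArena());
    System.out.println("maxDirectMemory = " + PlatformDependent.maxDirectMemory());
  }
}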
@franz1981 @vietj We can enable debug logging as you suggested. Additionally, should I consider downgrading my app to Vert.x 3.9? We have not changed anything Netty-related, and your last comment is concerning. This is a production application serving 10k+ users daily; we did not see this problem until upgrading our app to Vert.x 4.5. |
I wouldn't suggest that; running with more direct memory is the way to go regardless, since most Netty operations really need it (especially I/O). |
@franz1981 I was able to confirm this is not related to the Vert.x 3.9 -> 4.5 upgrade. It looks like our data volumes increased around the same time we rolled out Vert.x 4.5 last week. On both the old and new versions of our app we were seeing numDirectArenas 0; once we changed our settings we now have numDirectArenas 32. I was also able to recreate the error on Vert.x 3.9.16 by sending really large payloads to our app. Thank you for all of your help! |
As far as I can see, Vert.x and Netty are not capable of applying back pressure in this case. When you send a large buffer like this, the original buffer should be fragmented into chunks (I think you figured it out) and sent separately; this way Vert.x/Netty would use a number of direct buffers related to the amount of data the TCP connection can ingest. In other words, from the Vert.x/Netty perspective, writing a very large buffer is a single Netty pipeline write that makes Netty's outbound buffer (on which back-pressure is based) grow from 0 to the maximum of your size. I see you tried using a read stream (which I think is a decent way to go) but still it failed; did you try using various smaller chunk sizes? Perhaps we could use a Netty handler for such a case that would submit slices of buffers (something like ...). |
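A hedged sketch of the chunking idea described above, using only the back-pressure hooks that HttpServerResponse already exposes as a WriteStream (writeQueueFull and drainHandler); the helper name and chunk size are illustrative, not an existing Vert.x API.

import io.vertx.core.buffer.Buffer;
import io.vertx.core.http.HttpServerResponse;

public final class ChunkedSender {

  public static void send(HttpServerResponse response, byte[] payload, int chunkSize) {
    response.setChunked(true);
    writeFrom(response, payload, 0, chunkSize);
  }

  private static void writeFrom(HttpServerResponse response, byte[] payload, int pos, int chunkSize) {
    while (pos < payload.length) {
      int len = Math.min(chunkSize, payload.length - pos);
      response.write(Buffer.buffer(len).appendBytes(payload, pos, len));
      pos += len;
      if (response.writeQueueFull()) {
        int resumeAt = pos;
        // stop here and continue once Netty's outbound buffer drains below the low-water mark
        response.drainHandler(v -> writeFrom(response, payload, resumeAt, chunkSize));
        return;
      }
    }
    response.drainHandler(null); // avoid a late drain event re-entering after end()
    response.end();
  }

  private ChunkedSender() {
  }
}

Something like ChunkedSender.send(context.response(), payload, 64 * 1024) keeps roughly one write-queue's worth of chunks in flight per connection instead of queuing the whole payload at once.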
Wow, that thread exploded unexpectedly, but I'm very glad about it - thanks all for your comments. My current solution is indeed to chunk the response, but there's a chunk size threshold beyond which I immediately start seeing the issue again, and hence I consider this solution a bit fragile. I'm also not sure whether the issue could re-appear if the load on my API increases further. I'd like to have a solution where, even if there's a sudden, unexpected spike in traffic to the API, the server remains stable and responsive and, most importantly, does not stall. I'm leaning towards the Semaphore-based solution, as that way I have full control over the amount of data flowing through the server at any given time, but then again I feel there must be a more Vert.x-native approach; however, I'm not sure I'm fluent enough with the Vert.x internals to implement one correctly. I will share the final solution when I have it. |
Did you look at the Vert.x traffic shaping options support? |
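For reference, a hedged sketch of what the traffic shaping option looks like (assuming the TrafficShapingOptions support available from Vert.x 4.5); the 64 MB/s outbound limit and the port are illustrative numbers, not recommendations.

import io.vertx.core.Vertx;
import io.vertx.core.http.HttpServerOptions;
import io.vertx.core.net.TrafficShapingOptions;

public class ShapedServerSketch {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();
    HttpServerOptions options = new HttpServerOptions()
        .setTrafficShapingOptions(new TrafficShapingOptions()
            // global outbound limit in bytes per second, shared by all connections of this server
            .setOutboundGlobalBandwidth(64 * 1024 * 1024));
    vertx.createHttpServer(options)
        .requestHandler(req -> req.response().end("ok"))
        .listen(8080);
  }
}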
Version
Which version(s) did you encounter this bug?
4.5.11
Context
I encounter the below exception pretty quickly when serving large payloads via HTTP GET under heavy load.
Do you have a reproducer?
I have a simple reproducer repo listed below. It's a Gradle project consisting of the client and server modules. There are dedicated Gradle tasks to start the server and client. Once both are started, it takes several seconds to reproduce the issue.
Steps to reproduce
./gradlew :server:run
./gradlew :client:run
Extra
This is roughly the whole server logic.
The code uses the send(Buffer) method to send the response. My expectation was that this method would apply backpressure if necessary, but that's not the case.
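For illustration only (this is not the reproducer's actual code), a handler of the shape described here, where the whole payload is handed to send(Buffer) in one call and therefore becomes a single large Netty pipeline write:

import io.vertx.core.AbstractVerticle;
import io.vertx.core.buffer.Buffer;
import io.vertx.ext.web.Router;

import java.util.concurrent.ThreadLocalRandom;

public class LargePayloadVerticleSketch extends AbstractVerticle {
  @Override
  public void start() {
    Router router = Router.router(vertx);
    router.get("/payload").handler(ctx -> {
      byte[] payload = new byte[64 * 1024 * 1024]; // 64 MiB, illustrative size
      ThreadLocalRandom.current().nextBytes(payload);
      ctx.response().send(Buffer.buffer(payload)); // whole buffer queued in one write
    });
    vertx.createHttpServer().requestHandler(router).listen(8080);
  }
}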
I have another branch in the reproducer project that applies backpressure using Java's Semaphore; however, I'd like a Vert.x-native solution.