Investigate the OOM in the comet stack when sending large data #1548

Open

rach-id opened this issue Dec 6, 2024 · 0 comments

rach-id (Member) commented Dec 6, 2024

As described in #1385 (comment), we have an OOM happening when we send large data over the p2p stack.

To reproduce this, run the congest test as specified in the comment above using the mock reactor and start sending 5MB messages, for example. Then, watch the RAM usage on any of the servers.

Initial hypothesis:
After running the node until it OOMed, I took a pprof heap profile to see what was taking all the memory:

[pprof heap profile screenshot]
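For anyone reproducing this, a heap profile like the one above can be pulled from a running process via Go's standard net/http/pprof handlers. This is a minimal sketch, not the node's actual wiring, and the listen address is an assumption:

```go
// Minimal sketch: expose the standard /debug/pprof/* endpoints so a heap
// profile can be captured while the node is under load. The listen address
// is an assumption for illustration.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// In a real node this would run in a background goroutine.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

The in-use heap can then be inspected with `go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap`.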

So, we thought it had to do with proto marshalling/unmarshalling (which could still be a separate issue).

To verify this, we disabled all proto encoding when sending/receiving data, but the OOM kept happening.

Second hypothesis:

The OOM can be fixed by reducing the channel buffer sizes, but sending large data then requires a lot of RAM. For example, when constantly sending 10MB messages at max speed, 32GB of RAM is not enough.
This is happening because of how the p2p stack is built: msg -> channel -> connection.
So to send a 10MB message, you need to send it to the channel. The channel, during creation, allocates a fixed-size buffer.
So, with a send/receive buffer of 10 messages, you create 100MB of buffers for each peer. These are allocated during the creation of the node.
The problem is that whenever the node connects to a new peer, more buffers get created, until it OOMs.
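To make that arithmetic concrete, here is a minimal Go sketch of the allocation pattern described above; the type names, sizes, and peer counts are illustrative assumptions, not CometBFT's actual code:

```go
// Illustrative model of the per-peer channel buffers described above
// (assumed names and sizes, not the real CometBFT types).
package main

import "fmt"

const (
	maxMsgSize = 10 << 20 // 10MB per message (example value)
	queueSize  = 10       // messages buffered per direction (example value)
)

// peerChannel models one peer's send/receive buffers.
type peerChannel struct {
	sendBuf []byte
	recvBuf []byte
}

// newPeerChannel mirrors "the channel, during creation, allocates a
// fixed-size buffer": each new peer gets queueSize * maxMsgSize bytes
// per direction.
func newPeerChannel() *peerChannel {
	return &peerChannel{
		sendBuf: make([]byte, 0, queueSize*maxMsgSize), // 100MB capacity
		recvBuf: make([]byte, 0, queueSize*maxMsgSize), // 100MB capacity
	}
}

func main() {
	perPeer := 2 * queueSize * maxMsgSize // send + receive buffers
	for _, peers := range []int{10, 40} {
		fmt.Printf("%d peers -> %d MB of channel buffers\n", peers, peers*perPeer>>20)
	}
	_ = newPeerChannel() // one peer's buffers
}
```

Under these example numbers, every additional peer adds roughly 200MB of buffer capacity, which is why connecting to more peers eventually exhausts RAM.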

I ran an experiment where we fixed the number of peers to 10, including the persistent ones, and decreased the buffer sizes to hold only 2 messages. The nodes still OOM. Worth noting that the smaller the data or the buffers, the longer the nodes take to OOM.

A dump for such an experiment:

[pprof heap dump screenshot]

Decreasing the buffers to hold only 2 messages gave the following dump, taken while using 30GB of RAM moments before OOMing:

[celestia-appd inuse_space profile screenshot]
