As described in #1385 (comment), we have an OOM happening when we send large data over the p2p stack.
To reproduce this, run the congest test as specified in the above comment using the mock reactor and start sending, for example, 5MB messages. Then watch the RAM usage on any of the servers.
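For context, a minimal sketch of what such a flooding mock reactor could look like, written against a Tendermint-style p2p.Reactor interface (the import paths, channel ID, and constants here are illustrative assumptions, and the exact Receive signature differs across versions; the actual congest test from #1385 may be wired differently):

```go
package congest

import (
	// Import paths assume upstream Tendermint; celestia-core's fork may differ.
	"github.com/tendermint/tendermint/p2p"
	"github.com/tendermint/tendermint/p2p/conn"
)

// FloodChannel is a hypothetical channel ID for the flooding reactor.
const FloodChannel = byte(0x90)

// FloodReactor is a rough stand-in for the mock reactor: it sends the same
// large payload to every connected peer as fast as the send queue allows.
type FloodReactor struct {
	p2p.BaseReactor
	payload []byte
}

func NewFloodReactor(msgSize int) *FloodReactor {
	r := &FloodReactor{payload: make([]byte, msgSize)} // e.g. 5 << 20 for 5MB
	r.BaseReactor = *p2p.NewBaseReactor("FloodReactor", r)
	return r
}

func (r *FloodReactor) GetChannels() []*conn.ChannelDescriptor {
	return []*conn.ChannelDescriptor{{
		ID:                  FloodChannel,
		Priority:            5,
		SendQueueCapacity:   10,                    // queued messages per peer
		RecvMessageCapacity: len(r.payload) + 1024, // allow full-size messages
	}}
}

// AddPeer starts a goroutine that floods the new peer with the payload.
func (r *FloodReactor) AddPeer(peer p2p.Peer) {
	go func() {
		for peer.IsRunning() {
			// Blocks (up to a timeout) while the peer's send queue is full.
			peer.Send(FloodChannel, r.payload)
		}
	}()
}

// Receive drops incoming messages; the test only looks at memory usage.
func (r *FloodReactor) Receive(chID byte, peer p2p.Peer, msgBytes []byte) {}
```

The idea is simply to keep every peer's send queue saturated with full-size payloads so the memory behaviour described below becomes visible.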
Initial hypothesis:
After running the node until it OOMed, I took a pprof heap profile to see what was taking all the memory:
So we thought it had to do with proto marshalling/unmarshalling (which could still be a separate issue).
To verify this, we disabled all proto encoding when sending/receiving data, but the OOM kept happening.
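As a side note, heap profiles like the one above can be captured with Go's standard net/http/pprof handler; this is just one way to do it and not necessarily the setup used here:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the profiling endpoints while the node / congest test is running.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the node here ...
	select {}
}
```

The heap profile can then be pulled with `go tool pprof http://localhost:6060/debug/pprof/heap` while the RAM usage is climbing.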
Second hypothesis:
The OOM can be avoided by reducing the channel buffer sizes, but sending large data still requires a lot of RAM. For example, for sending 10MB messages constantly at max speed, 32GB of RAM is not enough.
This is happening because of how the P2P stack is built: msg -> channel -> connection
So if you want to send 10MB messages, you have to push them into the channel, and each channel allocates a fixed-size buffer when it is created.
So, with a send/receive buffer of 10 messages, you end up creating 100MB of buffers for each peer, all allocated when the node is created.
The problem is that whenever the node connects to a new peer, more buffers get created, until it OOMs.
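To make the numbers above concrete, here is a small, self-contained sketch of the allocation pattern (the constants and names are illustrative, not actual celestia-core identifiers):

```go
package main

import "fmt"

func main() {
	const (
		msgSize   = 10 << 20 // 10MB per message, as in the example above
		bufferCap = 10       // messages per per-peer channel buffer
		numPeers  = 50       // grows with every new peer connection
	)

	// Simplified stand-in for the per-peer allocation: one buffered queue of
	// messages per peer, created when the peer connects.
	peers := make([]chan []byte, numPeers)
	for i := range peers {
		peers[i] = make(chan []byte, bufferCap)
	}

	// A full buffer holds 10 * 10MB = 100MB, and every new peer adds more
	// such buffers, so the worst case scales linearly with the peer count.
	perBuffer := bufferCap * msgSize
	fmt.Printf("full buffer: %d MB, worst case across %d peers: %.1f GB\n",
		perBuffer>>20, numPeers, float64(perBuffer*numPeers)/(1<<30))
}
```

Running this prints roughly 100 MB per full buffer and about 4.9 GB in the worst case for 50 peers, counting only the message queues themselves.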
I did an experiment where we fixed the number of peers to 10, including the persistent ones, and decreased the buffer sizes to hold only 2 messages. But the nodes still OOM, even though 2-message buffers across 10 peers should only account for a few hundred MB on their own. Worth noting that the smaller the data or the smaller the buffers, the longer they take to OOM.
A dump for such an experiment:
Decreasing the buffers to hold only 2 messages gave the following dump, taken while the node was using 30GB of RAM moments before OOMing: