get_array_memory_size() returns the wrong result (varying with the compression method) after decoding a record batch from the IPC format
#6363
Comments
Hi @alamb, the use case is that we use DataFusion for distributed search and Arrow Flight for data transmission. When dealing with large tables (e.g., 1000 columns), DataFusion reports "resource exhausted" errors even with a small amount of data (e.g., a few MB). After investigation, we found that the issue stems from DataFusion's use of get_array_memory_size().
Hi @haohuaijin -- the symptoms you describe certainly sound non-ideal. I wonder if the problem is with the reporting of memory used (i.e., get_array_memory_size()) or with the actual allocations. For example, if something in the IPC decoder was using 1MB buffers by default, then 1000 columns * 1MB buffers would result in 1GB of memory used. So I agree this sounds like a bug, but my guess is that it is a real memory allocation issue rather than a reporting issue (though I haven't confirmed this). Perhaps you can use something like https://github.com/KDE/heaptrack to track the actual allocations.
Sorry for the delay, @alamb. I have been traveling recently. I set the parameters of …

Then, I used heaptrack to visualize the data and got the following pictures. [heaptrack screenshots not captured here]
Hi @alamb, I think I found the reason after reading the code. While decoding the record batch, the reader slices each column's buffers out of the single Buffer that holds the whole message body, so every column's buffer shares that one large allocation, and the memory accounting then counts the full capacity of the shared allocation once per column. The code shows the relevant path:

- arrow-rs/arrow-ipc/src/reader.rs, lines 562 to 569 (at d05cf6d)
- arrow-rs/arrow-ipc/src/reader.rs, lines 404 to 406 (at d05cf6d)
- arrow-rs/arrow-ipc/src/reader.rs, lines 51 to 63 (at d05cf6d)
- arrow-rs/arrow-buffer/src/buffer/immutable.rs, lines 223 to 237 (at d05cf6d)
- arrow-rs/arrow-buffer/src/buffer/immutable.rs, lines 166 to 168 (at d05cf6d)

As far as I can tell, read_buffer returns a slice_with_length view of the shared body buffer when a buffer is not compressed, and Buffer::capacity() reports the capacity of the underlying shared allocation rather than the length of the slice.
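To make the capacity-vs-length point concrete, here is a minimal sketch using arrow's Buffer directly (my own illustration, not code from the issue): a small slice of a large shared buffer still reports the full allocation through capacity(), which is what the array memory-size accounting sums.

```rust
use arrow::buffer::Buffer;

fn main() {
    // One large allocation standing in for the decoded IPC message body.
    let body = Buffer::from_vec(vec![0u8; 1_000_000]);

    // A zero-copy slice of the shared body, like a single column buffer
    // produced by the uncompressed decode path.
    let column = body.slice_with_length(0, 8);

    // len() is the slice length, but capacity() reports the entire
    // underlying allocation shared with `body`.
    assert_eq!(column.len(), 8);
    assert_eq!(column.capacity(), 1_000_000);
}
```

With 1000 columns each holding such a slice, summing capacity() per column counts the same allocation 1000 times.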
See related discussion: #6439
Describe the bug
We use the IPC format to transfer RecordBatches between nodes, and we found that using LZ4 compression or no compression causes the value returned by the get_array_memory_size() method of the RecordBatch after transmission to be particularly large, while with ZSTD compression the value is smaller.
To Reproduce
Check this repo: https://github.com/haohuaijin/ipc-bug, or see the sketch below. The arrow version is 52.2.0.
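The inline reproduction code did not survive here, so the following is a minimal sketch of an equivalent roundtrip (a reconstruction, not the repo's exact code; it assumes the arrow crate with the ipc_compression feature enabled):

```rust
use std::io::Cursor;
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::{IpcWriteOptions, StreamWriter};
use arrow::ipc::CompressionType;
use arrow::record_batch::RecordBatch;

/// Encode `batch` into an in-memory IPC stream with the given codec,
/// decode it again, and report the accounted memory size.
fn roundtrip_size(
    batch: &RecordBatch,
    compression: Option<CompressionType>,
) -> Result<usize, ArrowError> {
    let options = IpcWriteOptions::default().try_with_compression(compression)?;
    let schema = batch.schema();

    let mut bytes = Vec::new();
    let mut writer = StreamWriter::try_new_with_options(&mut bytes, &schema, options)?;
    writer.write(batch)?;
    writer.finish()?;
    drop(writer);

    let mut reader = StreamReader::try_new(Cursor::new(bytes), None)?;
    let decoded = reader.next().expect("one batch")?;
    Ok(decoded.get_array_memory_size())
}

fn main() -> Result<(), ArrowError> {
    // A wide batch (many columns, few rows), mimicking the 1000-column case.
    let num_cols = 1000;
    let fields: Vec<Field> = (0..num_cols)
        .map(|i| Field::new(format!("c{i}"), DataType::Int64, false))
        .collect();
    let schema = Arc::new(Schema::new(fields));
    let columns: Vec<ArrayRef> = (0..num_cols)
        .map(|_| Arc::new(Int64Array::from(vec![1_i64; 16])) as ArrayRef)
        .collect();
    let batch = RecordBatch::try_new(schema, columns)?;

    println!("original:  {}", batch.get_array_memory_size());
    println!("no codec:  {}", roundtrip_size(&batch, None)?);
    println!("lz4_frame: {}", roundtrip_size(&batch, Some(CompressionType::LZ4_FRAME))?);
    println!("zstd:      {}", roundtrip_size(&batch, Some(CompressionType::ZSTD))?);
    Ok(())
}
```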
The output with LZ4_FRAME: [output not captured]

The output with ZSTD: [output not captured]

The output without compression: [output not captured]
Expected behavior
The decoded RecordBatch's reported memory size should be similar regardless of which compression type was used during encoding.
Additional context
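A possible consumer-side mitigation (a sketch of my own, not something proposed in this issue) is to deep-copy a decoded batch into freshly allocated, exactly sized buffers so the reported size reflects only the retained data. One way that is guaranteed to materialize new buffers is the take kernel; the compact helper below is hypothetical:

```rust
use arrow::array::UInt32Array;
use arrow::compute::take;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Deep-copy a decoded batch so its buffers no longer alias the shared
/// IPC body allocation (hypothetical helper, not an arrow-rs API).
fn compact(batch: &RecordBatch) -> Result<RecordBatch, ArrowError> {
    // Taking every row in order copies each column into a new allocation.
    let indices = UInt32Array::from_iter_values(0..batch.num_rows() as u32);
    let columns = batch
        .columns()
        .iter()
        .map(|c| take(c.as_ref(), &indices, None))
        .collect::<Result<Vec<_>, _>>()?;
    RecordBatch::try_new(batch.schema(), columns)
}
```

This trades an extra copy for accurate accounting, which may be acceptable when batches are small but reported sizes are inflated.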