Memory consumption spikes when joining two parquet tables #7459
Replies: 2 comments 7 replies
-
Hi @shcheklein. Thanks for the lovely screenshare. It's rare for something to fail when getting quantiles. My first recommendation would be to increase the size of your workers, especially if your partition size is anywhere near that level. Is this easy for you to try? Most folks run Dask workers with 16-128 GB of memory. You could also try not setting the index ahead of time and just joining. Also cc'ing @hendrikmakait, who is working on our next-generation sort/join algorithms. He might be interested in this and is looking for beta users if you're interested.
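For reference, the "join without setting the index first" suggestion would look roughly like the sketch below. The paths and the join key are hypothetical; the only Dask calls assumed are dd.read_parquet, dd.merge, and to_parquet.

```python
import dask.dataframe as dd

# Hypothetical inputs and join key; substitute your own.
left = dd.read_parquet("s3://bucket/left/")
right = dd.read_parquet("s3://bucket/right/")

# Merge on the key column directly instead of calling set_index first;
# Dask shuffles both sides on the join key as part of the merge.
joined = dd.merge(left, right, on="id", how="inner")

joined.to_parquet("s3://bucket/joined/")
```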
-
Sorting / shuffling workloads benefit from having access to all of the data in RAM. If that isn't the case then things will still work, but they'll engage disk and so be slower. Hendrik's new implementation is much more intelligent about handling disk.
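As a side note, the point at which workers start using disk is controlled by the distributed.worker.memory fractions of the per-worker memory limit. A minimal sketch of adjusting them (the values shown should be the defaults; set this before the cluster starts):

```python
import dask

# Fractions of the per-worker memory limit at which a worker reacts.
# These should be the default values; tune them only if needed.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling least-recently-used data to disk
    "distributed.worker.memory.spill": 0.70,      # spill based on measured process memory
    "distributed.worker.memory.pause": 0.80,      # stop scheduling new tasks on the worker
    "distributed.worker.memory.terminate": 0.95,  # the nanny restarts the worker
})
```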
However, what you're seeing isn't even getting to the sorting part of the workflow; it's just reading through the dataset once to get a sense of how to split up the data. I don't typically see things failing at that stage. I don't have much intuition that explains your current situation. My guess is that something is odd about your data. For example, maybe your partitions are large relative to your working memory (partitions *do* have to comfortably fit in memory).
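One way to check that is to measure the in-memory size of each partition and repartition if they turn out to be large. A rough sketch (the path is hypothetical, and the measurement reads through the data once):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("s3://bucket/table/")  # hypothetical path

# In-memory bytes per partition (this reads the data once).
sizes = ddf.map_partitions(
    lambda df: df.memory_usage(deep=True).sum()
).compute()
print(sizes.describe())

# If partitions are large relative to worker memory, split them up
# before any set_index / merge.
ddf = ddf.repartition(partition_size="128MB")
```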
…On Fri, Jan 6, 2023 at 4:39 PM Ivan Shcheklein ***@***.***> wrote:
A good "small" Dask worker is usually 4 cores and 16 GB of RAM. I wouldn't
use more cores, I would just re-partition your cores/memory into processes
differently.
Trying it: no set_index and/or changing worker memory (no luck so far, but at least it gives a bit more information). I'll share an update if I get somewhere.
@mrocklin <https://github.com/mrocklin> what is your intuition / best practice: can Dask process a file that doesn't fit into distributed memory after reading it, even if it's split into smaller chunks? I was thinking of using a lot of small workers (or threads) to consume the parquet in parallel faster and split it into chunks, and I was expecting it would save some of the chunks to disk.
Or is it usually more of an all-in-memory mode? Or is it at least better to expect that everything more or less fits into memory?
(Sorry if my question is not clear, I'm trying to build some intuition around Dask atm.)
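The "4 cores and 16 GB" worker layout quoted above can be spelled out on a single machine roughly as follows (a sketch assuming dask.distributed's LocalCluster; the sizes are illustrative):

```python
from dask.distributed import Client, LocalCluster

# Split a machine (e.g. 16 cores / 64 GB) into four "small" workers of
# 4 threads and 16 GB each, rather than one big worker or many tiny ones.
cluster = LocalCluster(n_workers=4, threads_per_worker=4, memory_limit="16GB")
client = Client(cluster)

print(client.dashboard_link)  # watch per-worker memory while the join runs
```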
-
Hi folks, I would appreciate some advice on where I should look next with this memory spike issue. I can't get any useful signals from the logs anymore, and I can't tell what would be causing the spikes. I'm getting:
And:
While running this code:
Each worker has 5 GB of memory, but I also tried 10 GB. While the parquet read is running, memory consumption is quite low:
Screen.Recording.2023-01-05.at.11.57.41.AM.mov
Only at the very end, when it runs re-quantiles, does it fail abruptly.
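For context on where the re-quantiles tasks come from, here is a hypothetical minimal version of the kind of workflow this thread discusses (paths, column names, and the scheduler address are made up): set_index samples the key column and computes approximate quantiles to choose partition boundaries before shuffling, and that quantile step is what shows up as re-quantiles.

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

left = dd.read_parquet("s3://bucket/left/")    # hypothetical inputs
right = dd.read_parquet("s3://bucket/right/")

# set_index is where the quantile ("re-quantiles") tasks appear: Dask
# samples the key column to pick divisions, then shuffles the data.
left = left.set_index("id")
right = right.set_index("id")

# With both sides indexed on the same key, the join is aligned on divisions.
joined = left.join(right, how="inner")
joined.to_parquet("s3://bucket/joined/")
```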