Memory consumption spikes when joining two parquet tables #7459
Replies: 2 comments 7 replies
-
Hi @shcheklein. Thanks for the lovely screenshare. It's rare for something to fail when getting quantiles. My first recommendation would be to increase the size of your workers, especially if your partition size is anywhere near that level. Is this easy for you to try? Most folks run Dask workers with 16-128 GB of memory. You could also try not setting the index ahead of time and just joining. Also cc'ing @hendrikmakait, who is working on our next-generation sort/join algorithms. He might be interested in this and is looking for beta users if you're interested.
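For reference, the "join without setting the index first" suggestion would look roughly like the sketch below. The paths and the join key are hypothetical; the only Dask calls assumed are dd.read_parquet, dd.merge, and to_parquet.

```python
import dask.dataframe as dd

# Hypothetical inputs and join key; substitute your own.
left = dd.read_parquet("s3://bucket/left/")
right = dd.read_parquet("s3://bucket/right/")

# Merge on the key column directly instead of calling set_index first;
# Dask shuffles both sides on the join key as part of the merge.
joined = dd.merge(left, right, on="id", how="inner")

joined.to_parquet("s3://bucket/joined/")
```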
-
Sorting / shuffling workloads benefit from having access to all of the data in RAM. If that isn't the case then things will still work, but they'll engage disk and so be slower. Hendrik's new implementation is much more intelligent about handling disk.
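As a side note, the point at which workers start using disk is controlled by the distributed.worker.memory fractions of the per-worker memory limit. A minimal sketch of adjusting them (the values shown should be the defaults; set this before the cluster starts):

```python
import dask

# Fractions of the per-worker memory limit at which a worker reacts.
# These should be the default values; tune them only if needed.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling least-recently-used data to disk
    "distributed.worker.memory.spill": 0.70,      # spill based on measured process memory
    "distributed.worker.memory.pause": 0.80,      # stop scheduling new tasks on the worker
    "distributed.worker.memory.terminate": 0.95,  # the nanny restarts the worker
})
```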
However, what you're seeing isn't even getting to the sorting part of the workflow; it's just reading through the dataset once to get a sense of how to split up the data. I don't typically see things failing at that stage. I don't have much intuition that explains your current situation. My guess is that something is odd about your data. For example, maybe your partitions are large relative to your working memory (partitions *do* have to comfortably fit in memory).
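One way to check that is to measure the in-memory size of each partition and repartition if they turn out to be large. A rough sketch (the path is hypothetical, and the measurement reads through the data once):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("s3://bucket/table/")  # hypothetical path

# In-memory bytes per partition (this reads the data once).
sizes = ddf.map_partitions(
    lambda df: df.memory_usage(deep=True).sum()
).compute()
print(sizes.describe())

# If partitions are large relative to worker memory, split them up
# before any set_index / merge.
ddf = ddf.repartition(partition_size="128MB")
```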
…On Fri, Jan 6, 2023 at 4:39 PM Ivan Shcheklein ***@***.***> wrote:
A good "small" Dask worker is usually 4 cores and 16 GB of RAM. I wouldn't
use more cores, I would just re-partition your cores/memory into processes
differently.
Trying it: no set_index and/or changing worker memory (no luck so far, but at least it gives a bit more information). I'll share an update if I get somewhere.
@mrocklin <https://github.com/mrocklin> what is your intuition / best practice: can Dask process a file that doesn't fit into distributed memory after reading it, even if it's split into smaller chunks? I was thinking of using a lot of small workers (or threads) to consume the parquet in parallel faster and split it into chunks, and I was expecting it would save some of the chunks to disk.
Or is it usually more of an all-in-memory mode? Or is it at least better to expect that everything more or less fits into memory?
(Sorry if my question is not clear, I'm trying to build some intuition around Dask atm.)
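The "4 cores and 16 GB" worker layout quoted above can be spelled out on a single machine roughly as follows (a sketch assuming dask.distributed's LocalCluster; the sizes are illustrative):

```python
from dask.distributed import Client, LocalCluster

# Split a machine (e.g. 16 cores / 64 GB) into four "small" workers of
# 4 threads and 16 GB each, rather than one big worker or many tiny ones.
cluster = LocalCluster(n_workers=4, threads_per_worker=4, memory_limit="16GB")
client = Client(cluster)

print(client.dashboard_link)  # watch per-worker memory while the join runs
```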
-
Hi folks, I would appreciate some advice on where I should look next with this memory spike issue. I can't get any useful signals from the logs anymore, and I can't tell what would be causing the spikes. I'm getting:
And:
While running this code:
Each worker has 5 GB of memory, but I also tried 10 GB. While the parquet read is running, memory consumption is quite low:
Screen.Recording.2023-01-05.at.11.57.41.AM.mov
Only at the very end, when it runs re-quantiles, does it fail abruptly.
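For context on where the re-quantiles tasks come from, here is a hypothetical minimal version of the kind of workflow this thread discusses (paths, column names, and the scheduler address are made up): set_index samples the key column and computes approximate quantiles to choose partition boundaries before shuffling, and that quantile step is what shows up as re-quantiles.

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

left = dd.read_parquet("s3://bucket/left/")    # hypothetical inputs
right = dd.read_parquet("s3://bucket/right/")

# set_index is where the quantile ("re-quantiles") tasks appear: Dask
# samples the key column to pick divisions, then shuffles the data.
left = left.set_index("id")
right = right.set_index("id")

# With both sides indexed on the same key, the join is aligned on divisions.
joined = left.join(right, how="inner")
joined.to_parquet("s3://bucket/joined/")
```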