Best practice for handing off persisted collection partitions #988
While working on rapidsai/dask-cuda#1311, I noticed that a common practice used in downstream libraries no longer works (cleanly) with the move to dask-expr.
*The common practice*:
1. Persist a collection (`df = df.persist()`)
2. Find the worker-to-partition mapping for the persisted collection using `mapping = client.who_has()` and `df.__dask_keys__()` (a sketch of this pattern follows below)
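For concreteness, here is a minimal sketch of that pattern, assuming a running cluster; the placeholder DataFrame is ours, and the exact key-matching details vary across dask versions:

```python
import dask.dataframe as dd
from distributed import Client

client = Client()  # assumes a running (or local) cluster
df = dd.from_dict({"a": range(100)}, npartitions=10)  # placeholder data

df = df.persist()

# who_has() maps every key in cluster memory to the workers holding it;
# restricting it to the collection's own keys yields the partition layout.
who_has = client.who_has()
mapping = {k: who_has[k] for k in df.__dask_keys__() if k in who_has}
```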
*The problem with dask-expr*:
In dask-expr, calling `df.persist()` changes the "name" (and therefore the keys) of the collection. The name change is a result of both expression optimization and the creation of a new `FromGraph` expression. Therefore, you cannot call `df = df.persist()` and then search for the keys of `df` in the cluster.
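To make the renaming concrete, a minimal illustration (the exact new names are internal to dask-expr; only the inequality is the point):

```python
keys_before = df.__dask_keys__()

df = df.persist()  # optimizes the expression and wraps the result in FromGraph

keys_after = df.__dask_keys__()

# Under dask-expr the persisted collection gets a new name, so its keys
# no longer match any keys recorded before persisting.
assert keys_after != keys_before
```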
*The question*: What is the new "best practice" for patterns like this?
For reference, here is something that seems to work for now:
```python
df = df.persist()
try:
    # Only works for a FromGraph-backed collection
    persisted_keys = df.keys
except AttributeError:
    # Only works for a legacy collection
    persisted_keys = df.__dask_keys__()
```
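The same fallback can be packaged as a small backward-compatible helper; the function name is ours, not an upstream API, and `df.keys` is the attribute the snippet above relies on:

```python
def persisted_collection_keys(df):
    """Hypothetical helper, not a dask API: return the keys of a
    persisted collection across dask-expr and legacy dask."""
    try:
        return df.keys              # dask-expr FromGraph-backed collection
    except AttributeError:
        return df.__dask_keys__()   # legacy collection
```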
Maybe you want `futures_of`?
Okay, thanks - I suppose this approach is backward compatible:

```python
df = df.persist()
persisted_keys = [f.key for f in c.client.futures_of(df)]
```
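For readers landing here, a hedged sketch of the full replacement pattern, assuming a `distributed.Client` named `client`: `futures_of` is importable from `distributed`, and `client.who_has` accepts a list of futures, so the worker-to-partition mapping can be rebuilt without touching the collection's keys directly:

```python
from distributed import futures_of

df = df.persist()

# The futures backing the persisted collection carry the keys that the
# scheduler actually tracks, regardless of any renaming by dask-expr.
futures = futures_of(df)
persisted_keys = [f.key for f in futures]

# Rebuild the worker-to-partition mapping from those futures.
mapping = client.who_has(futures)
```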
Could you provide a little more context for what you're doing? This feels to me like an abstraction leak that bites us whenever we touch this API. I am touching this API with the scheduler integration again, and this shortcoming could be fixed, but it would be helpful to know a little about the application.
I recommend just using the `futures_of` API.
By "this" API, are you referring to
I've seen this used in a few down-stream libraries. The specific application I am looking at right now is just a custom shuffling algorithm that I am very comfortable experimenting with. However, other down-stream libraries (e.g. cugraph, nemo) also use
Great. I'm not familiar with this API, but happy to use it and recommend it if it works.