Best practice for handing off persisted collection partitions #988
While working on rapidsai/dask-cuda#1311, I noticed that a common practice used in downstream libraries no longer works (cleanly) with the move to dask-expr.
*The common practice*:
1. Persist a collection (`df = df.persist()`)
2. Find the worker-to-partition mapping for the persisted collection using `mapping = client.who_has()` and `df.__dask_keys__()` (a sketch of this pattern follows below)
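For concreteness, here is a minimal sketch of that pattern, assuming a running cluster; the placeholder DataFrame is ours, and the exact key-matching details vary across dask versions:

```python
import dask.dataframe as dd
from distributed import Client

client = Client()  # assumes a running (or local) cluster
df = dd.from_dict({"a": range(100)}, npartitions=10)  # placeholder data

df = df.persist()

# who_has() maps every key in cluster memory to the workers holding it;
# restricting it to the collection's own keys yields the partition layout.
who_has = client.who_has()
mapping = {k: who_has[k] for k in df.__dask_keys__() if k in who_has}
```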
*The problem with dask-expr*:
In dask-expr, calling `df.persist()` changes the "name" (and therefore the keys) of the collection. The name change is a result of both expression optimization and the creation of a new `FromGraph` expression. Therefore, you cannot call `df = df.persist()` and then search for the keys of `df` in the cluster.
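To make the renaming concrete, a minimal illustration (the exact new names are internal to dask-expr; only the inequality is the point):

```python
keys_before = df.__dask_keys__()

df = df.persist()  # optimizes the expression and wraps the result in FromGraph

keys_after = df.__dask_keys__()

# Under dask-expr the persisted collection gets a new name, so its keys
# no longer match any keys recorded before persisting.
assert keys_after != keys_before
```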
*The question*: What is the new "best practice" for patterns like this?
For reference, here is something that seems to work for now:
```python
df = df.persist()
try:
    # Only works for a FromGraph-backed collection
    persisted_keys = df.keys
except AttributeError:
    # Only works for a legacy collection
    persisted_keys = df.__dask_keys__()
```
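The same fallback can be packaged as a small backward-compatible helper; the function name is ours, not an upstream API, and `df.keys` is the attribute the snippet above relies on:

```python
def persisted_collection_keys(df):
    """Hypothetical helper, not a dask API: return the keys of a
    persisted collection across dask-expr and legacy dask."""
    try:
        return df.keys              # dask-expr FromGraph-backed collection
    except AttributeError:
        return df.__dask_keys__()   # legacy collection
```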
Maybe you want `futures_of`?
Okay, thanks - I suppose this approach is backward compatible:

```python
df = df.persist()
persisted_keys = [f.key for f in c.client.futures_of(df)]
```
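For readers landing here, a hedged sketch of the full replacement pattern, assuming a `distributed.Client` named `client`: `futures_of` is importable from `distributed`, and `client.who_has` accepts a list of futures, so the worker-to-partition mapping can be rebuilt without touching the collection's keys directly:

```python
from distributed import futures_of

df = df.persist()

# The futures backing the persisted collection carry the keys that the
# scheduler actually tracks, regardless of any renaming by dask-expr.
futures = futures_of(df)
persisted_keys = [f.key for f in futures]

# Rebuild the worker-to-partition mapping from those futures.
mapping = client.who_has(futures)
```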
Could you provide a little more context for what you're doing? This feels to me like an abstraction leak that bites us whenever we touch this API. I am touching this API with the scheduler integration again, and this shortcoming could be fixed, but it would be helpful to know a little about the application.
I recommend just using the `futures_of` API.
By "this" API, are you referring to
I've seen this used in a few down-stream libraries. The specific application I am looking at right now is just a custom shuffling algorithm that I am very comfortable experimenting with. However, other down-stream libraries (e.g. cugraph, nemo) also use
Great. I'm not familiar with this API, but happy to use it and recommend it if it works.