-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Add compute_with_dask utility for down-stream Merlin libriaries #70
Comments
This looks good! My only comment here is that I wonder if this function might belong in the |
Computing a Dask object/graph is definitely used extensively outside IO, so my gut feeling is "no". Also, all other utilities related to distributed/serial Dask execution are already defined in |
Probably fine to add it there for now then. I would like to organize the utils into other more clearly named sub-packages, but that's out of scope for this issue. |
Thinking about it, I wonder if having a |
Sure - Do you have an intuition for what would qualify a piece of code as something that belongs under |
I don't have a clear line in mind, but having looked at what's in |
Okay - I definitely agree that it makes sense to add a new merlin module for dask-specific code (as long as we all understand that a lot of dask-related code will still need to live in many other places. Perhaps I'll try to throw up a PR with a prospective re-org today or tomorrow. |
Proposal
I propose that we add a single
compute_dask_object
(orcompute_with_dask
) utility tomerlin.core.utils
, and use that utility for all Dask computation within Merlin. The purpose of this utility would be to check theglobal_dask_client()
utility, and utilize the appropriate default client/scheduler for Merlin (and not the default for dask/distributed).More Background
While starting to look into merlin-models#339, I noticed that there is at least one place where Merlin-models uses a bare
compute()
statement to compute a Dask collection. I suspect that this is also done in several other places across the Merlin ecosystem.When there is no global Dask client in the current python context, using
compute()
will typically result in execution with Dask's "multi-threaded" scheduler. This may be fine for CPU-backed data, but will result in many python threads thrashing the same GPU (device 0) when the data is GPU backed.For
compute
operations in NVTabular (which only operate onDelayed
Dask objects), themerlin.core.utils.global_dask_client
utility is used to query to current Dask client. If this function returnsNone
, the convention is to use the "sychronous" scheduler (compute(scheduler="sychronous")
), otherwise the distributed client is used. I propose that this same convention be used everywhere in Merlin (besides special cases wherescheduler="synchronous"
can be hard coded.Note that these changes are also required for Merlin's
Serial
andDistributed
context managers to work correctly.Proposed Implementation
The text was updated successfully, but these errors were encountered: