
create multiple dask workers per gpu #571

Closed
alex-rakowski opened this issue Apr 15, 2021 · 7 comments

Comments
@alex-rakowski

I'm using a LocalCUDACluster to process a Dask dataframe; some pseudo code is below. The GPU function is much quicker than the CPU one but has relatively low memory requirements (2-4 GB), so it would be good to have more workers per GPU. Is there a way to create multiple Dask workers per GPU, or to create logical GPUs out of physical GPUs (similar to what is possible in TensorFlow with `tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)`)? Or is there some other obvious route to improving this?

Hardware this will be used on:
Local machine 4X RTX A6000 (48GB)
Scaled to multinode HPC runs with V100 (16GB) and A100(40GB)

All running inside a Docker/Singularity/Shifter container solution.

Pseudo code:

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)

df = pd.read_parquet('file.parq')
ddf = dd.from_pandas(df, npartitions=4)  # local machine has 4 GPUs

# my_gpu_func is applied row-wise to each partition; the active Client
# routes the computation to the cluster
ddf.map_partitions(lambda df: df.apply(lambda x: my_gpu_func(x.main), axis=1)).compute()

@pentschev
Member

It's possible to start multiple compute threads per GPU by passing `threads_per_worker` (defaults to 1) to `LocalCUDACluster`. However, libraries such as cuDF today only support using the GPU's default stream, meaning parallelism can't be achieved: all threads end up enqueuing tasks on the same stream and synchronizing the entire device when needed, which in most situations leads to no performance gain whatsoever. Besides libraries being able to use multiple streams, we also need to ensure Dask-CUDA itself can take advantage of multiple streams; for that there's an effort toward Per-Thread Default Streams being discussed in #517, but for now performance gains are limited to some particular workflows, so there's much more work left until they generalize.
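To illustrate what `threads_per_worker` does, here is a minimal sketch using Dask's CPU-only `LocalCluster`, which accepts the same keyword as `LocalCUDACluster` (chosen here only so the sketch runs without a GPU; the thread counts shown are illustrative assumptions, not a recommendation):

```python
from dask.distributed import Client, LocalCluster

# threads_per_worker controls how many compute threads each worker runs.
# LocalCUDACluster takes the same keyword, but (as noted above) extra
# threads rarely help while libraries are bound to the default CUDA stream.
cluster = LocalCluster(n_workers=1, threads_per_worker=4, processes=False)
client = Client(cluster)

# The single worker now advertises 4 compute threads to the scheduler
nthreads = sum(
    w["nthreads"] for w in client.scheduler_info()["workers"].values()
)
print(nthreads)

client.close()
cluster.close()
```

The worker count stays at one; only the number of threads inside that worker changes, which is exactly the limitation being discussed.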

@alex-rakowski
Author

In my case, the performance per worker appears adequate with 1 thread, although I may need to experiment with the number of threads in the future; ideally, I need more workers. Currently, I can't raise the number of workers beyond the number of physical GPUs, and I was hoping there was a workaround inside dask_cuda similar to TensorFlow's logical-GPU feature.

@pentschev
Member

Yes, today Dask-CUDA is limited to one worker per GPU as a design choice, and extending that is not currently in our plans. Extending it would entail lots of different complications, such as handling memory pools and spilling efficiently, so our goal is to keep working with one worker per GPU but multiple threads.

If you definitely want to go ahead and test that on your own, the alternative is to launch `dask-scheduler` and `dask-cuda-worker` via the command line. By default, `dask-cuda-worker` will also launch one worker per GPU, but you could run multiple instances of it, effectively achieving multiple workers per GPU. But as I mentioned above, we don't officially support or test this, and you will definitely have to, for example, balance memory utilization on your own, as each worker won't know it's sharing the GPU with other processes.
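A rough sketch of that manual layout (unsupported, as noted above; the scheduler address, the choice of two worker processes per GPU, and the memory-limit value are all illustrative assumptions that would need tuning for the actual GPUs):

```shell
# Start the scheduler
dask-scheduler --port 8786 &

# First set of workers: dask-cuda-worker's default is one worker per GPU
dask-cuda-worker tcp://localhost:8786 --device-memory-limit 20GB &

# Second set: another full one-worker-per-GPU batch, so each GPU now
# hosts two workers. They know nothing about each other, which is why
# device memory must be partitioned manually via --device-memory-limit.
dask-cuda-worker tcp://localhost:8786 --device-memory-limit 20GB &
```

Each extra invocation multiplies the workers per GPU, but also multiplies memory pools and spilling machinery that assume exclusive ownership of the device.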

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@pentschev
Member

Since this is currently out-of-scope for Dask-CUDA, I'm closing this. If there's more you would like to discuss, please feel free to reopen this or open a new issue.

@bartbkr

bartbkr commented Feb 4, 2022

@pentschev I'm absolutely not advocating for adding this functionality. Fooling around with this setting has led to some hard-to-track-down CUDA errors. Is there a way that setting `threads_per_worker` to anything > 1 can raise a warning?

@pentschev
Member

@bartbkr what kind of warning? Running with more than one thread per worker should work fine, although we don't expect any reasonable performance gains with it, plus you're likely to end up with OOM errors faster than you would with the default one thread per worker. If these are OOM errors you're seeing, I wouldn't be surprised.
