
create multiple dask workers per gpu #571

Closed
alex-rakowski opened this issue Apr 15, 2021 · 7 comments

Comments
@alex-rakowski

I'm using a LocalCUDACluster to process a Dask dataframe; some pseudo code is below. The GPU function is much quicker than the CPU one but has relatively low memory requirements (2-4 GB), so it would be good to have more workers per GPU. Is there a way to create multiple Dask workers per GPU, or to create logical GPUs out of physical GPUs (similar to what is possible in TensorFlow with `tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)`)? Or is there some other obvious route to improving this?

Hardware this will be used on:
Local machine 4X RTX A6000 (48GB)
Scaled to multinode HPC runs with V100 (16GB) and A100(40GB)

All running inside a Docker/Singularity/Shifter container solution.

Pseudo code:

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)

df = pd.read_parquet('file.parq')
ddf = dd.from_pandas(df, npartitions=4)  # local machine has 4 GPUs

# my_gpu_func is applied row-wise to each partition; the active Client
# routes the computation to the cluster
ddf.map_partitions(lambda df: df.apply(lambda x: my_gpu_func(x.main), axis=1)).compute()

@pentschev
Member

It's possible to start multiple compute threads per GPU by passing `threads_per_worker` (defaults to 1) to `LocalCUDACluster`. However, libraries such as cuDF today only support using the GPU's default stream, meaning parallelism can't be achieved: all threads end up enqueuing tasks on the same stream and synchronizing the entire device when needed, which in most situations leads to no performance gain whatsoever. Besides libraries being able to use multiple streams, we also need to ensure Dask-CUDA itself can take advantage of multiple streams; for that there's an effort toward Per-Thread Default Streams being discussed in #517, but for now performance gains are limited to some particular workflows, so there's much more work left until they generalize.
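To illustrate what `threads_per_worker` does, here is a minimal sketch using Dask's CPU-only `LocalCluster`, which accepts the same keyword as `LocalCUDACluster` (chosen here only so the sketch runs without a GPU; the thread counts shown are illustrative assumptions, not a recommendation):

```python
from dask.distributed import Client, LocalCluster

# threads_per_worker controls how many compute threads each worker runs.
# LocalCUDACluster takes the same keyword, but (as noted above) extra
# threads rarely help while libraries are bound to the default CUDA stream.
cluster = LocalCluster(n_workers=1, threads_per_worker=4, processes=False)
client = Client(cluster)

# The single worker now advertises 4 compute threads to the scheduler
nthreads = sum(
    w["nthreads"] for w in client.scheduler_info()["workers"].values()
)
print(nthreads)

client.close()
cluster.close()
```

The worker count stays at one; only the number of threads inside that worker changes, which is exactly the limitation being discussed.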

@alex-rakowski
Author

In my case, the performance per worker appears adequate with 1 thread, although I may need to experiment with the number of threads in the future; ideally, I need more workers. Currently, I can't raise the number of workers beyond the number of physical GPUs, and I was hoping there was a workaround inside dask_cuda similar to TensorFlow's logical-GPU feature.

@pentschev
Member

Yes, today Dask-CUDA is limited to one worker per GPU as a design choice, and extending that is not currently in our plans. Extending it would entail lots of different complications, such as handling memory pools and spilling efficiently, so our goal is to keep working with one worker per GPU but multiple threads.

If you definitely want to go ahead and test that on your own, the alternative is to launch `dask-scheduler` and `dask-cuda-worker` via the command line. By default, `dask-cuda-worker` will also launch one worker per GPU, but you could run multiple instances of it, effectively achieving multiple workers per GPU. But as I mentioned above, we don't officially support or test this, and you will definitely have to, for example, balance memory utilization on your own, as each worker won't know it's sharing the GPU with other processes.
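A rough sketch of that manual layout (unsupported, as noted above; the scheduler address, the choice of two worker processes per GPU, and the memory-limit value are all illustrative assumptions that would need tuning for the actual GPUs):

```shell
# Start the scheduler
dask-scheduler --port 8786 &

# First set of workers: dask-cuda-worker's default is one worker per GPU
dask-cuda-worker tcp://localhost:8786 --device-memory-limit 20GB &

# Second set: another full one-worker-per-GPU batch, so each GPU now
# hosts two workers. They know nothing about each other, which is why
# device memory must be partitioned manually via --device-memory-limit.
dask-cuda-worker tcp://localhost:8786 --device-memory-limit 20GB &
```

Each extra invocation multiplies the workers per GPU, but also multiplies memory pools and spilling machinery that assume exclusive ownership of the device.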

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@pentschev
Member

Since this is currently out-of-scope for Dask-CUDA, I'm closing this. If there's more you would like to discuss, please feel free to reopen this or open a new issue.

@bartbkr

bartbkr commented Feb 4, 2022

@pentschev I'm absolutely not advocating for adding this functionality. Fooling around with this setting has led to some hard-to-track-down CUDA errors. Is there a way that setting `threads_per_worker` to anything > 1 can raise a warning?

@pentschev
Member

@bartbkr what kind of warning? Running with more than one thread per worker should work fine, although we don't expect any reasonable performance gains with it, plus you're likely to end up with OOM errors faster than you would with the default one thread per worker. If these are OOM errors you're seeing, I wouldn't be surprised.
