
[FEA] Expose rmm maximum_pool_size to LocalCUDACluster and dask-cuda-worker API #826

Closed
VibhuJawa opened this issue Jan 11, 2022 · 8 comments · Fixed by #827

Comments

VibhuJawa (Member) commented Jan 11, 2022

We should expose RMM's maximum_pool_size argument (see docs) through the LocalCUDACluster and dask-cuda-worker CLI APIs.

Why
By default, RMM can use all of the memory available on the GPU. This causes problems for workflows where the pool actually grows to the total available device memory but we still need some memory outside the pool.

Why we may need room for other allocations:

  1. Competing processes: a common case is the client and a worker sharing the same GPU.

  2. Competing pools/libraries: some libraries, like PyTorch, might need to allocate memory outside of the pool (even NCCL/RAFT might need some room).

Giving the user an easy way to set maximum_pool_size would avoid these issues.

Workflow context:
I ran into this while working on a workflow where the pool grew to 32501 MiB (out of the card's total of 32510 MiB), leaving very little memory for cuML to make non-pool allocations and ultimately causing a failure.

Current workaround:
Reinitialize the pool on each worker via the RMM reinitialize API:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import rmm

cluster = LocalCUDACluster(jit_unspill=True,
                           rmm_pool_size=None,
                           local_directory='/raid3/vjawa/')
client = Client(cluster)

# Set the pool on every worker after the cluster is up
_ = client.run(rmm.reinitialize, pool_allocator=True,
               initial_pool_size=28 * (2**30), maximum_pool_size=28 * (2**30))
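
For context, here is a minimal sketch of what the requested exposure might look like on the Python side. The rmm_maximum_pool_size keyword and the string size values are illustrative assumptions based on this request, not a confirmed API:

# Illustrative sketch only: rmm_maximum_pool_size is the requested, not-yet-merged option
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(jit_unspill=True,
                           rmm_pool_size="28GB",          # enables the RMM pool on each worker
                           rmm_maximum_pool_size="30GB")  # would cap pool growth below total device memory
client = Client(cluster)
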
VibhuJawa (Member, Author) commented Jan 11, 2022

Happy to take this on if we decide to do this.

VibhuJawa changed the title from "[FEA] Expose rmm maximum_pool_size API to dask-cuda" to "[FEA] Expose rmm maximum_pool_size to LocalCUDACluster and dask-cuda-worker API" on Jan 11, 2022
mmccarty commented

Would this allow me to run multiple workers per GPU?

pentschev (Member) commented

Exposing this as rmm_maximum_pool_size/--rmm-maximum-pool-size sounds good to me, @VibhuJawa. I believe the only thing we have to be careful about is making this parameter a no-op if --rmm-pool-size isn't specified, as that's what we use today to decide whether to create an RMM pool at all. Let me know if you want to work on it; otherwise I can do it, it shouldn't take too long.
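
As a rough sketch, the proposed flags might be used with dask-cuda-worker like this (flag names follow the suggestion above and are not final; the scheduler address is a placeholder):

# Illustrative: --rmm-maximum-pool-size would only take effect when --rmm-pool-size is also given
dask-cuda-worker tcp://scheduler:8786 \
    --rmm-pool-size 28GB \
    --rmm-maximum-pool-size 30GB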

pentschev (Member) commented

Would this allow me to run multiple workers per GPU?

@mmccarty I don't quite get the direct relation between this and having multiple workers per GPU. Today you can do that manually, but I don't believe it would necessarily be beneficial performance-wise. What's your use case for multiple workers per GPU?

VibhuJawa (Member, Author) commented

Exposing this as rmm_maximum_pool_size/--rmm-maximum-pool-size sounds good to me, @VibhuJawa. I believe the only thing we have to be careful about is making this parameter a no-op if --rmm-pool-size isn't specified, as that's what we use today to decide whether to create an RMM pool at all. Let me know if you want to work on it; otherwise I can do it, it shouldn't take too long.

Thanks @pentschev. I will push a PR soon, keeping your suggestion in mind.

I'd like to contribute to dask-cuda to get to know the code base a bit better.

mmccarty commented

@pentschev Happy to talk about the use case. Is there a better issue, or should I just create a new one?

rapids-bot closed this as completed in #827 on Jan 12, 2022
rapids-bot pushed a commit that referenced this issue on Jan 12, 2022
This PR closes #826

Authors:
  - Vibhu Jawa (https://github.com/VibhuJawa)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #827
pentschev (Member) commented

@mmccarty I think you should open a new issue; it may be covered somewhere, but I can't tell for sure without more details.

jakirkham (Member) commented

There was an issue (#571), though it's hard to tell whether that is related without knowing more about the use case.
