
[FEA] Expose rmm maximum_pool_size to LocalCUDACluster and dask-cuda-worker API #826

Closed
VibhuJawa opened this issue Jan 11, 2022 · 8 comments · Fixed by #827

Comments

VibhuJawa (Member) commented Jan 11, 2022

We should expose RMM's maximum_pool_size argument (see docs) through the LocalCUDACluster and dask-cuda-worker CLI APIs.

Why
By default, RMM can use all of the memory available on the GPU. This causes problems for workflows where the pool actually grows to the total available device memory but we still need some memory outside the pool.

Why we may need room for other allocations:

  1. Competing processes: a common case is the client and a worker sharing the same GPU.

  2. Competing pools/libraries: some libraries, like PyTorch, might need to allocate memory outside of the pool (even NCCL/RAFT might need some room).

Giving the user an easy way to set maximum_pool_size would avoid these issues.

Workflow context:
I ran into this while working on a workflow where the pool grew to 32501 MiB (out of the card's total of 32510 MiB), leaving very little memory for cuML to make non-pool allocations and ultimately causing a failure.

Current workaround:
Reinitialize the pool on each worker via the RMM reinitialize API:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import rmm

cluster = LocalCUDACluster(jit_unspill=True,
                           rmm_pool_size=None,
                           local_directory='/raid3/vjawa/')
client = Client(cluster)

# Set the pool on every worker after the cluster is up
_ = client.run(rmm.reinitialize, pool_allocator=True,
               initial_pool_size=28 * (2**30), maximum_pool_size=28 * (2**30))
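
For context, here is a minimal sketch of what the requested exposure might look like on the Python side. The rmm_maximum_pool_size keyword and the string size values are illustrative assumptions based on this request, not a confirmed API:

# Illustrative sketch only: rmm_maximum_pool_size is the requested, not-yet-merged option
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(jit_unspill=True,
                           rmm_pool_size="28GB",          # enables the RMM pool on each worker
                           rmm_maximum_pool_size="30GB")  # would cap pool growth below total device memory
client = Client(cluster)
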
VibhuJawa (Member, Author) commented Jan 11, 2022

Happy to take this on if we decide to do this.

VibhuJawa changed the title from "[FEA] Expose rmm maximum_pool_size API to dask-cuda" to "[FEA] Expose rmm maximum_pool_size to LocalCUDACluster and dask-cuda-worker API" on Jan 11, 2022
mmccarty commented

Would this allow me to run multiple workers per GPU?

pentschev (Member) commented

Exposing this as rmm_maximum_pool_size/--rmm-maximum-pool-size sounds good to me, @VibhuJawa. I believe the only thing we have to be careful about is making this parameter a no-op if --rmm-pool-size isn't specified, as that's what we use today to decide whether to create an RMM pool at all. Let me know if you want to work on it; otherwise I can do it, it shouldn't take too long.
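
As a rough sketch, the proposed flags might be used with dask-cuda-worker like this (flag names follow the suggestion above and are not final; the scheduler address is a placeholder):

# Illustrative: --rmm-maximum-pool-size would only take effect when --rmm-pool-size is also given
dask-cuda-worker tcp://scheduler:8786 \
    --rmm-pool-size 28GB \
    --rmm-maximum-pool-size 30GB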

pentschev (Member) commented

Would this allow me to run multiple workers per GPU?

@mmccarty I don't quite get the direct relation between this and having multiple workers per GPU. Today you can do that manually, but I don't believe it would necessarily be beneficial performance-wise. What's your use case for multiple workers per GPU?

VibhuJawa (Member, Author) commented

Exposing this as rmm_maximum_pool_size/--rmm-maximum-pool-size sounds good to me, @VibhuJawa. I believe the only thing we have to be careful about is making this parameter a no-op if --rmm-pool-size isn't specified, as that's what we use today to decide whether to create an RMM pool at all. Let me know if you want to work on it; otherwise I can do it, it shouldn't take too long.

Thanks @pentschev. I will push a PR soon, keeping your suggestion in mind.

I'd like to contribute to dask-cuda to get to know the code base a bit better.

mmccarty commented

@pentschev Happy to talk about the use case. Is there a better issue, or should I just create a new one?

rapids-bot closed this as completed in #827 on Jan 12, 2022
rapids-bot pushed a commit that referenced this issue on Jan 12, 2022
This PR closes #826

Authors:
  - Vibhu Jawa (https://github.com/VibhuJawa)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #827
pentschev (Member) commented

@mmccarty I think you should open a new issue; it may be covered somewhere, but I can't tell for sure without more details.

jakirkham (Member) commented

There was an issue (#571), though it's hard to tell whether that is related without knowing more about the use case.
