-
PyTorch has the `torch.cuda.max_memory_allocated` function, which allows me to figure out how much VRAM remains available to an application at any moment. Does RMM have an equivalent function? My use case is LLM serving: after an initial warm-up step, which consumes some VRAM, I want to use all remaining VRAM to pre-allocate paged cache blocks. To decide the maximum number of cache blocks I can allocate, I need the information returned by `torch.cuda.max_memory_allocated`.
-
`torch.cuda.max_memory_allocated` doesn't seem to do what you describe: it just returns the high-water mark of allocated memory. It sounds like you're asking for a way to query the amount of "free" memory, hoping that you'll then be able to allocate that amount and the allocation will succeed. Unfortunately, there is no such API, and in general it is impossible to provide one.
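(For context, the CUDA runtime does expose `cudaMemGetInfo`, which reports the driver's view of free and total device memory. But even that number is only a snapshot: an allocation of exactly that size may still fail due to fragmentation, allocator overhead, or other processes allocating concurrently. A minimal sketch of querying it:)

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  std::size_t free_bytes = 0, total_bytes = 0;
  // Snapshot of free/total device memory as seen by the CUDA driver.
  cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMemGetInfo failed: %s\n",
                 cudaGetErrorString(err));
    return 1;
  }
  std::printf("free: %zu MiB, total: %zu MiB\n",
              free_bytes >> 20, total_bytes >> 20);
  // Caveat: allocating `free_bytes` in one block may still fail due to
  // fragmentation or concurrent allocations by other processes.
  return 0;
}
```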
-
Actually, the warm-up step I mentioned takes a maximum-sized input and runs inference on the LLM. This gives an upper bound on the peak VRAM footprint the model requires for any input the application accepts. The rest of the available VRAM is then used to allocate cache blocks. Now I want to do the same thing using RMM in C++, without PyTorch. I apologize if my explanation is not clear, but please assume that I am looking for an API equivalent to `torch.cuda.max_memory_allocated`.
-
I've converted this to a discussion. We do support a feature equivalent to `torch.cuda.max_memory_allocated`: `statistics_resource_adaptor`. You would create an instance of your resource, e.g. an appropriate `rmm::mr::pool_memory_resource`, and then construct a `statistics_resource_adaptor` with the pool resource as upstream. The `statistics_mr_tests` show some examples of construction and usage in C++. Let me know if you have more questions about the adaptor.

As @jrhemstad pointed out though, this functionality (and the equivalent you linked from PyTorch) is not the same as the amount of available VRAM. It will just show you how much the warm-up step successfully allocated.