-
PyTorch has the `torch.cuda.max_memory_allocated` function, which allows me to figure out how much VRAM remains available to an application at any moment. Does RMM have an equivalent function? My use case is LLM serving: after an initial warm-up step, which consumes some VRAM, I want to use all remaining VRAM to pre-allocate paged cache blocks. To decide the maximum number of cache blocks I can allocate, I need the information returned by `torch.cuda.max_memory_allocated`.
-
`torch.cuda.max_memory_allocated` doesn't seem to do what you describe: it just returns the high-water mark of allocated memory. It sounds like you're asking for a way to query the amount of "free" memory, hoping that you'll then be able to allocate that amount and the allocation will succeed. Unfortunately, there is no such API, and in general it is impossible to provide one.
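(For context, the CUDA runtime does expose `cudaMemGetInfo`, which reports the driver's view of free and total device memory. But even that number is only a snapshot: an allocation of exactly that size may still fail due to fragmentation, allocator overhead, or other processes allocating concurrently. A minimal sketch of querying it:)

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  std::size_t free_bytes = 0, total_bytes = 0;
  // Snapshot of free/total device memory as seen by the CUDA driver.
  cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMemGetInfo failed: %s\n",
                 cudaGetErrorString(err));
    return 1;
  }
  std::printf("free: %zu MiB, total: %zu MiB\n",
              free_bytes >> 20, total_bytes >> 20);
  // Caveat: allocating `free_bytes` in one block may still fail due to
  // fragmentation or concurrent allocations by other processes.
  return 0;
}
```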
-
Actually, the warm-up step I mentioned takes a maximum-sized input and runs inference on the LLM. This gives an upper bound on the peak VRAM footprint the model requires for any input the application accepts. The rest of the available VRAM is then used to allocate cache blocks. Now I want to do the same thing using RMM in C++, without PyTorch. I apologize if my explanation is not clear, but please assume that I am looking for an API equivalent to `torch.cuda.max_memory_allocated`.
-
I've converted this to a discussion. We do support a feature equivalent to `torch.cuda.max_memory_allocated`: `statistics_resource_adaptor`. You would create an instance of your resource, e.g. an appropriate `rmm::mr::pool_memory_resource`, and then construct a `statistics_resource_adaptor` with the pool resource as upstream. The `statistics_mr_tests` show some examples of construction and usage in C++. Let me know if you have more questions about the adaptor.

As @jrhemstad pointed out though, this functionality (and the equivalent you linked from PyTorch) is not the same as the amount of available VRAM. It will just show you how much the warm-up step successfully allocated.