fix OOM in BlockManager #973
Conversation
We need to remind users that when specifying the value by ratio, care must be taken for TP>1 (different ranks may have different amounts of free memory). Any idea on where to add the reminder in the docs? @lvhan028
I suggest adding some advanced usages in https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/pipeline.md, showing how to set
Hi @lzhangzz Usually, when using a single machine with multiple GPUs for tensor parallelism, the hardware specifications are the same and the weights are equally divided. May we assume that in most scenarios the
Hi @lvhan028 I agree that the usage documentation should also be updated. May this fix be merged first?
Hi, @zhyncs
replace `free * ratio` with `(total * ratio - (total - free))`
Hi @lzhangzz @lvhan028 In the latest commit, I have updated the
I agree with @lvhan028's case. And if we want to completely solve the OOM issue and potential inconsistencies with multi-GPU k/v cache blocks during TP, we may execute a forward pass with dummy inputs to profile the memory usage of the model, like vLLM's
Hi @lvhan028
If it is more difficult for us to decide whether to go with this temporary fix or a more sophisticated one, such as
@@ -86,7 +86,9 @@ size_t BlockManager::GetBlockCount(size_t block_size, double ratio)
     size_t free{};
     size_t total{};
     check_cuda_error(cudaMemGetInfo(&free, &total));
-    return static_cast<size_t>(total * ratio) / block_size;
+    FT_CHECK_WITH_INFO(total * ratio - (total - free) > 0,
I agree. But should it be `total * ratio <= free`?
There's no need to have this restriction; it doesn't matter.
e.g. A100 80G, Llama 2 13B Chat, ratio 0.9:
`total * ratio` = 80 * 0.9 = 72, `free` = 80 - 26 = 54, `total * ratio - (total - free)` = 46 (weights: 26G, k/v cache: 46G)
In this case, `total * ratio` is not less than `free`, but it's OK. We may take `total * ratio - (total - free)` as a whole.
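For reference, the arithmetic in the example above can be checked with a few lines of Python (the 80G / 26G / 0.9 figures come from the comment; this is just a sanity check, not lmdeploy code):

```python
# Numbers from the example above: A100 80G, Llama 2 13B Chat weights ~26G, ratio 0.9.
total = 80        # GiB, total device memory
free = 80 - 26    # GiB, free memory after loading the weights
ratio = 0.9

kv_cache = total * ratio - (total - free)  # memory left for the k/v cache
print(kv_cache)  # 46.0

# total*ratio (72) exceeds free (54), so the stricter `total * ratio <= free`
# check would reject this configuration even though the k/v budget is valid.
assert total * ratio > free
```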
Currently, the `ratio` definition is that `ratio * total` memory is allocated for the k/v cache. So, when OOM happens, we suggest decreasing the `ratio`.
However, the equation `total * ratio - (total - free)` = `free - (1 - ratio) * total` indicates an increasing `ratio` when memory runs out.
That's why I suggest `FT_CHECK_WITH_INFO(total * ratio < free, ...)`
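The algebraic rewrite above can be verified numerically; a quick check over a few arbitrary `(total, free, ratio)` combinations (values chosen only for illustration):

```python
# Verify that total*ratio - (total - free) == free - (1 - ratio)*total
# for several arbitrary (total, free, ratio) combinations.
for total, free, ratio in [(80, 54, 0.9), (24, 10, 0.5), (48, 30, 0.7)]:
    lhs = total * ratio - (total - free)
    rhs = free - (1 - ratio) * total
    assert abs(lhs - rhs) < 1e-9
print("identity holds")
```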
Maybe we can consider revising the definition of `ratio`. Using `ratio * total`-sized memory for the k/v cache can impose a significant mental burden on users. For example, when running Llama 2 7B Chat fp16 on an A30 24G device, the weights occupy 14G, and the default ratio parameter is set to 0.5. If not configured, it could result in OOM. Even when configured manually, the user would need to subtract the weights from the total memory, make estimations, and perform calculations to determine an appropriate `ratio`. It's not user-friendly.
The `ratio` can convey the following meanings, eliminating the need for users to consider model weights:
weights: `total - free`
k/v cache: `total * ratio - (total - free)`
other: `(1 - ratio) * total`
Maybe the `ratio` definition could be: the size of `ratio * total - MODEL_WEIGHTS` memory is allocated for the k/v cache.
With … However, it's inconsistent with the option name … And it does not solve the possible inconsistency in TP mode. Anything that depends on … I suggest we still use … To deal with the inconsistency, we may
@lzhangzz Ok, I agree. The first commit is just
from py3nvml.py3nvml import (
    nvmlInit,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlShutdown,
)
import os


def get_device_ids():
    devices = os.getenv('CUDA_VISIBLE_DEVICES', '')
    return list(map(int, devices.split(','))) if devices else []


def compare_individual_gpu_memory(device_ids):
    try:
        nvmlInit()
        device_count = nvmlDeviceGetCount()
        total_mem = []
        free_mem = []
        for i in range(device_count):
            # When CUDA_VISIBLE_DEVICES is set, only inspect the listed devices
            if device_ids and i not in device_ids:
                continue
            handle = nvmlDeviceGetHandleByIndex(i)
            mem_info = nvmlDeviceGetMemoryInfo(handle)
            total_mem.append(mem_info.total)
            free_mem.append(mem_info.free)
        if not total_mem:
            print("No matching GPUs found")
        else:
            all_total_equal = total_mem.count(total_mem[0]) == len(total_mem)
            all_free_equal = free_mem.count(free_mem[0]) == len(free_mem)
            print(f"Total memory equal across GPUs: {all_total_equal}")
            print(f"Free memory equal across GPUs: {all_free_equal}")
        nvmlShutdown()
    except Exception as e:
        print(f"An exception occurred: {e}")


device_ids = get_device_ids()
compare_individual_gpu_memory(device_ids)

It works when TP > 1. Some code snippets like this @AllentDan
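One way to keep TP ranks consistent when free memory differs (a sketch only, not what this PR implements; `consistent_block_count` is a hypothetical helper) is to take the minimum free memory across ranks before computing the block count:

```python
# Sketch: derive a TP-consistent k/v cache block count by feeding every rank
# the same, most conservative free-memory figure. Values are in bytes.
def consistent_block_count(free_per_rank, total, ratio, block_size):
    free = min(free_per_rank)  # all ranks agree on the smallest free size
    budget = total * ratio - (total - free)
    if budget <= 0:
        raise ValueError("no memory left for the k/v cache; decrease usage or raise ratio")
    return int(budget) // block_size

blocks = consistent_block_count(
    free_per_rank=[54 << 30, 52 << 30],  # rank 1 has 2 GiB less free memory
    total=80 << 30,
    ratio=0.9,
    block_size=1 << 20,
)
print(blocks)  # 45056
```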
It would be better if the check and the logging of a suggested ratio could be done in TurboMind. Not only
ok
As for the Python package to get the memory information, I would suggest using
Hi @lzhangzz @AllentDan May you help review the latest commit? Thanks.
Hi @lvhan028 May you help merge this PR? Thanks.
fix lint issues please
ok
@zhyncs please resolve the linting error
done
hold on, just found some problems in
Hi @lzhangzz May you help review the latest commit? Thanks.
Motivation
fix OOM in BlockManager

Modification
replace `total * ratio` with `free * ratio` in `GetBlockCount`, exclude the weights memory usage

Checklist