
fix OOM in BlockManager #973

Merged: 8 commits merged from patch-1 into InternLM:main on Jan 26, 2024
Conversation

@zhyncs (Collaborator) commented Jan 17, 2024

Motivation

fix OOM in BlockManager

Modification

Replace `total * ratio` with `free * ratio` in `GetBlockCount`, so that the memory already used by the weights is excluded.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@zhyncs (Collaborator, Author) commented Jan 17, 2024

Hi @lzhangzz @lvhan028, could you help review this PR?

@lzhangzz (Collaborator) commented:

We need to remind users that when specifying the value by ratio, care must be taken for TP > 1 (different ranks may have different amounts of free memory).

Any idea where to add the reminder in the docs? @lvhan028

@lvhan028 (Collaborator) commented:

> We need to remind users that when specifying the value by ratio, care must be taken for TP > 1 (different ranks may have different amounts of free memory). Any idea where to add the reminder in the docs? @lvhan028

I suggest adding some advanced usage examples to https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/pipeline.md, showing how to set `cache_max_entry_count`. Meanwhile, we can update the FAQ for the OOM issue.

@zhyncs (Collaborator, Author) commented Jan 18, 2024

> We need to remind users that when specifying the value by ratio, care must be taken for TP > 1 (different ranks may have different amounts of free memory). Any idea where to add the reminder in the docs? @lvhan028

Hi @lzhangzz. Usually, when using a single machine with multiple GPUs for tensor parallelism, the hardware specifications are identical and the weights are divided equally. May we assume that in most scenarios the free memory across ranks is the same?

@zhyncs (Collaborator, Author) commented Jan 18, 2024

> We need to remind users that when specifying the value by ratio, care must be taken for TP > 1 (different ranks may have different amounts of free memory). Any idea where to add the reminder in the docs? @lvhan028
>
> I suggest adding some advanced usage examples to https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/pipeline.md, showing how to set `cache_max_entry_count`. Meanwhile, we can update the FAQ for the OOM issue.

Hi @lvhan028, I agree that the usage documentation should also be updated. Could this fix be considered for merging first?

@lvhan028 (Collaborator) commented:

Hi @zhyncs, I still have some concerns.
In the tensor-parallelism case, if some of the specified GPUs are occupied by other programs or zombie processes, the free memory will differ among the GPUs. That means the number of k/v cache blocks on each GPU is different, which will cause catastrophic failures during inference.

replace `free * ratio` with `(total * ratio - (total - free))`
@zhyncs (Collaborator, Author) commented Jan 19, 2024

Hi @lzhangzz @lvhan028, in the latest commit I have replaced `free * ratio` with `total * ratio - (total - free)`.

I agree with @lvhan028's case.

If we want to completely solve the OOM issue and the potential inconsistency of k/v cache blocks across GPUs under TP, we could execute a forward pass with dummy inputs to profile the model's memory usage, like vLLM's `profile_run` (see the sketch below).
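
As a rough sketch of that idea (not vLLM's actual implementation and not part of this PR; the model, batch size, and sequence length below are placeholder assumptions), the peak memory of a dummy forward pass could be measured with PyTorch's allocator statistics and subtracted from the configured budget:

```python
import torch


def profile_kv_cache_budget(model, max_batch_size, max_seq_len, ratio, device='cuda:0'):
    """Estimate the memory left for the k/v cache after one dummy forward pass."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)

    # Dummy token ids; real profiling would use the engine's worst-case shapes.
    dummy_input = torch.zeros((max_batch_size, max_seq_len),
                              dtype=torch.long, device=device)
    with torch.inference_mode():
        model(dummy_input)

    peak = torch.cuda.max_memory_allocated(device)  # weights + peak activations
    _, total = torch.cuda.mem_get_info(device)
    # Whatever remains of the configured budget can be handed to the k/v cache.
    return max(int(total * ratio) - peak, 0)
```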

@lvhan028 (Collaborator) commented:

Hi @zhyncs, `total * ratio - (total - free)` might be negative.
I prefer to compute the k/v cache memory size on the Python side, not the C++/CUDA side.
@lzhangzz any comments?

@zhyncs (Collaborator, Author) commented Jan 22, 2024

Hi @lvhan028,
I agree that `total * ratio - (total - free)` might be negative, and we may add an assertion for this case. The program should report the error and exit gracefully, prompting the user to provide a correct value for the configuration. Guiding the user to the correct configuration will help prevent further issues or misunderstandings.
In any case, the current `total * ratio` is problematic.
Handling these parameters and checks on the Python side is fine (a sketch of such a check follows below).
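
For illustration, a Python-side check could look roughly like the sketch below (the function name and error message are hypothetical, not from this PR):

```python
import torch


def check_kv_cache_ratio(ratio, device=0):
    """Raise a clear error when the ratio leaves no room for the k/v cache."""
    free, total = torch.cuda.mem_get_info(device)
    kv_bytes = int(total * ratio) - (total - free)
    if kv_bytes <= 0:
        raise RuntimeError(
            f'cache_max_entry_count={ratio} leaves no memory for the k/v cache '
            f'(total={total >> 20} MiB, free={free >> 20} MiB); '
            'please increase the ratio or free up GPU memory.')
    return kv_bytes
```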

@zhyncs (Collaborator, Author) commented Jan 22, 2024

> We need to remind users that when specifying the value by ratio, care must be taken for TP > 1 (different ranks may have different amounts of free memory). Any idea where to add the reminder in the docs? @lvhan028
>
> I suggest adding some advanced usage examples to https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/pipeline.md, showing how to set `cache_max_entry_count`. Meanwhile, we can update the FAQ for the OOM issue.

Hi @lvhan028 @lzhangzz,

If it is hard to decide between this temporary fix and a more sophisticated one, such as a `profile_run` similar to vLLM's, we can go ahead and update the docs now so that users who encounter the OOM issue can temporarily work around it by configuring `cache_max_entry_count` (see the example after the call chain below), even though this configuration carries a certain mental burden.

TurbomindEngineConfig cache_max_entry_count
  ->LlamaTritonModel EngineParams cache_max_block_count
    ->LlamaV2 EngineParams cache_max_block_count
      ->LlamaBatch EngineParams cache_max_block_count
        ->SequenceManager block_count
          ->BlockManager block_count
            ->GetBlockCount ratio
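
For the docs, a usage example could look roughly like the following sketch (the `pipeline`/`backend_config` API and the model name are assumptions based on the lmdeploy pipeline docs and should be verified against the targeted version):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Lower the fraction of GPU memory reserved for the k/v cache to avoid OOM.
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.4)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe(['Hi, please introduce yourself']))
```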

@@ -86,7 +86,9 @@ size_t BlockManager::GetBlockCount(size_t block_size, double ratio)
     size_t free{};
     size_t total{};
     check_cuda_error(cudaMemGetInfo(&free, &total));
-    return static_cast<size_t>(total * ratio) / block_size;
+    FT_CHECK_WITH_INFO(total * ratio - (total - free) > 0,
Collaborator commented:
I agree. But should it be `total * ratio <= free`?

@zhyncs (Author) commented:

There's no need for this restriction; it doesn't matter.
For example: A100 80 GB, Llama 2 13B Chat, ratio 0.9:
`total * ratio` = 80 * 0.9 = 72, `free` = 80 - 26 = 54, `total * ratio - (total - free)` = 46 (weights: 26 GB, k/v cache: 46 GB).
In this case `total * ratio` is not less than `free`, but it's still fine.
We may treat `total * ratio - (total - free)` as a whole.

Collaborator commented:

Currently, the definition of ratio is that memory of size `ratio * total` is allocated for the k/v cache. So, when OOM happens, we suggest decreasing the ratio.
However, the equation `total * ratio - (total - free) = free - (1 - ratio) * total` implies increasing the ratio when memory runs out.
That's why I suggest `FT_CHECK_WITH_INFO(total * ratio < free, ...)`.

@zhyncs (Author) commented:

Maybe we can consider revising the definition of ratio. Using memory of size `ratio * total` for the k/v cache imposes a significant mental burden on users. For example, when running Llama 2 7B Chat in fp16 on an A30 24 GB device, the weights occupy 14 GB and the default ratio is 0.5. If not configured, it could result in OOM. Even when configuring it manually, the user would need to subtract the weights from the total memory, make estimations, and perform calculations to determine an appropriate ratio. It's not user-friendly.
The ratio could instead convey the following meanings, eliminating the need for users to consider the model weights (a small numeric check follows the list):

weights: total - free
k/v cache: total * ratio - (total - free)
other: (1 - ratio) * total
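
A quick check of this split, reusing the A100 80 GB example from the review thread above (26 GB of weights, ratio 0.9), confirms the three parts add up to the total:

```python
total, ratio = 80, 0.9          # A100 80 GB, cache_max_entry_count = 0.9
free = total - 26               # 26 GB of weights already loaded

weights = total - free                      # 26 GB used by the model weights
kv_cache = total * ratio - (total - free)   # 72 - 26 = 46 GB for the k/v cache
other = (1 - ratio) * total                 # ~8 GB left untouched

assert abs(weights + kv_cache + other - total) < 1e-9  # the split covers all 80 GB
```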

@zhyncs (Author) commented:

Maybe the ratio definition could be: memory of size `ratio * total - MODEL_WEIGHTS` is allocated for the k/v cache.

@lzhangzz (Collaborator) commented:

With `kv = total * ratio - (total - free)`, the semantics of ratio become `ratio = kv% + used%`, i.e. the fraction for the k/v cache plus the fraction already used by the weights.

However, that is inconsistent with the option name `cache_max_entry_count`, which suggests we are configuring the amount of k/v-cache blocks.

It also does not solve the possible inconsistency in TP mode; anything that depends on `free` suffers from the same problem.

I suggest we still use `kv = free * ratio`, as it's consistent with the current option.

To deal with the inconsistency, we may

  1. Find the minimal `free` across all ranks, which requires additional communication and synchronization (see the sketch after this list).
  2. Do nothing for now and emit a warning that the user should check for consistency.
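
A minimal sketch of option 1, assuming a torch.distributed process group is already initialized across the TP ranks (illustrative only, not the code added in this PR):

```python
import torch
import torch.distributed as dist


def min_free_memory_across_ranks(device):
    """Return the smallest free GPU memory (in bytes) over all TP ranks."""
    free, _ = torch.cuda.mem_get_info(device)
    buf = torch.tensor([free], dtype=torch.int64, device=device)
    # One collective: afterwards every rank holds the global minimum.
    dist.all_reduce(buf, op=dist.ReduceOp.MIN)
    return int(buf.item())


# Every rank then derives the same k/v cache budget:
# kv_bytes = int(min_free_memory_across_ranks(device) * cache_max_entry_count)
```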

@zhyncs (Collaborator, Author) commented Jan 24, 2024

> With `kv = total * ratio - (total - free)`, the semantics of ratio become `ratio = kv% + used%`, i.e. the fraction for the k/v cache plus the fraction already used by the weights.
>
> However, that is inconsistent with the option name `cache_max_entry_count`, which suggests we are configuring the amount of k/v-cache blocks.
>
> It also does not solve the possible inconsistency in TP mode; anything that depends on `free` suffers from the same problem.
>
> I suggest we still use `kv = free * ratio`, as it's consistent with the current option.
>
> To deal with the inconsistency, we may
>
> 1. Find the minimal `free` across all ranks, which requires additional communication and synchronization.
> 2. Do nothing for now and emit a warning that the user should check for consistency.

@lzhangzz OK, I agree. The first commit was just `free * ratio`. To keep using `cache_max_entry_count` as before, I will revert to that and add a warning about TP consistency. I think this check fits best before initializing `AsyncEngine`. @AllentDan do you have any suggestions?

@zhyncs (Collaborator, Author) commented Jan 24, 2024

from py3nvml.py3nvml import (
    nvmlInit,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlShutdown,
)
import os


def get_device_ids():
    """Parse CUDA_VISIBLE_DEVICES into a list of device indices ([] means all)."""
    devices = os.getenv('CUDA_VISIBLE_DEVICES', '')
    return list(map(int, devices.split(','))) if devices else []


def compare_individual_gpu_memory(device_ids):
    """Check whether total and free memory are identical across the selected GPUs."""
    try:
        nvmlInit()
        device_count = nvmlDeviceGetCount()
        total_mem = []
        free_mem = []

        for i in range(device_count):
            # Skip devices that are not selected via CUDA_VISIBLE_DEVICES.
            if device_ids and i not in device_ids:
                continue
            handle = nvmlDeviceGetHandleByIndex(i)
            mem_info = nvmlDeviceGetMemoryInfo(handle)
            total_mem.append(mem_info.total)
            free_mem.append(mem_info.free)

        if not total_mem:
            print('No matching GPU found')
            return

        all_total_equal = total_mem.count(total_mem[0]) == len(total_mem)
        all_free_equal = free_mem.count(free_mem[0]) == len(free_mem)

        print(f'Total memory equal across GPUs: {all_total_equal}')
        print(f'Free memory equal across GPUs: {all_free_equal}')

        nvmlShutdown()

    except Exception as e:
        print(f'An exception occurred: {e}')


device_ids = get_device_ids()
compare_individual_gpu_memory(device_ids)

It works when TP > 1. Something like this code snippet could do the check, @AllentDan.

@AllentDan (Collaborator) commented:

It would be better to do the check and log a suggested ratio in Turbomind. AsyncEngine is not the only component that uses Turbomind.

@zhyncs (Collaborator, Author) commented Jan 24, 2024

> It would be better to do the check and log a suggested ratio in Turbomind. AsyncEngine is not the only component that uses Turbomind.

ok

@AllentDan (Collaborator) commented:

As for the Python package to get the memory information, I would suggest using `torch.cuda.mem_get_info`. This function requires no extra third-party package.
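
For example, the earlier consistency check could be expressed with `torch.cuda.mem_get_info` roughly as follows (a sketch only; the actual check added in this PR may differ):

```python
import torch


def gpu_memory_consistent(device_ids):
    """Check whether free/total memory match across the given CUDA devices."""
    free_mem, total_mem = [], []
    for dev in device_ids:
        free, total = torch.cuda.mem_get_info(dev)
        free_mem.append(free)
        total_mem.append(total)

    return len(set(total_mem)) == 1, len(set(free_mem)) == 1


# e.g. for TP = 2 on the first two visible devices:
# total_equal, free_equal = gpu_memory_consistent([0, 1])
```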

@zhyncs (Collaborator, Author) commented Jan 25, 2024

Hi @lzhangzz @AllentDan, could you help review the latest commit? Thanks.

lzhangzz previously approved these changes on Jan 26, 2024
@zhyncs (Collaborator, Author) commented Jan 26, 2024

Hi @lvhan028, could you help merge this PR? Thanks.

@lzhangzz (Collaborator) commented:

Please fix the lint issues.

@zhyncs (Collaborator, Author) commented Jan 26, 2024

> Please fix the lint issues.

ok

@lvhan028 (Collaborator) commented:

@zhyncs please resolve the linting errors:

pip install pre-commit
cd lmdeploy
pre-commit install .
pre-commit run --all-files

@zhyncs (Collaborator, Author) commented Jan 26, 2024

> @zhyncs please resolve the linting errors:
>
> pip install pre-commit
> cd lmdeploy
> pre-commit install .
> pre-commit run --all-files
[username@hostname lmdeploy]# pre-commit run --all-files
flake8...................................................................Passed
isort....................................................................Passed
yapf.....................................................................Passed
trim trailing whitespace.................................................Passed
check yaml...............................................................Passed
fix end of files.........................................................Passed
fix requirements.txt.....................................................Passed
fix double quoted strings................................................Passed
check for merge conflicts................................................Passed
fix python encoding pragma...............................................Passed
mixed line ending........................................................Passed
mdformat.................................................................Passed
codespell................................................................Passed
docformatter.............................................................Passed
check copyright..........................................................Passed

done

@lzhangzz (Collaborator) commented:

Hold on, I just found some problems in `_compare_individual_gpu_memory`.

lmdeploy/turbomind/turbomind.py: 3 review threads (outdated, resolved)
@zhyncs (Collaborator, Author) commented Jan 26, 2024

Hi @lzhangzz, could you help review the latest commit? Thanks.

@lzhangzz merged commit 0f324a4 into InternLM:main on Jan 26, 2024. 8 checks passed.
@zhyncs deleted the patch-1 branch on January 26, 2024 at 08:41.