
OOM for Mistral-Nemo-Base-2407 with NeMo + ThunderFX for input sequence lengths working for NeMo Eager #1475

Open
mpatel31415 opened this issue Nov 26, 2024 · 6 comments

mpatel31415 commented Nov 26, 2024

🐛 Bug

When running Mistral-Nemo-Base-2407 with NeMo + ThunderFX, we get an OOM error even for small sequence lengths.

To Reproduce

The error is present on 1xH100.

Dockerfile used (I built it yesterday, and I'm not sure yet how nemo:dev images are versioned, so I can't provide its exact version):

FROM nvcr.io/nvidia/nemo:dev
ARG NVFUSER_REPO=git+https://github.com/NVIDIA/Fuser.git
ARG THUNDER_REPO=git+https://github.com/Lightning-AI/lightning-thunder.git

# Add the latest NeMo code from a fresh clone
RUN git clone --recursive https://github.com/NVIDIA/NeMo.git /NeMo_cloned
RUN (cd /NeMo_cloned && python -m pip install .)


# Install requirements needed for NeMo, Thunder and nvFuser.
# We must install them in this convoluted way because otherwise Thunder is not
# updated and we cannot use the latest version.
RUN python -m pip install -r /NeMo_cloned/requirements/requirements_lightning.txt && \
    python -m pip install --upgrade ${NVFUSER_REPO}  && \
    python -m pip install --upgrade ${THUNDER_REPO} && \
    python -m pip install --upgrade --no-deps --force-reinstall ${NVFUSER_REPO} && \
    python -m pip install --upgrade --no-deps --force-reinstall ${THUNDER_REPO}
 
# Install Mixology requirements (this can be skipped, so I'm commenting it out)
# COPY requirements/mixology.txt mixology_requirements.txt
# RUN pip install --upgrade -r mixology_requirements.txt

Inside the Docker container, please run:

model=mistralai/Mistral-Nemo-Base-2407
# Download the model (you might need to set HF_TOKEN and accept the model's terms of use on the website)
huggingface-cli download $model --local-dir checkpoints/$model --cache-dir checkpoints/$model 
# Run benchmark
python bench_targets/llm_peft/_nemo.py --model checkpoints/$model --mbs 1 --seq-length 2048 --jit-backend thunder

The script bench_targets/llm_peft/_nemo.py can be obtained from the internal GitLab repository akoumparouli/nemo_bench. You can contact me or @tfogal if you have any questions.

You can check that the same benchmark works with the eager backend:

python bench_targets/llm_peft/_nemo.py --model checkpoints/$model --mbs 1 --seq-length 2048 --jit-backend eager

Expected behavior

With ThunderFX we should be able to run at least the same sequence lengths as with NeMo Eager.

Environment

cc @tfogal

IvanYashchuk added the nemo (Issues needed to support NVIDIA NeMo models) and mixology (Issues that the mixology team has surfaced) labels Nov 26, 2024
@mpatel31415

As Tom suggested, the issue might be caused by this PR: #1400

@IvanYashchuk

@tfogal, why do you think that pull request could cause problems?

tfogal commented Nov 26, 2024

@tfogal, why do you think that pull request could cause problems?

I had git bisected and found that 052bac3 (the commit right before it) works well.

I am confused about what in that PR is the issue, though; I'm experimenting now, but my initial theory about overly aggressive pruning doesn't hold water, so I'm not sure at the moment.
Edit: Phi-3 doesn't change at all before/after #1400; I somehow saw a difference in memory usage with Mistral-Nemo that went away when I tested again. I will need to dig in more next week.

IvanYashchuk commented Nov 27, 2024

I had git bisected and found that 052bac3 (the commit right before it) works well.

@tfogal, what command did you run for bisection? I get an OOM error on an H100 with commit 052bac3 and a 2k sequence length, as in the issue description:

python bench_targets/llm_peft/_nemo.py --model=mistralai/Mistral-Nemo-Base-2407 --mbs 1 --seq-length 2048 --jit-backend thunder

The same OOM error occurs with the linked commit and with the commits before it, but a different error appears with the commit right after (a617503), which breaks memory consumption because a deepcopy of an fx.GraphModule also creates copies of all parameters.
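
For illustration, here is a minimal sketch (a toy nn.Linear stand-in, not the actual fx.GraphModule produced by the benchmark) of why such a deepcopy is costly: the copy gets its own parameter storage instead of aliasing the original tensors.

import copy
import torch

# Toy stand-in for the captured module; in the real case the fx.GraphModule
# holds the Mistral-Nemo submodules and their parameters.
m = torch.nn.Linear(1024, 1024, device="cuda")
before = torch.cuda.memory_allocated()

m_copy = copy.deepcopy(m)  # duplicates the parameter storage

# The copy does not alias the original weights...
print(m.weight.data_ptr() == m_copy.weight.data_ptr())  # False
# ...so allocated memory grows by roughly the size of the module's parameters.
print(torch.cuda.memory_allocated() - before)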

With a 1k sequence length, the memory consumption per commit is (a sketch of one way to record such numbers follows the list):

c9bbc5e0: 66880730624 bytes
052bac: 66880730624 bytes
a61750: OOM
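
For reference, this is one way such byte counts can be collected; it is not necessarily how the numbers above were measured, and run_step is a hypothetical stand-in for one iteration of the benchmark script.

import torch

def run_step():
    # Hypothetical placeholder for one forward/backward iteration of
    # bench_targets/llm_peft/_nemo.py; not the actual benchmark code.
    pass

torch.cuda.reset_peak_memory_stats()
run_step()
torch.cuda.synchronize()
print(torch.cuda.memory_allocated())      # bytes currently allocated
print(torch.cuda.max_memory_allocated())  # peak bytes during the step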

@kshitij12345, looks like the problem was introduced in #1400.

IvanYashchuk assigned kshitij12345 and unassigned tfogal Nov 27, 2024
kshitij12345 commented Nov 27, 2024

Interestingly, copy.deepcopy only leads to copying of parameters when torch._dynamo.config.inline_inbuilt_nn_modules=False, and since the upstream PyTorch change it defaults to True (i.e. from the PyTorch 2.5 release). The following snippet demonstrates the difference in generated graphs and memory usage with copy.deepcopy on a GraphModule.

import torch
import copy

# copy.deepcopy leads to more memory usage (as modules with parameters are saved in GraphModule).
# Eg.
# class GraphModule(torch.nn.Module):
#     def forward(self, L_args_0_: "f32[1024, 1024]"):
#         l_args_0_ = L_args_0_
        
#          # File: /home/kkalambarkar/git/pytorch/torch/_dynamo/external_utils.py:31 in inner, code: return fn(*args, **kwargs)
#         fn_0: "f32[1024, 1024]" = self.fn_0(l_args_0_);  l_args_0_ = None
#         fn_1: "f32[1024, 1024]" = self.fn_1(fn_0);  fn_0 = None
#         return (fn_1,)
# Only the last of the two assignments below takes effect; comment one out to
# compare the two modes (the numbers in the comments at the end show both).
torch._dynamo.config.inline_inbuilt_nn_modules=False

# class GraphModule(torch.nn.Module):
#     def forward(self, L_fn_modules_0_parameters_weight_: "f32[1024, 1024]", L_fn_modules_0_parameters_bias_: "f32[1024]", L_args_0_: "f32[1024, 1024]", L_fn_modules_1_parameters_weight_: "f32[1024, 1024]", L_fn_modules_1_parameters_bias_: "f32[1024]"):
#         l_fn_modules_0_parameters_weight_ = L_fn_modules_0_parameters_weight_
#         l_fn_modules_0_parameters_bias_ = L_fn_modules_0_parameters_bias_
#         l_args_0_ = L_args_0_
#         l_fn_modules_1_parameters_weight_ = L_fn_modules_1_parameters_weight_
#         l_fn_modules_1_parameters_bias_ = L_fn_modules_1_parameters_bias_
        
#          # File: /home/kkalambarkar/git/pytorch/torch/_dynamo/external_utils.py:31 in inner, code: return fn(*args, **kwargs)
#         input_1: "f32[1024, 1024]" = torch._C._nn.linear(l_args_0_, l_fn_modules_0_parameters_weight_, l_fn_modules_0_parameters_bias_);  l_args_0_ = l_fn_modules_0_parameters_weight_ = l_fn_modules_0_parameters_bias_ = None
#         input_2: "f32[1024, 1024]" = torch._C._nn.linear(input_1, l_fn_modules_1_parameters_weight_, l_fn_modules_1_parameters_bias_);  input_1 = l_fn_modules_1_parameters_weight_ = l_fn_modules_1_parameters_bias_ = None
#         return (input_2,)
torch._dynamo.config.inline_inbuilt_nn_modules=True

gm_copy = None

def backend(gm, sample_args):
    global gm_copy
    gm_copy = copy.deepcopy(gm)

    gm.print_readable()

    return gm

with torch.device("cuda"):
    models = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.Linear(1024, 1024))

opt_model = torch.compile(models, backend=backend)

x = torch.randn(1024, 1024, device="cuda")
opt_model(x)

print(torch.cuda.memory_allocated())  # no_inline = 29507584, inline = 21110784
del opt_model, models
print(torch.cuda.memory_allocated())  # no_inline = 29507584, inline = 12713984
del gm_copy
print(torch.cuda.memory_allocated())  # no_inline = 29507584, inline = 12713984

@IvanYashchuk

Cool, so as long as torch._dynamo.config.inline_inbuilt_nn_modules=True is used, there's nothing to fix on the Thunder side.
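
As an illustrative sketch (a toy model and the eager backend rather than the NeMo benchmark and thunder), the setting can be pinned explicitly before compilation to guard against the parameter-copying behaviour:

import torch
import torch._dynamo

# True is the default from PyTorch 2.5 on; setting it explicitly documents the
# requirement for this workload.
torch._dynamo.config.inline_inbuilt_nn_modules = True

model = torch.nn.Linear(1024, 1024, device="cuda")
# The "eager" backend keeps this sketch self-contained; the issue itself runs
# with --jit-backend thunder through ThunderFX.
compiled = torch.compile(model, backend="eager")
compiled(torch.randn(8, 1024, device="cuda"))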
