support graph-by-graph benchmarking for PyTorch native checkpointing #1437
What does this PR do?
The converter replaces the Torch operators in the checkpoint function with Thunder operators in place, and the compiled Thunder/Inductor module is likewise swapped in in place; however, `ThunderCompilerGraphBenchmarking` and the saving of the reproduction script need the original `GraphModule` to compile/save.
Previously, deepcopying the `GraphModule` was blocked by pytorch/pytorch#139275. Thanks to @kshitij12345 for helping to fix it, we can now use `deepcopy` to support graph-by-graph benchmarking for PyTorch native checkpointing starting from Torch 2.6.
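As a rough sketch of the idea (hypothetical helper names, not this PR's actual code: `convert_checkpoint_ops` and `prepare_for_benchmark` are illustrative), the benchmarking path can deep-copy the `GraphModule` before the converter mutates it in place, gated on the Torch version that carries the deepcopy fix:

```python
import copy

import torch
from packaging import version


def convert_checkpoint_ops(gm: torch.fx.GraphModule) -> None:
    """Hypothetical stand-in for the in-place Torch -> Thunder conversion."""
    ...


def prepare_for_benchmark(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    # Keep an unmodified copy for ThunderCompilerGraphBenchmarking and for
    # saving the reproduction script, since the converter mutates `gm`.
    if version.parse(torch.__version__) < version.parse("2.6"):
        # Deep-copying a GraphModule that contains a checkpointed body was
        # broken before pytorch/pytorch#139275 was fixed in Torch 2.6.
        raise RuntimeError("checkpoint benchmarking requires torch>=2.6")
    original_gm = copy.deepcopy(gm)
    convert_checkpoint_ops(gm)  # replaces ops in place
    return original_gm
```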
Note:
- `test_thundercompiler_optim_step` is skipped.
- For `tag_activation_checkpoint`: before 2.6, in order to get the input tensors of `submod_1`, we need to calculate the `wrap_body_0` input ourselves (`wrap_body_0` is a placeholder node in the `submod_1` module, and there's no `example_value` in `node.meta`). In 2.6, `wrap_body_0` is a `get_attr` node in the `submod_1` module, not an input. Since the latest Torch is much cleaner, we don't currently support benchmark checkpointing in Torch<2.6.
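A minimal inspection sketch of that difference (illustrative only; it assumes plain `torch.fx` APIs, and matching on the `wrap_body` name is just how Dynamo happens to name the checkpoint body):

```python
import torch


def describe_checkpoint_body(submod: torch.fx.GraphModule) -> None:
    # Show how the checkpoint body function (e.g. `wrap_body_0`) appears
    # in the FX graph of a submodule such as `submod_1`.
    for node in submod.graph.nodes:
        if "wrap_body" not in node.name:
            continue
        if node.op == "placeholder":
            # Torch < 2.6: the body is an input of the submodule, and
            # node.meta carries no "example_value", so the input tensors
            # would have to be computed by hand.
            print(f"{node.name}: placeholder, example_value present: "
                  f"{'example_value' in node.meta}")
        elif node.op == "get_attr":
            # Torch >= 2.6: the body is an attribute of the submodule
            # rather than an input.
            print(f"{node.name}: get_attr -> {node.target}")
```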
Fixes #1381.