[Performance] Sequential onloading #1263

kylesayrs · 2025-03-18T05:01:02Z

Purpose

Reduce hardware requirements when calibrating large models by onloading only one layer at a time when calibrating with the sequential pipeline.

Updating the examples can be done after pipeline extraction lands. Examples which only use the basic pipeline should dispatch to "auto", while examples which use GPTQ should dispatch to the CPU and set oneshot_device.

Usage

When using the sequential pipeline, a few behaviors change:

If your model is dispatched to the GPU (has parameters which execute on a GPU), a warning is raised:

logger.warning(
    "Calibrating a model dispatched to the gpu can potentially lead to OOM "
    "errors. Consider loading the model without a `device_map` and instead "
    "executing with `cuda:0` (set `oneshot_device` to override this default)"
)

Otherwise (if your model is dispatched to the CPU), the oneshot_device argument determines the onload device (this defaults to cuda:0 if a CUDA device is available, else the CPU):

elif oneshot_device is None:
    has_cuda = torch.cuda.is_available()
    oneshot_device = torch.device("cuda:0") if has_cuda else torch.device("cpu")
    logger.info(f"No oneshot_device passed, using {oneshot_device}")

This policy encourages users to dispatch to the CPU when using the sequential pipeline, and to dispatch to "auto" when using the basic pipeline.
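The two branches above can be read as a single decision. The following is a minimal, torch-free sketch of that policy, not the code in this PR: `resolve_onload_device` and its `gpu_dispatched` / `cuda_available` parameters are hypothetical names, devices are plain strings, and returning `None` for the GPU-dispatched case (meaning "leave the existing dispatch alone") is an assumption, since the PR description only shows the warning for that branch.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def resolve_onload_device(
    gpu_dispatched: bool,
    oneshot_device: Optional[str],
    cuda_available: bool,
) -> Optional[str]:
    """Pick the device each layer is onloaded to (hypothetical helper)."""
    if gpu_dispatched:
        # Mirrors the first branch: warn, because a GPU-dispatched model
        # may run out of memory during calibration.
        logger.warning("Calibrating a model dispatched to the gpu may OOM")
        # Assumption: None means "keep the existing dispatch".
        return None

    if oneshot_device is None:
        # Mirrors the second branch: default to cuda:0 when available.
        oneshot_device = "cuda:0" if cuda_available else "cpu"
        logger.info("No oneshot_device passed, using %s", oneshot_device)

    return oneshot_device
```

An explicitly passed `oneshot_device` always wins for a CPU-dispatched model; the CUDA check is only a fallback.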

Changes

Keep layer parameters onloaded during the entire sequential calibration + compression step

Testing

Calibrated and GPTQ-compressed one layer of Deepseek-V3 with a single H100 in 50 seconds
- 4.5x Improvement over original 236 seconds
- Peak memory of ~40 GB, which can be further reduced by increasing the granularity of sequential targets
- Skipping activation offloading did not result in a performance improvement

github-actions · 2025-03-18T05:01:11Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

dsikka · 2025-03-18T14:04:48Z

src/llmcompressor/utils/helpers.py

+
+
+@contextlib.contextmanager
+def align_modules(modules: Iterable[torch.nn.Module]):


Why not keep this in compressed tensors with the other cpu offloading tools?

kylesayrs

Yep! Implementing here before the next CT release

neuralmagic/compressed-tensors#282

dsikka · 2025-03-18T14:08:34Z

src/llmcompressor/pipelines/layer_sequential/pipeline.py

-            # and is only used for capturing outputs from the newly compressed modules
-            with HooksMixin.disable_hooks():
-                for batch_index in tqdm.tqdm(range(len(dataloader)), desc=prop_desc):
+            with align_modules([layer]):


Seems like all we're doing is wrapping the forward passes in this context manager, if I'm reading this correctly?

kylesayrs

Yes. Rather than onloading then discarding for each of the 512 forward passes, we onload once for the layer and keep it onloaded through compression and propagation.
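To make the difference concrete, here is a toy sketch that counts device moves with and without an aligning context manager. It is an illustration only, not the `align_modules` implementation from this PR: `ToyModule` and `align_modules_sketch` are hypothetical stand-ins, devices are plain strings, and no tensors are involved.

```python
import contextlib
from typing import Iterable, Iterator


class ToyModule:
    """Stand-in for an offloaded layer; counts how often it changes device."""

    def __init__(self) -> None:
        self.device = "cpu"
        self.transfers = 0

    def to(self, device: str) -> None:
        if device != self.device:
            self.device = device
            self.transfers += 1

    def forward(self, x: float) -> float:
        # If nobody aligned the module beforehand, onload for this call
        # only and offload again afterwards.
        needs_onload = self.device == "cpu"
        if needs_onload:
            self.to("cuda:0")
        out = x * 2
        if needs_onload:
            self.to("cpu")
        return out


@contextlib.contextmanager
def align_modules_sketch(
    modules: Iterable[ToyModule], device: str = "cuda:0"
) -> Iterator[None]:
    """Onload all modules once, then restore their devices on exit."""
    modules = list(modules)
    original = [m.device for m in modules]
    for m in modules:
        m.to(device)
    try:
        yield
    finally:
        for m, dev in zip(modules, original):
            m.to(dev)


# Without alignment: every forward pass onloads and offloads.
per_pass = ToyModule()
for _ in range(512):
    per_pass.forward(1.0)

# With alignment: one onload and one offload around all passes.
aligned = ToyModule()
with align_modules_sketch([aligned]):
    for _ in range(512):
        aligned.forward(1.0)
```

In this toy, `per_pass.transfers` ends at 1024 (two moves per pass) while `aligned.transfers` ends at 2.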

dsikka · 2025-03-18T14:13:26Z

src/llmcompressor/pipelines/sequential/helpers.py

@@ -310,11 +313,13 @@ def partition_graph(model: Module, partitions: List[List[Node]]) -> List[Subgrap
        # save the subgraph for this partition
        graph.lint()
        input_names = set(node.name for node in graph.nodes if node.op == "placeholder")
+        modules = get_subgraph_modules(graph, parent_graph)


Do you mind explaining what we're changing in our graph partition here?

kylesayrs

The graph partition doesn't change. This change just collects all the modules used by this subgraph so that the sequential pipeline can onload and offload them.
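The body of `get_subgraph_modules` is not shown in this thread. As a rough illustration of the idea (collect the modules a subgraph calls, without altering the partition), here is a toy sketch. `ToyNode` and `collect_subgraph_modules` are hypothetical, and the assumption that modules are identified by `call_module` nodes whose target is a qualified submodule name follows the general `torch.fx` convention rather than anything confirmed by the PR.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class ToyNode:
    op: str      # e.g. "placeholder", "call_module", "call_function", "output"
    target: str  # for "call_module" nodes: qualified name of the submodule


def collect_subgraph_modules(
    nodes: List[ToyNode], named_modules: Dict[str, object]
) -> List[object]:
    """Return each module called by the subgraph once, in first-use order."""
    seen: Set[str] = set()
    modules: List[object] = []
    for node in nodes:
        if node.op == "call_module" and node.target not in seen:
            seen.add(node.target)
            modules.append(named_modules[node.target])
    return modules


# Toy parent model: three named submodules, only two used by this subgraph.
parent = {
    "layers.0.attn": "attn-module",
    "layers.0.mlp": "mlp-module",
    "layers.1.attn": "other-module",
}
subgraph = [
    ToyNode("placeholder", "x"),
    ToyNode("call_module", "layers.0.attn"),
    ToyNode("call_function", "add"),
    ToyNode("call_module", "layers.0.mlp"),
    ToyNode("call_module", "layers.0.attn"),  # reused; collected only once
    ToyNode("output", "output"),
]
used = collect_subgraph_modules(subgraph, parent)
```

The resulting list is what a pipeline could hand to an aligning context manager, so that only the modules of the current subgraph are onloaded.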

brian-dellabetta

sorry, i approved this thinking it was the one-liner removing clear-ml, will have to take a closer look

brian-dellabetta

I am understanding this for the most part -- very cool!

kylesayrs added the ready When a PR is ready for review label Mar 18, 2025

kylesayrs self-assigned this Mar 18, 2025

brian-dellabetta previously approved these changes Mar 18, 2025

brian-dellabetta self-requested a review March 18, 2025 14:17

dsikka requested changes Mar 18, 2025

brian-dellabetta reviewed Mar 18, 2025

kylesayrs requested review from dsikka and brian-dellabetta March 18, 2025 17:12

brian-dellabetta previously approved these changes Mar 18, 2025

kylesayrs dismissed brian-dellabetta’s stale review via 6a3733a March 18, 2025 20:03

kylesayrs mentioned this pull request Mar 18, 2025

[Utils] add align_modules neuralmagic/compressed-tensors#282

Open

brian-dellabetta mentioned this pull request Mar 18, 2025

[Question] Has anyone successfully quantinize Deepseek-V3 to int4-w4a16? #1203

Open

brian-dellabetta previously approved these changes Mar 18, 2025

kylesayrs dismissed brian-dellabetta’s stale review via 63d1934 March 19, 2025 04:07

brian-dellabetta previously approved these changes Mar 19, 2025

kylesayrs mentioned this pull request Mar 21, 2025

Has anyone successfully quantinize Deepseek-R1 to w8a8? #1274

Open

kylesayrs dismissed brian-dellabetta’s stale review via cf09876 March 27, 2025 17:52

dsikka and others added 10 commits March 27, 2025 13:52

Remove clear_ml (#1261)

d72de48

SUMMARY: - Remove requirement for tokens and the one test which uses them Signed-off-by: Kyle Sayers <[email protected]>

sequential onloading

1b6b3cd

Signed-off-by: Kyle Sayers <[email protected]>

apply changes to layer_sequential pipeline

6185d6f

Signed-off-by: Kyle Sayers <[email protected]>

remove gha

7baba06

Signed-off-by: Kyle Sayers <[email protected]>

get submodules first, update docstring

76c5ca1

Signed-off-by: Kyle Sayers <[email protected]>

fix typo

0fbf6d0

Signed-off-by: Kyle Sayers <[email protected]>

Remove click (#1262)

e47a495

Signed-off-by: Kyle Sayers <[email protected]>

fix bug in align_modules

54d5f2c

Signed-off-by: Kyle Sayers <[email protected]>

Update src/llmcompressor/utils/helpers.py

6b3867d

Co-authored-by: Brian Dellabetta <[email protected]> Signed-off-by: Kyle Sayers <[email protected]>

appropriate oneshot_device for determinig onloading

72e7683

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs force-pushed the kylesayrs/sequential-onloading branch from cf09876 to 72e7683 Compare March 27, 2025 17:53

kylesayrs added 4 commits March 27, 2025 13:54

Merge remote-tracking branch 'origin' into kylesayrs/sequential-onloa…

1899d82

…ding

Merge remote-tracking branch 'origin' into kylesayrs/sequential-onloa…

4b5cf61

…ding

fix align_modules

98ee869

Signed-off-by: Kyle Sayers <[email protected]>

fix is_gpu_dispatched

61da757

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs mentioned this pull request Apr 8, 2025

Extend usability of calculate_offload_device_map #768

Closed

dispatching

382c3e6

Signed-off-by: Kyle Sayers <[email protected]>
