
[BugFix] Multi-gpu temp bug fix #1286

Draft · wants to merge 1 commit into base: main
Conversation

@horheynm (Collaborator) commented Mar 26, 2025

Signed-off-by: George Ohashi [email protected]

SUMMARY:
A bug introduced in transformers >4.50.0 breaks the data-parallel forward pass in multi-GPU runs.

...
src/llmcompressor/transformers/finetune/session_mixin.py:289: in compute_loss
    loss = super().compute_loss(
venv/lib/python3.10/site-packages/transformers/trainer.py:3783: in compute_loss
    outputs = model(**inputs)
venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1736: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1747: in _call_impl
    return forward_call(*args, **kwargs)
venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py:183: in forward
    inputs, module_kwargs = self.scatter(inputs, kwargs, self.device_ids)
venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py:207: in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
...

Fixed by monkey-patching the test to use a single GPU. The alternative is to pin a lower transformers version.
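The PR body does not include the patch itself; below is a minimal sketch of the single-GPU approach, assuming it works by restricting `CUDA_VISIBLE_DEVICES` in the test setup (the helper function is purely illustrative, not part of the repo):

```python
import os

# Hedged sketch, not the PR's actual diff: exposing only one GPU to the test
# process keeps transformers' Trainer from wrapping the model in
# torch.nn.DataParallel, sidestepping the scatter() failure in the traceback
# above. This must run before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

def visible_gpu_count() -> int:
    """Count devices exposed via CUDA_VISIBLE_DEVICES (illustrative helper)."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return len([d for d in visible.split(",") if d])
```

With a single visible device, `torch.cuda.device_count()` reports 1 and the Trainer takes the plain single-GPU code path instead of `DataParallel.scatter`.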

Failures are seen here:
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14014229250/job/39237843197

TEST PLAN:
Pass nightly tests


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@horheynm horheynm marked this pull request as ready for review March 26, 2025 16:08
@horheynm horheynm added the ready When a PR is ready for review label Mar 26, 2025
@kylesayrs (Collaborator)

Does this bug affect anyone else? Are there any other related issues on transformers?

@dsikka dsikka marked this pull request as draft April 1, 2025 15:18
@dsikka (Collaborator) commented Apr 1, 2025

Converting to draft as pinning to 4.49 temporarily
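The temporary pin can be expressed as a dependency constraint; this is an illustrative fragment, and the exact constraint used by the repo may differ:

```
# requirements fragment (illustrative): stay on 4.49.x until the
# data-parallel regression in transformers >4.50.0 is resolved
transformers>=4.49,<4.50
```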

@dsikka dsikka self-assigned this Apr 1, 2025