
Feat (llm/awq): activation-aware weight scaling #1213

Open
pablomlago wants to merge 11 commits into dev
Conversation

@pablomlago pablomlago commented Mar 7, 2025

Reason for this PR

Implementation of AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.

Using weight-only quantization and the configuration:

```
weight_bit_width: 3
weight_group_size: 128
weight_quant_granularity: per_group
weight_quant_type: asym
scaling_min_val: 0.00001
quantize_weight_zero_point: true
```

| Method | OPT-125M | Llama3 1B |
|---|---|---|
| Float16 | 23.77 | 8.77 |
| RTN | 45.72 | 34.38 |
| AWQ repo | 31.53 | 15.16 |
| AWQ scale | 33.97 | 19.39 |
| AWQ clip | 34.22 | 20.80 |
| AWQ scale+clip | 31.53 | 15.77 |

Changes Made in this PR

  • Created a dataclass RegionAWQ, inheriting from Region, to aggregate the information of the modules on which AWQ optimizes the scale.
  • Adapted auto_scale and auto_clip to rely on Brevitas quantizers (a condensed sketch of the scale search is shown below).
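For readers unfamiliar with AWQ, the scale search that auto_scale performs is essentially a grid search over an exponent alpha applied to per-input-channel activation magnitudes, as in the AWQ paper. The sketch below is a hypothetical, condensed illustration of that idea; the function name, arguments, and the quantize_fn callable are placeholders and do not reflect the PR's actual Brevitas integration.

```python
import torch


@torch.no_grad()
def search_awq_scale(linear, inp, quantize_fn, n_grid=20):
    """Grid-search the per-input-channel scale that minimizes output error.

    linear: an nn.Linear whose weights will be scaled and quantized.
    inp: calibration activations feeding `linear`.
    quantize_fn: placeholder for a (fake-)quantization function of the weights.
    """
    # Per-input-channel activation magnitude: large values indicate "salient" channels.
    x_mean = inp.abs().reshape(-1, inp.shape[-1]).mean(dim=0)
    fp_out = linear(inp)  # reference output before scaling/quantization
    orig_weight = linear.weight.data.clone()
    best_loss, best_scale = float("inf"), None
    for i in range(n_grid):
        alpha = i / n_grid
        scale = x_mean.clamp(min=1e-5) ** alpha
        scale = scale / (scale.max() * scale.min()).sqrt()  # normalization as in the reference impl.
        # Fold the scale into the weights, quantize, and measure the output error.
        linear.weight.data = quantize_fn(orig_weight * scale) / scale
        loss = (fp_out - linear(inp)).pow(2).mean().item()
        if loss < best_loss:
            best_loss, best_scale = loss, scale
    linear.weight.data = orig_weight  # restore; the chosen scale is applied separately
    return best_scale
```

auto_clip follows the same pattern, but searches over a clipping range for the weights instead of an activation-derived scale.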

Testing Summary

Testing apply_awq against the author's repository.

Risk Highlight

  • This PR includes code from another work (please detail).
  • This PR contains API-breaking changes.
  • This PR depends on work in another PR (please provide links/details).
  • This PR introduces new dependencies (please detail).
  • There are coverage gaps not covered by tests.
  • Documentation updates required in subsequent PR.

Checklist

  • Code comments added to any hard-to-understand areas, if applicable.
  • Changes generate no new warnings.
  • Updated any relevant tests, if applicable.
  • No conflicts with destination dev branch.
  • I reviewed my own code changes.
  • Initial CI/CD passing.
  • 1+ reviews given, and any review issues addressed and approved.
  • Post-review full CI/CD passing.

@pablomlago pablomlago marked this pull request as ready for review March 10, 2025 12:09
@pablomlago pablomlago requested a review from Giuseppe5 March 10, 2025 12:09
@pablomlago pablomlago changed the title from "[DRAFT] Feat (llm/awq): activation-aware weight scaling" to "Feat (llm/awq): activation-aware weight scaling" on Mar 10, 2025
@@ -251,6 +250,48 @@ def apply(self, model, is_training, quantization_enabled):
self.enable_param_quantization(model, is_training)


class disable_enable_quantization:

Potentially interesting for all use cases where we need to do this. I'd expose flags to disable weight/act/bias quantization
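For illustration, the flag-based variant being suggested could look roughly like the sketch below. It is hypothetical: the placeholder `_set_quant_enabled` stands in for Brevitas' actual enable/disable helpers, and the parameter names are illustrative only.

```python
from contextlib import contextmanager


def _set_quant_enabled(model, *, weight=True, act=True, bias=True):
    # Placeholder: in practice this would walk the model and switch the
    # corresponding weight/act/bias quantization proxies on or off.
    pass


@contextmanager
def disable_enable_quantization(model, disable_weight=True, disable_act=True, disable_bias=False):
    # Temporarily disable the selected quantizers within the context.
    _set_quant_enabled(model, weight=not disable_weight, act=not disable_act, bias=not disable_bias)
    try:
        yield model
    finally:
        # Re-enable everything on exit.
        _set_quant_enabled(model, weight=True, act=True, bias=True)
```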

@@ -780,9 +781,11 @@ def _no_equalize():
for module in chain(src_axes.values(), sink_axes.values()):
rewriters.extend(module.instantiate_rewriters(rewriter_class, scaling_factors))

# Apply rewriters before offloading
# Apply rewriters before offloading, if parametrize_inplace is True. Note that parametrizations
# are not applied immediately, to prevent potential errors if the model is offloaded.

Can you elaborate a bit more on the issue here?

raise ValueError # early exit to break later inference

# patch layer 0 to catch input and kwargs
layers[0] = Catcher(layers[0])
blocks[0] = Catcher(blocks[0])

I don't think we need this part of the codebase; why can't we do what we do in GPTQ to catch the input to the first block?
We can also move that piece of code to some utils in examples/common/generative.
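For context, the Catcher pattern under discussion is the usual calibration trick from AWQ/GPTQ-style code: wrap the first block, record its input and kwargs, and abort the forward pass. The sketch below is a simplified illustration, not the PR's exact implementation.

```python
import torch.nn as nn


class Catcher(nn.Module):
    """Wraps the first transformer block to capture its inputs during calibration."""

    def __init__(self, module):
        super().__init__()
        self.module = module
        self.inputs = []
        self.kwargs = None

    def forward(self, x, **kwargs):
        # Record the first block's input and kwargs, then abort the forward
        # pass so the rest of the model is never executed.
        self.inputs.append(x)
        self.kwargs = kwargs
        raise ValueError  # early exit, caught by the calibration loop
```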

@@ -64,3 +65,30 @@ def run(*args, **kwargs):
return function(*args, **kwargs)

return run


def longest_common_prefix(strings: List[str]):

This seems overly specific to AWQ; not sure if this should live here.
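For reference, a helper with that signature typically reduces to a character-wise common prefix over the given strings. The body below is an illustrative implementation only; the PR's actual code may differ.

```python
import os
from typing import List


def longest_common_prefix(strings: List[str]) -> str:
    # os.path.commonprefix performs a plain character-wise common-prefix
    # computation, despite living in os.path.
    return os.path.commonprefix(strings)
```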

"ffn.act": block.ffn.act,
"ffn.down_proj": block.ffn.down_proj,},
))
elif "falcon" in str(block.__class__).lower():

Only Llama for now

@pablomlago pablomlago requested a review from Giuseppe5 March 12, 2025 09:44