Feat (llm/awq): activation-aware weight scaling #1213
base: dev
Conversation
@@ -251,6 +250,48 @@ def apply(self, model, is_training, quantization_enabled):
        self.enable_param_quantization(model, is_training)


class disable_enable_quantization:
Potentially interesting for all use cases where we need to do this. I'd expose flags to disable weight/act/bias quantization
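A minimal sketch of what such a context manager with per-kind flags could look like. The proxy attribute names (`weight_quant`, `input_quant`, `bias_quant`) and the `disable_quant` toggle follow common Brevitas conventions, but treat the whole snippet as an assumption, not the PR's actual API:

```python
# Sketch only: disable quantization per tensor kind inside a context,
# restoring the previous state on exit. Attribute names are assumptions.
class disable_enable_quantization:

    def __init__(self, model, disable_weight=True, disable_act=True, disable_bias=True):
        self.model = model
        self.kinds = []
        if disable_weight:
            self.kinds.append('weight_quant')
        if disable_act:
            self.kinds.append('input_quant')
        if disable_bias:
            self.kinds.append('bias_quant')
        self.saved = []  # (proxy, previous disable_quant value)

    def __enter__(self):
        for module in self.model.modules():
            for kind in self.kinds:
                proxy = getattr(module, kind, None)
                if proxy is not None and hasattr(proxy, 'disable_quant'):
                    self.saved.append((proxy, proxy.disable_quant))
                    proxy.disable_quant = True
        return self.model

    def __exit__(self, exc_type, exc_value, traceback):
        for proxy, prev in self.saved:
            proxy.disable_quant = prev
        self.saved.clear()
        return False
```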
src/brevitas/graph/equalize.py (outdated)
@@ -780,9 +781,11 @@ def _no_equalize():
    for module in chain(src_axes.values(), sink_axes.values()):
        rewriters.extend(module.instantiate_rewriters(rewriter_class, scaling_factors))

-   # Apply rewriters before offloading
+   # Apply rewriters before offloading, if parametrize_inplace is True. Note that parametrizations
+   # are not applied immediately, to prevent potential errors if the model is offloaded.
Can you elaborate a bit more on the issue here?
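For context, a hedged sketch of the failure mode that deferred application presumably avoids: `torch.nn.utils.parametrize.register_parametrization` eagerly runs the parametrization's forward to validate it, which can fail once a weight has been offloaded (simulated below with the meta device). Everything besides the torch APIs is illustrative:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parametrize

class ScaleWeight(nn.Module):
    """Toy parametrization that rescales a weight."""

    def __init__(self, scale):
        super().__init__()
        self.register_buffer('scale', scale)

    def forward(self, weight):
        return weight * self.scale

# On a regular module the eager validation pass succeeds
linear = nn.Linear(4, 4)
parametrize.register_parametrization(linear, 'weight', ScaleWeight(torch.ones(4, 1)))

# On an offloaded module (meta device as a stand-in) it does not,
# which is one reason to apply rewriters before offloading
offloaded = nn.Linear(4, 4).to('meta')
try:
    parametrize.register_parametrization(offloaded, 'weight', ScaleWeight(torch.ones(4, 1)))
except Exception as exc:
    print(f'registration on an offloaded weight failed: {exc}')
```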
    raise ValueError  # early exit to break later inference

    # patch layer 0 to catch input and kwargs
-   layers[0] = Catcher(layers[0])
+   blocks[0] = Catcher(blocks[0])
I don't think we need this part of the codebase; why can't we do what we do in GPTQ to catch the input to the first block?
We could also move that piece of code to some utils in examples/common/generative.
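For reference, a sketch of the Catcher trick being discussed, as used in the AWQ/GPTQ reference implementations: wrap the first block so the first forward call records its input and kwargs, then abort the rest of the run with an exception the caller catches. The usage names (`blocks`, `model`, `calibration_batch`) are assumptions from the surrounding code:

```python
import torch.nn as nn

class Catcher(nn.Module):
    """Records the first block's input and kwargs, then aborts inference."""

    def __init__(self, module):
        super().__init__()
        self.module = module
        self.inputs = None
        self.kwargs = None

    def forward(self, x, **kwargs):
        self.inputs = x
        self.kwargs = kwargs
        raise ValueError  # early exit to break later inference

# Usage sketch: run one calibration batch, catching the deliberate exit.
# blocks[0] = Catcher(blocks[0])
# try:
#     model(calibration_batch)
# except ValueError:
#     pass
# first_input, first_kwargs = blocks[0].inputs, blocks[0].kwargs
# blocks[0] = blocks[0].module  # restore the original block
```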
src/brevitas/utils/python_utils.py (outdated)
@@ -64,3 +65,30 @@ def run(*args, **kwargs):
        return function(*args, **kwargs)

    return run


def longest_common_prefix(strings: List[str]):
This seems overly specific to AWQ; I'm not sure it should live here.
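The diff above only shows the signature; for reference, a straightforward implementation (not necessarily the PR's) can delegate to the standard library:

```python
import os
from typing import List

def longest_common_prefix(strings: List[str]) -> str:
    # os.path.commonprefix compares character by character, so it works
    # on arbitrary strings, not just filesystem paths
    if not strings:
        return ''
    return os.path.commonprefix(strings)
```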
"ffn.act": block.ffn.act, | ||
"ffn.down_proj": block.ffn.down_proj,}, | ||
)) | ||
elif "falcon" in str(block.__class__).lower(): |
Only Llama for now
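For comparison, a hedged sketch of a Llama-only mapping. The attribute names follow HuggingFace's `LlamaDecoderLayer`; the helper itself is hypothetical, not the PR's code:

```python
# Hypothetical helper: name-to-module map for a HuggingFace LlamaDecoderLayer,
# from which AWQ regions could be built
def llama_block_modules(block):
    return {
        "self_attn.q_proj": block.self_attn.q_proj,
        "self_attn.k_proj": block.self_attn.k_proj,
        "self_attn.v_proj": block.self_attn.v_proj,
        "self_attn.o_proj": block.self_attn.o_proj,
        "mlp.gate_proj": block.mlp.gate_proj,
        "mlp.up_proj": block.mlp.up_proj,
        "mlp.down_proj": block.mlp.down_proj,
    }
```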
Reason for this PR

Implementation of AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.
Using weight-only quantization and the following configuration:
Changes Made in this PR

- `RegionAWQ`, inheriting from `Region`, to aggregate the information of the modules on which AWQ optimizes the scale.
- `auto_scale` and `auto_clip`, reworked to rely on Brevitas quantizers (a sketch of the scale search is given below).
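For readers unfamiliar with AWQ, a minimal sketch of the scale search the paper describes (Lin et al., 2023): for each candidate exponent alpha, scale the weights channel-wise by the mean activation magnitude raised to alpha, fake-quantize, fold the inverse scale into the input, and keep the alpha with the smallest output error. The toy `fake_quant` stands in for a Brevitas weight quantizer; nothing here is the PR's actual `auto_scale` implementation:

```python
import torch

def fake_quant(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    # Toy symmetric per-output-channel quantizer, for illustration only
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def search_awq_scale(weight: torch.Tensor, x: torch.Tensor, n_grid: int = 20):
    # weight: [out_features, in_features]; x: [n_tokens, in_features]
    act_mean = x.abs().mean(dim=0).clamp(min=1e-8)  # per-input-channel statistic
    ref_out = x @ weight.t()
    best_err, best_scale = float('inf'), None
    for i in range(n_grid):
        alpha = i / n_grid
        s = act_mean ** alpha
        s = s / (s.max() * s.min()).sqrt()  # normalize, as in the reference code
        q_w = fake_quant(weight * s)        # scale weights up, then quantize
        out = (x / s) @ q_w.t()             # fold the inverse scale into the input
        err = (out - ref_out).pow(2).mean().item()
        if err < best_err:
            best_err, best_scale = err, s
    return best_scale
```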
Testing Summary

- Tested `apply_awq` against the author's repository.

Risk Highlight
Checklist

- Pull request is against the `dev` branch.