Add initial Sample-Layer Attention for GPTQ (PyTorch) #1237

Merged: 5 commits into main, Oct 7, 2024

Conversation

@irenaby (Collaborator) commented Oct 6, 2024

Pull Request Description:

Add Hessian estimation per image hash.
Add a sample-layer attention distillation loss (a sketch appears below).
Add per-layer weights to the soft rounding loss.
Update the GPTQ config and its generation for sample-layer attention.
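
For orientation, here is a minimal sketch of what a sample-layer-attention weighted distillation loss can look like; the function name, shapes, and reduction are illustrative assumptions, not this PR's exact implementation.

```python
import torch

def sla_distillation_loss(student_acts: list, teacher_acts: list,
                          layer_weights: torch.Tensor) -> torch.Tensor:
    # student_acts / teacher_acts: per-layer activation tensors of shape
    # [batch, features...]; layer_weights: [num_layers, batch] attention
    # scores, e.g. derived from per-image Hessian approximations.
    per_layer = []
    for s, t in zip(student_acts, teacher_acts):
        # Per-sample squared error, reduced over all non-batch dimensions.
        per_layer.append((s - t).pow(2).flatten(start_dim=1).mean(dim=1))
    per_layer = torch.stack(per_layer)            # [num_layers, batch]
    # Weight each (sample, layer) error by its attention score.
    return (layer_weights * per_layer).sum(dim=0).mean()
```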

Checklist before requesting a review:

  • I set the appropriate labels on the pull request.
  • I have added/updated the release note draft (if necessary).
  • I have updated the documentation to reflect my changes (if necessary).
  • All functions and files are well documented.
  • All functions and classes have type hints.
  • There is a license in all files.
  • The function and variable names are informative.
  • I have checked for code duplications.
  • I have added new unit tests (if necessary).

@@ -55,98 +56,145 @@ def __init__(self,
hessian_scores_request=hessian_scores_request,
num_iterations_for_approximation=num_iterations_for_approximation)

def forward_pass(self):
Author comment:
no change, just extracted to method


def _compute_per_tensor(self, output, target_activation_tensors):
assert self.hessian_request.granularity == HessianScoresGranularity.PER_TENSOR
ipts_hessian_approx_scores = [torch.tensor([0.0], requires_grad=True, device=output.device)
Author comment:
no change (except line 177), just extracted to method
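
Since the thread references the extracted per-tensor computation without showing it in full, here is a minimal self-contained sketch of Hutchinson-style Hessian score approximation using the same running-mean update that appears later in the diff; it is illustrative, not the PR's code:

```python
import torch

def hutchinson_hessian_score(output: torch.Tensor, inp: torch.Tensor,
                             num_iters: int = 100) -> torch.Tensor:
    """Approximates trace(H), where H is the Hessian of `output` w.r.t.
    `inp`, via Hutchinson's estimator: trace(H) = E[v^T H v] for random
    probe vectors v with E[v v^T] = I."""
    grad = torch.autograd.grad(output, inp, create_graph=True)[0]
    score = torch.tensor(0.0)
    for j in range(num_iters):
        v = torch.randn_like(inp)  # Gaussian probes; Rademacher also works
        # Hessian-vector product via a second backward pass.
        hv = torch.autograd.grad(grad, inp, grad_outputs=v,
                                 retain_graph=True)[0]
        # Running mean over iterations, same update form as in the diff.
        score = (j * score + (v * hv).sum()) / (j + 1)
    return score

# E.g. for y = (x ** 3).sum(), trace(H) = 6 * x.sum():
#   x = torch.randn(4, requires_grad=True)
#   hutchinson_hessian_score((x ** 3).sum(), x)
```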

@@ -40,6 +40,14 @@ class HessianScoresGranularity(Enum):
PER_TENSOR = 2


class HessianEstimationDistribution(str, Enum):
Reviewer comment:
Why do we need this? I think you can use Rademacher in all cases and remove this enum and the addition to the GPTQ config.
The results (on average) should be similar for both methods, if I'm not mistaken (unless you've seen different behavior?).

Author reply:
I think I saw different results, but I'm not sure. In any case it changes the existing behavior, so I don't think this belongs in this PR. This isn't exposed to the user anyway, so there is no problem removing it later.
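
For context, the enum under discussion might be defined along these lines (only the class name appears in the diff; the member names are an assumption based on the discussion). Both distributions satisfy E[v vᵀ] = I, which is why Hutchinson's estimator is unbiased for either choice; only its variance differs.

```python
from enum import Enum

class HessianEstimationDistribution(str, Enum):
    """Distribution of the random probe vectors used for Hutchinson-style
    Hessian estimation."""
    GAUSSIAN = 'gaussian'
    RADEMACHER = 'rademacher'
```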

return hessian_score_by_image_hash

@staticmethod
def calc_image_hash(image):
Reviewer comment:
Please add documentation
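
A documented version might look like this; the hashing scheme shown (hashing the raw tensor bytes) is an assumption for illustration, not necessarily the PR's actual implementation:

```python
import hashlib
import torch

class HessianScoresService:  # hypothetical container class for illustration
    @staticmethod
    def calc_image_hash(image: torch.Tensor) -> str:
        """Computes a deterministic hash for a single input image so that
        per-sample Hessian scores can be cached and reused across batches.

        Args:
            image: A single image tensor (no batch dimension).

        Returns:
            A hex digest uniquely identifying the image contents.
        """
        return hashlib.sha256(image.detach().cpu().numpy().tobytes()).hexdigest()
```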

if not isinstance(inputs_batch, list):
raise TypeError('Expected a list of inputs') # pragma: no cover
if len(inputs_batch) > 1:
raise NotImplementedError('Per-sample hessian computation is not supported for networks with multiple inputs') # pragma: no cover
Reviewer comment:
Maybe assert a non-empty list as well?
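
One way to add that check, extending the validation shown above:

```python
if not isinstance(inputs_batch, list):
    raise TypeError('Expected a list of inputs')  # pragma: no cover
if not inputs_batch:
    raise ValueError('Expected a non-empty list of inputs')  # pragma: no cover
if len(inputs_batch) > 1:
    raise NotImplementedError('Per-sample hessian computation is not supported '
                              'for networks with multiple inputs')  # pragma: no cover
```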

hessian_scores_request=hessian_scores_request,
num_iterations_for_approximation=self.num_iterations_for_approximation)
hessian_scores = fw_hessian_calculator.compute()
for b in range(inputs_batch[0].shape[0]):
Reviewer comment:
Please rename 'b' (it made me think this is a batch, but this is a single image)
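
The suggested rename might read, for example:

```python
# 'image_idx' makes it explicit that this iterates over single images
# within the batch, not over batches.
for image_idx in range(inputs_batch[0].shape[0]):
    ...
```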

output = self.concat_tensors(output_tensors)
return output, target_activation_tensors

def _generate_random_vectors_batch(self, shape, distribution: HessianEstimationDistribution, device) -> torch.Tensor:
Reviewer comment:
Please add missing type hints
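
With the requested annotations, the method could look like this; the body is a sketch consistent with Hutchinson probing, not necessarily the PR's exact implementation:

```python
def _generate_random_vectors_batch(self, shape: torch.Size,
                                   distribution: HessianEstimationDistribution,
                                   device: torch.device) -> torch.Tensor:
    if distribution == HessianEstimationDistribution.GAUSSIAN:
        return torch.randn(shape, device=device)
    if distribution == HessianEstimationDistribution.RADEMACHER:
        # Uniform random +/-1 entries.
        return (2 * torch.randint(0, 2, shape, device=device) - 1).float()
    raise ValueError(f'Unexpected distribution {distribution}')
```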


# Update node Hessian approximation mean over random iterations
ipts_hessian_approx_scores[i] = (j * ipts_hessian_approx_scores[i] + hessian_approx_scores) / (j + 1)

Reviewer comment:
There's an overlap between _compute_per_tensor and _compute_per_channel. Is there a good reason not to extract the common logic into a shared function to avoid this duplication?

Author reply:
It's not exactly the same. I'm sure it can be rewritten, but it's not as straightforward as in the other places that were extracted, so it was not a priority.

self.hessian_service.compute_trackable_per_sample_hessian(request, inputs)
)
for img_hash, v in hessian_score_per_image_per_layer.items():
hessian_score_per_image_per_layer[img_hash] = {k: t.max(axis=0) for k, t in v.items()}
Reviewer comment:
Why are we computing t.max(axis=0)?

Author reply:
That's the definition of sample-layer attention. My understanding is that we are trying to approximate an upper bound, so we take the max over channels (this is per image, so axis=0 is the channel axis).
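
Concretely, the reduction can be pictured like this (a sketch with assumed shapes):

```python
import torch

# Hypothetical per-image Hessian approximation for one layer:
# one score per channel.
per_channel_scores = torch.tensor([0.1, 0.7, 0.3])  # shape [num_channels]

# Sample-layer attention keeps the worst case (upper bound) across channels,
# yielding one scalar score for this (image, layer) pair.
layer_score = per_channel_scores.max(dim=0).values  # tensor(0.7)
```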

self.hessian_score_per_layer = None # for fixed layer weights
self.hessian_score_per_image_per_layer = None # for sample-layer attention
if self.use_sample_layer_attention:
assert (hessian_cfg.norm_scores is False and hessian_cfg.log_norm is False and
Reviewer comment:
Can you please add a comment on why this is true?

Author reply:
Changed to NotImplementedError and added comment.
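
The reworked guard might look something like this (the flag names are from the diff; the exact message and condition are assumptions):

```python
if self.use_sample_layer_attention:
    # Sample-layer attention consumes raw per-sample Hessian scores, so
    # score normalization / log-scaling is not supported (yet).
    if hessian_cfg.norm_scores or hessian_cfg.log_norm:
        raise NotImplementedError(
            'Sample-layer attention is not supported with normalized or '
            'log-normalized Hessian scores.')
```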

img_hashes = [self.hessian_service.calc_image_hash(img) for img in batch]
for img_hash in img_hashes:
if img_hash not in self.hessian_score_per_image_per_layer:
score_per_image_layer_per = self._compute_sample_layer_attention_scores(input_tensors)
Reviewer comment:
I guess you meant "score_per_image_per_layer"?
In general, I think this function needs to be rewritten because it's hard to track what's going on here.

Author reply:
Everything should be rewritten. Added more comments, hope it's clearer now.
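
The intended pattern, computing per-sample scores only for images whose hash has not been cached yet, can be sketched as follows (control flow simplified relative to the actual function):

```python
img_hashes = [self.hessian_service.calc_image_hash(img) for img in batch]

# Compute Hessian scores only for images we have not seen before; results
# are cached per image hash so repeated samples are scored once.
if any(h not in self.hessian_score_per_image_per_layer for h in img_hashes):
    score_per_image_per_layer = self._compute_sample_layer_attention_scores(input_tensors)
    self.hessian_score_per_image_per_layer.update(score_per_image_per_layer)
```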

raise TypeError(f'gradual_activation_quantization argument should be bool or '
f'GradualActivationQuantizationConfig, received {type(gradual_activation_quantization)}') # pragma: no cover
f'GradualActivationQuantizationConfig, received {type(gradual_activation_quantization)}')
Reviewer comment:
Please add a TODO to replace the 'no cover', since this case should be tested.

Author reply:
I think that's true for all the 'no cover' pragmas throughout the code.

@@ -40,32 +40,36 @@ def __init__(self, beta_scheduler: Callable[[int], float]):

self.count_iter = 0

def __call__(self, model: nn.Module, entropy_reg: float):
def __call__(self, model: nn.Module, entropy_reg: float, layer_weights: torch.Tensor = None):
Reviewer comment:
If I'm not mistaken, the default case (where the weighting is all ones) is taken care of outside of this function. If so, it may be better to remove the default value of layer_weights?

Author reply:
You are not mistaken. I will remove it for now due to missing testing/coverage, but in principle I don't see a connection between the two. I think it makes sense to have a default behavior here; the fact that it was eventually more convenient to pass explicit weights in this specific use case shouldn't really affect that.
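
A weighted variant of the soft-rounding regularizer along the lines discussed might look like this; `per_layer_reg` is a hypothetical stand-in for whatever per-layer term the real regularizer accumulates:

```python
from typing import List
import torch

def weighted_soft_round_reg(per_layer_reg: List[torch.Tensor],
                            entropy_reg: float,
                            layer_weights: torch.Tensor) -> torch.Tensor:
    # per_layer_reg: one scalar regularization term per quantized layer.
    # layer_weights: [num_layers] tensor, e.g. derived from Hessian scores;
    # passing torch.ones(num_layers) recovers the unweighted behavior.
    reg = torch.stack(per_layer_reg)  # [num_layers]
    return entropy_reg * (layer_weights * reg).sum()
```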

@irenaby merged commit b26dd82 into main on Oct 7, 2024
35 checks passed