Update index.html - fix typos and links
mobicham authored Nov 13, 2023
1 parent 1d44034 commit de47317
Showing 1 changed file with 23 additions and 25 deletions.
<h1 class="page-title">Half Quadratic Quantization of Large Machine Learning Models</h1>
<hr />
<p>Large Language Models (LLMs) have revolutionized various subfields of machine learning like natural language processing, speech recognition and computer vision, enabling machines to understand and generate outputs with unprecedented accuracy and fluency. However, one of the most critical challenges in deploying LLMs is their high memory requirements, for both training and inference. Quantization methods such as <a href="https://github.com/TimDettmers/bitsandbytes">bitsandbytes</a>, <a href="https://arxiv.org/abs/2210.17323">GPTQ</a> and <a href="https://github.com/mit-han-lab/llm-awq">AWQ</a> have made it possible to use large models such as the popular LLama2 with significantly less memory, enabling <a href="https://huggingface.co/models?sort=trending&search=TheBloke">the machine learning community to conduct remarkable research using a single consumer-grade GPU</a>.
</p>
<p>In this article, we propose a new quantization technique called <b>H</b>alf-<b>Q</b>uadratic <b>Q</b>uantization (<b>HQQ</b>). Our approach, requiring no calibration data, significantly speeds up the quantization of large models, while offering compression quality competitive with that of calibration-based methods. For instance, <b>HQQ</b> takes less than 8 minutes to process the colossal LLama2-70B, which is <em>27x</em> faster than the widely adopted GPTQ, while <em>significantly outperforming</em> it for extreme low-bit quantization.
</p>
<!-- <p id="c8835517-e8ec-4781-8d42-047d63df4d94" class=""><strong>Paper</strong>: <a
href="https://arxiv.org/abs/2310.06694">https://arxiv.org/abs/2310.06694</a>
<h2 id="intro" class="">Introduction</h2>
<p>Model quantization is a crucial step to deploy large models with limited resources and save costs, which is particularly relevant to LLMs for both training and inference. Software packages such as bitsandbytes have made it possible to utilize large models on consumer-grade GPUs, which has been a game-changer for the machine learning community.</p>

<p>When it comes to weight-only quantization, there are two classes of approaches: data-free techniques such as <i>bitsandbytes</i>, which rely only on the weights without any external data, and calibration-based methods such as GPTQ and AWQ, which rely on an external dataset to adjust the quantization parameters. While calibration-based methods offer better quantization quality, they suffer from two main issues:</p>
<ol>
<li><em>Calibration data bias</em>: the quality of quantization can be negatively affected if incorrect calibration data is provided.</li>
<li><em>Quantization time</em>: calibration can be a heavy computational process especially for very large models, which makes it difficult to test and deploy multiple models. </li>
</ol>
<p>Wouldn't it be great if we could achieve the quality of calibration-based methods at the speed of calibration-free quantization? That’s exactly what we propose with our method, Half-Quadratic Quantization (HQQ). </p>
<h2 id="hqq" class="">Half-Quadratic Quantization</h2>

We formulate quantization as the minimization of a sparsity-promoting loss \( \phi() \) between the original weights \( W \) and their dequantized version:

$$\underset{z,s}{\text{argmin}}\,\phi\left(W-Q_{z,s}^{-1}(Q_{z,s}(W))\right)$$

where \( Q_{z,s}() \) is the quantization operator, which depends on the zero-point \( z \) and the scaling \( s \):
$$\begin{array}{c}
Q_{z,s}(W)=\text{round}(W/s+z)=W_{q}\\
Q_{z,s}^{-1}(W_{q})=s(W_{q}-z)
\end{array}.$$
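
As a concrete illustration, here is a minimal PyTorch sketch of these two operators. The clamping to the \( b \)-bit range is an assumption of this example and is not part of the formulas above:

<pre><code class="language-python">
import torch

def quantize(W, s, z, nbits=4):
    # Q_{z,s}(W) = round(W/s + z); clamping to the b-bit grid is an added assumption
    return torch.round(W / s + z).clamp(0, 2**nbits - 1)

def dequantize(W_q, s, z):
    # Q_{z,s}^{-1}(W_q) = s * (W_q - z)
    return s * (W_q - z)
</code></pre>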

To find a solution to the problem, we adopt a <a href="https://ieeexplore.ieee.org/document/120331">Half-Quadratic solver</a> by introducing an extra variable \( W_{e} \). Moreover, to make the problem simpler, we fix the scaling \( s \) parameter and only optimize for the zero-point \( z \):

$$\underset{z,W_{e}}{\text{argmin}}\,\phi(W_{e})+\frac{\beta}{2}||W_{e}-\left(W-Q_{z}^{-1}(Q_{z}(W))\right)||_{2}^{2}$$

The problem is then solved via alternating optimization, increasing the penalty parameter \( \beta \) at every iteration:

$$\begin{array}{ll}
(\text{sp}_{1}) & W_{e}^{(t+1)}\leftarrow\underset{W_{e}}{\text{argmin}}\,\phi(W_{e})+\frac{\beta^{(t)}}{2}||W_{e}-\left(W-Q_{z}^{-1}(Q_{z}(W))\right)||_{2}^{2}\\
(\text{sp}_{2}) & z^{(t+1)}\leftarrow\underset{z}{\text{argmin}}\,\frac{1}{2}||Q_{z}^{-1}(Q_{z}(W))-\left(W-W_{e}^{(t+1)}\right)||_{2}^{2}\\
 & \beta^{(t+1)}\leftarrow\kappa\beta^{(t)}
\end{array}$$

<h4>Sub-problem (\( \text{sp}_{1} \))</h4>

This problem takes the form of a <a href="https://web.stanford.edu/~boyd/papers/pdf/prox_algs.pdf">Proximal Operator</a>. When \( \phi() \) is the \( l_{1} \) norm, the solution is the <a href="https://sparse-plex.readthedocs.io/en/latest/book/opt/soft_thresholding.html">soft-thresholding operator</a>. There exists a more general thresholding solution for the \( l_{p} \)-norm with \( 0 \leq p \leq 1 \) that we adopt, known as the <a href="https://inria.hal.science/hal-01317151/file/lowrank_ieee_tip.pdf">generalized soft-thresholding operator</a>:

$$\begin{array}{c}
W_{e}^{(t+1)}\leftarrow\text{shrink}_{l_{p}}\left(W-Q_{z}^{-1}(Q_{z}(W)),\beta\right)\\
\text{shrink}_{l_{p}}(x,\beta)=\text{sign}(x)\text{relu}(|x|-\frac{|x|^{p-1}}{\beta})
\end{array}$$
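
A minimal PyTorch sketch of this operator could look as follows; the element-wise implementation and the default value of \( p \) (matching the setting we use later) are assumptions of this example:

<pre><code class="language-python">
import torch

def shrink_lp(x, beta, p=0.7):
    # Generalized soft-thresholding: sign(x) * relu(|x| - |x|^(p-1) / beta)
    x_abs = torch.abs(x)
    return torch.sign(x) * torch.relu(x_abs - (x_abs ** (p - 1)) / beta)
</code></pre>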


<h4>Sub-problem (\( \text{sp}_{2} \))</h4>
The second sub-problem can be rewritten as follows:
$$\begin{array}{c}
z^{(t+1)}\leftarrow\underset{z}{\text{argmin}}\,\frac{1}{2}||z-\left(W_{q}^{(t+1)}-\frac{(W-W_{e}^{(t+1)})}{s}\right)||_{2}^{2}\\
W_{q}^{(t+1)}=\text{round}(W/s+z^{(t)})
\end{array}$$

The solution is simply the average over the axis the quantization grouping is performed on:
$$z^{(t+1)}\leftarrow\langle W_{q}^{(t+1)}-\frac{(W-W_{e}^{(t+1)})}{s}\rangle$$
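
In code, this closed-form update is a single mean. A hedged sketch, assuming the grouping axis is dim 0:

<pre><code class="language-python">
import torch

def update_zero_point(W, W_e, W_q, s, axis=0):
    # Closed-form solution of sp2: average over the quantization-grouping axis
    return torch.mean(W_q - (W - W_e) / s, dim=axis, keepdim=True)
</code></pre>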

The last thing to set is the \( \kappa \) parameter, which is a positive value strictly higher than 1. In our implementation, we work with the inverse of the scale \( 1/s \) instead of \( s \), which we found to be a bit more stable with half-precision calculations.<br>

Note that, contrary to using gradient descent with <a href="https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html">autograd</a>, the solution that we propose relies on closed-form updates, which means that no gradients are calculated. This allows us to run all the calculations in inference mode with half-precision. Moreover, it only takes a few iterations for the solver to converge. Conversely, using the AdamW optimizer and PyTorch’s autograd takes thousands of iterations to achieve good results, and it fails with \( p \le 1 \), which is what we actually use to promote sparsity. Thanks to the Half-Quadratic solution, our quantization method achieves a significant speed-up (over <b>100x</b> faster than autograd to quantize LLama2-7B) and can process even the largest models in only a few minutes!
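
Putting these pieces together, below is a minimal PyTorch sketch of the solver loop. It is an illustration rather than the reference implementation: the initialization of \( s \), \( z \) and \( \beta \), the grouping axis (dim 0), and the bit-width are assumptions of this example, and early stopping as well as the \( 1/s \) parametrization mentioned above are omitted for brevity.

<pre><code class="language-python">
import torch

def shrink_lp(x, beta, p):
    # Generalized soft-thresholding (see the sp1 sketch above)
    x_abs = torch.abs(x)
    return torch.sign(x) * torch.relu(x_abs - (x_abs ** (p - 1)) / beta)

@torch.inference_mode()  # closed-form updates only: no gradients are needed
def optimize_zero_point(W, s, z, nbits=4, p=0.7, beta=1.0, kappa=1.01, iters=20):
    # W is assumed to be reshaped so that dim 0 is the quantization-grouping axis.
    for _ in range(iters):
        W_q = torch.round(W / s + z).clamp(0, 2**nbits - 1)       # quantize with the current zero-point
        W_r = s * (W_q - z)                                       # dequantize
        W_e = shrink_lp(W - W_r, beta, p)                         # sp1: sparse error term
        z = torch.mean(W_q - (W - W_e) / s, dim=0, keepdim=True)  # sp2: closed-form zero-point update
        beta = kappa * beta                                       # increase the penalty
    return z
</code></pre>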

<h2 id="processing_time" class="">Processing Time</h2>
<p>We report the processing time to quantize the <a href="https://ai.meta.com/llama/">Llama2</a> models. We noticed that the processing time for GPTQ and AWQ drastically changes from one machine to another. GPTQ heavily relies on the CPU which creates issues on virtual machines, so we limit the number of threads to those available in the virtual machine (32) to avoid the process hanging for hours. Our method performs the whole quantization on the GPU with half-precision and only uses the CPU to transfer data to the GPU once the solver is finished. </p>
<center><img src="figs/llama2-7b_time.png" /></center>
<center><img src="figs/llama2-13b_time.png" /></center>
<center><img src="figs/llama2-70b_time.png" /></center>

<h2 id="benchmark" class="">Benchmark</h2>

<p>To measure the quantization quality of our method, we use the perplexity metric on the widely adopted <a href="https://huggingface.co/datasets/wikitext/viewer/wikitext-2-raw-v1">wikitext2</a> dataset. We also report the runtime GPU memory the session takes to run the quantized model (additional memory is required for prediction depending on the sequence length). We compare against the popular approaches widely used by the community: <a href="https://github.com/TimDettmers/bitsandbytes/">BNB (bitsandbytes)</a>, <a href="https://github.com/PanQiWei/AutoGPTQ">GPTQ via AutoGPTQ</a> and <a href="https://github.com/casper-hansen/AutoAWQ">AWQ via AutoAWQ</a>. </p>

<p>Regarding the parameters, we fix the Half-Quadratic solver with the following: <em>p=0.7, kappa=1.01, iterations=20</em>. Additionally, we use early-stopping to exit the solver when the error doesn’t improve. We haven’t experimented much with the parameters, so different settings might actually yield better results. Similar to the other approaches, we use grouping to quantize the weights into buffers, and we also quantize the zero-point into 8-bit without grouping or optimization. </p>
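
<p>For concreteness, these settings would plug into the solver sketch from the previous section roughly as follows; the layer shape, the group size of 64, and the min/max initialization of the scale and zero-point are illustrative assumptions, and early stopping is omitted:</p>

<pre><code class="language-python">
import torch

# Illustrative usage of the optimize_zero_point sketch from the previous section.
W = torch.randn(4096, 4096).reshape(64, -1)                    # groups of 64 weights along dim 0
s = (W.max(dim=0, keepdim=True).values
     - W.min(dim=0, keepdim=True).values) / 15                 # 4-bit range [0, 15]
z = -W.min(dim=0, keepdim=True).values / s                     # map the group minimum to 0
z = optimize_zero_point(W, s, z, nbits=4, p=0.7, beta=1.0, kappa=1.01, iters=20)
</code></pre>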

<td>7.9</td>
<td>4.67</td>
<td>14.4</td>
<td>OOM</td>
<td>OOM</td>
</tr>

<tr style="background-color: #F0F0F0;">
Expand Down Expand Up @@ -356,7 +355,7 @@ <h2 id="benchmark" class="">Benchmark</h2>
<tr style="background-color: rgb(180, 180, 180);">
<td>GPTQ_g64</td>
<td>2</td>
<td>NaN</td>
<td><b>3.5</b></td>
<td>13</td>
<td>6</td>

<p>The scatter plot below summarizes the various data points.</p>

<center><img src="figs/scatter_plot.svg" /></center>

<h2 id="conclusion">Conclusion</h2>

<p>This article demonstrates that calibration-free quantization through our proposed HQQ method can achieve a quality competitive with popular data-dependent methods like GPTQ and AWQ. We have demonstrated the effectiveness of HQQ even for extreme low-bit quantization across different model sizes. Moreover, by leveraging efficient optimization techniques such as Half-Quadratic splitting, our method cuts the quantization time to only a few minutes even for the biggest models available such as Llama2-70B. </p>

<p>We provide the code to reproduce all the results presented in this article: <a href="https://github.com/mobiusml/hqq/tree/main/code">https://github.com/mobiusml/hqq/tree/main/code</a> </p>
