
8-bit Quantization - Save model + add test + updated docs (#3)
guyjacob authored and nzmora committed May 14, 2018
1 parent 8192ab3 commit cadea06
Showing 8 changed files with 367 additions and 342 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -196,6 +196,8 @@ This example performs 8-bit quantization of ResNet20 for CIFAR10. We've include
$ python3 compress_classifier.py -a resnet20_cifar ../../../data.cifar10 --resume ../examples/ssl/checkpoints/checkpoint_trained_dense.pth.tar --quantize --evaluate
```

The command above will save a checkpoint named `quantized_checkpoint.pth.tar` containing the quantized model parameters.
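
As a quick sanity check, the saved checkpoint can be inspected with plain PyTorch. This is only a sketch: the file name comes from the text above, but the dictionary keys are assumptions -- print them to see what `compress_classifier.py` actually stores.

```python
import torch

# Load the checkpoint saved by the command above (a regular PyTorch checkpoint file).
# NOTE: the key 'state_dict' is an assumption for illustration; inspect
# checkpoint.keys() to see the actual contents.
checkpoint = torch.load('quantized_checkpoint.pth.tar', map_location='cpu')
print(checkpoint.keys())

state_dict = checkpoint.get('state_dict', checkpoint)
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)
```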

### Explore the sample Jupyter notebooks
The set of notebooks that come with Distiller is described [here](https://nervanasystems.github.io/distiller/jupyter/index.html#using-the-distiller-notebooks), which also explains the steps to install the Jupyter notebook server.<br>
After installing and running the server, take a look at the [notebook](https://github.com/NervanaSystems/distiller/blob/master/jupyter/sensitivity_analysis.ipynb) covering pruning sensitivity analysis.
31 changes: 18 additions & 13 deletions docs-src/docs/algo_quantization.md
@@ -2,17 +2,22 @@

## Symmetric Linear Quantization

**Linear** - float value quantized by multiplying with scale factor. **Symmetric** - no quantization bias (or "offset") used, so zero in the float domain is mapped to zero in the integer domain.
In the current implementation the scale factor is chosen so that the entire range of the tensor is quantized. So, we get: (Using \(q\) to denote the scale factor and \(x\) to denote the tensor being quantized)
\[q_x = \frac{2^n-1}{\max|x|}\]
\[x_q = round(q_x\cdot x_f)\]
Where \(n\) is the number of bits used for quantization.
For weights and bias the scale factor is determined once at quantization setup ("offline"), and for activations it is determined dynamically at runtime ("online").
Currently, quantized implementations are provided for **convolution** and **fully-connected** layers. These layers are quantized as follows (using \(x, y, w, b\) for input, output, weights, bias respectively):
\[y_f = \sum{x_f\cdot w_f} + b_f = \sum{x_f\cdot w_f} + b_f = \sum{\frac{x_q}{q_x}\cdot \frac{w_q}{q_w}} + \frac{b_q}{q_b} = \frac{1}{q_x q_w} \sum{(x_q w_q + \frac{q_b}{q_x q_w}b_q)}\]
\[y_q = round(q_y\cdot y_f) = round(\frac{q_y}{q_x q_w} \sum{(x_q w_q + \frac{q_b}{q_x q_w}b_q)})\]
Note how the bias has to be re-scaled to match the scale of the summation.
All other layers are executed in FP32. This is done by adding quantize and de-quantize operations at the beginning and end of the quantized layers.
In this method, a float value is quantized by multiplying with a numeric constant (the **scale factor**), hence it is **Linear**. We use a signed integer to represent the quantized range, with no quantization bias (or "offset") used. As a result, the floating-point range considered for quantization is **symmetric** with respect to zero.
In the current implementation the scale factor is chosen so that the entire range of the floating-point tensor is quantized (we do not attempt to remove outliers).
Let us denote the original floating-point tensor by \(x_f\), the quantized tensor by \(x_q\), the scale factor by \(q_x\) and the number of bits used for quantization by \(n\). Then, we get:
\[q_x = \frac{2^{n-1}-1}{\max|x_f|}\]
\[x_q = round(q_x x_f)\]
(The \(round\) operation is round-to-nearest-integer)
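
As a minimal illustration of the two formulas above, here is a short PyTorch sketch (illustrative only, not the actual Distiller implementation):

```python
import torch

def symmetric_quantize(x_f, n_bits=8):
    """Symmetric linear quantization of a float tensor to n_bits signed integers:
    q_x = (2^(n-1) - 1) / max|x_f|,  x_q = round(q_x * x_f)."""
    q_x = (2 ** (n_bits - 1) - 1) / x_f.abs().max()   # scale factor
    x_q = torch.round(q_x * x_f)                      # integer values (stored in a float tensor)
    return x_q, q_x

x_f = torch.randn(4, 4)
x_q, q_x = symmetric_quantize(x_f)
x_hat = x_q / q_x                                     # de-quantize back to float
print('max quantization error:', (x_f - x_hat).abs().max().item())
```
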
Let's see how a **convolution** or **fully-connected (FC)** layer is quantized using this method: (we denote input, output, weights and bias with \(x, y, w\) and \(b\) respectively)
\[y_f = \sum{x_f w_f} + b_f = \sum{\frac{x_q}{q_x} \frac{w_q}{q_w}} + \frac{b_q}{q_b} = \frac{1}{q_x q_w} \left(\sum{x_q w_q} + \frac{q_x q_w}{q_b}b_q\right)\]
\[y_q = round(q_y y_f) = round\left(\frac{q_y}{q_x q_w} \left(\sum{x_q w_q} + \frac{q_x q_w}{q_b}b_q\right)\right)\]
Note how the bias has to be re-scaled to match the scale of the summation.
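
The sketch below mirrors this math for a single fully-connected layer, including the bias re-scaling; again, it is illustrative only and not Distiller's code:

```python
import torch

def symmetric_quantize(t, n_bits=8):
    q = (2 ** (n_bits - 1) - 1) / t.abs().max()
    return torch.round(q * t), q

def quantized_linear(x_f, w_f, b_f, n_bits=8):
    x_q, q_x = symmetric_quantize(x_f, n_bits)   # "online" input scale
    w_q, q_w = symmetric_quantize(w_f, n_bits)   # "offline" weight scale
    b_q, q_b = symmetric_quantize(b_f, n_bits)   # "offline" bias scale

    # Accumulate in the integer domain; the bias is re-scaled by q_x*q_w/q_b
    # so that it matches the scale of the sum of products.
    acc = x_q.matmul(w_q.t()) + (q_x * q_w / q_b) * b_q

    # De-quantize back to float.
    return acc / (q_x * q_w)

x_f, w_f, b_f = torch.randn(2, 8), torch.randn(4, 8), torch.randn(4)
y_ref = x_f.matmul(w_f.t()) + b_f
print('max error vs. FP32:', (quantized_linear(x_f, w_f, b_f) - y_ref).abs().max().item())
```

Here the output is simply de-quantized to float; re-quantizing it with an output scale factor \(q_y\) would correspond to the second equation above.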

!!! note
This method is implemented as **inference only**, with no back-propagation functionality. Hence, it can only be used to quantize a pre-trained FP32 model, and as such, using it with \(n < 8\) is likely to lead to severe accuracy degradation for any non-trivial workload.
### Implementation
We've implemented **convolution** and **FC** using this method.

- They are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values (see the sketch after this list).
- All other layers are unaffected and are executed using their original FP32 implementation.
- For weights and bias the scale factor is determined once at quantization setup ("offline"), and for activations it is determined dynamically at runtime ("online").
- **Important note:** Currently, this method is implemented as **inference only**, with no back-propagation functionality. Hence, it can only be used to quantize a pre-trained FP32 model, with no re-training. As such, using it with \(n < 8\) is likely to lead to severe accuracy degradation for any non-trivial workload.
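
As a rough sketch of this wrapping approach (the class name and structure are hypothetical, not Distiller's actual wrapper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantLinearWrapper(nn.Module):
    """Hypothetical wrapper: the computation still runs on float tensors,
    but the values fed into the wrapped layer are restricted to integers."""

    def __init__(self, wrapped: nn.Linear, n_bits: int = 8):
        super().__init__()
        assert wrapped.bias is None, "bias handling omitted in this sketch"
        self.wrapped, self.n_bits = wrapped, n_bits
        # "Offline": the weight scale factor is fixed once, at quantization setup.
        self.q_w = (2 ** (n_bits - 1) - 1) / wrapped.weight.data.abs().max()
        wrapped.weight.data = torch.round(self.q_w * wrapped.weight.data)

    def forward(self, x_f):
        # "Online": the activation scale factor is computed at runtime.
        q_x = (2 ** (self.n_bits - 1) - 1) / x_f.abs().max()
        x_q = torch.round(q_x * x_f)                 # quantize
        y = F.linear(x_q, self.wrapped.weight)       # integer-valued accumulation
        return y / (q_x * self.q_w)                  # de-quantize

layer = QuantLinearWrapper(nn.Linear(8, 4, bias=False))
print(layer(torch.randn(2, 8)).shape)
```
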
33 changes: 18 additions & 15 deletions docs/algo_quantization/index.html
@@ -167,21 +167,24 @@

<h1 id="quantization-algorithms">Quantization Algorithms</h1>
<h2 id="symmetric-linear-quantization">Symmetric Linear Quantization</h2>
<p><strong>Linear</strong> - float value quantized by multiplying with scale factor. <strong>Symmetric</strong> - no quantization bias (or "offset") used, so zero in the float domain is mapped to zero in the integer domain.<br />
In the current implementation the scale factor is chosen so that the entire range of the tensor is quantized. So, we get: (Using <script type="math/tex">q</script> to denote the scale factor and <script type="math/tex">x</script> to denote the tensor being quantized)
<script type="math/tex; mode=display">q_x = \frac{2^n-1}{\max|x|}</script>
<script type="math/tex; mode=display">x_q = round(q_x\cdot x_f)</script>
Where <script type="math/tex">n</script> is the number of bits used for quantization.<br />
For weights and bias the scale factor is determined once at quantization setup ("offline"), and for activations it is determined dynamically at runtime ("online").<br />
Currently, quantized implementations are provided for <strong>convolution</strong> and <strong>fully-connected</strong> layers. These layers are quantized as follows (using <script type="math/tex">x, y, w, b</script> for input, output, weights, bias respectively):
<script type="math/tex; mode=display">y_f = \sum{x_f\cdot w_f} + b_f = \sum{x_f\cdot w_f} + b_f = \sum{\frac{x_q}{q_x}\cdot \frac{w_q}{q_w}} + \frac{b_q}{q_b} = \frac{1}{q_x q_w} \sum{(x_q w_q + \frac{q_b}{q_x q_w}b_q)}</script>
<script type="math/tex; mode=display">y_q = round(q_y\cdot y_f) = round(\frac{q_y}{q_x q_w} \sum{(x_q w_q + \frac{q_b}{q_x q_w}b_q)})</script>
Note how the bias has to be re-scaled to match the scale of the summation.<br />
All other layers are executed in FP32. This is done by adding quantize and de-quantize operations at the beginning and end of the quantized layers. </p>
<div class="admonition note">
<p class="admonition-title">Note</p>
</div>
<p>This method is implemented as <strong>inference only</strong>, with no back-propagation functionality. Hence, it can only be used to quantize a pre-trained FP32 model, and as such, using it with <script type="math/tex">n < 8</script> is likely to lead to severe accuracy degradation for any non-trivial workload.</p>
<p>In this method, a float value is quantized by multiplying with a numeric constant (the <strong>scale factor</strong>), hence it is <strong>Linear</strong>. We use a signed integer to represent the quantized range, with no quantization bias (or "offset") used. As a result, the floating-point range considered for quantization is <strong>symmetric</strong> with respect to zero.<br />
In the current implementation the scale factor is chosen so that the entire range of the floating-point tensor is quantized (we do not attempt to remove outliers).<br />
Let us denote the original floating-point tensor by <script type="math/tex">x_f</script>, the quantized tensor by <script type="math/tex">x_q</script>, the scale factor by <script type="math/tex">q_x</script> and the number of bits used for quantization by <script type="math/tex">n</script>. Then, we get:
<script type="math/tex; mode=display">q_x = \frac{2^{n-1}-1}{\max|x_f|}</script>
<script type="math/tex; mode=display">x_q = round(q_x x_f)</script>
(The <script type="math/tex">round</script> operation is round-to-nearest-integer) </p>
<p>Let's see how a <strong>convolution</strong> or <strong>fully-connected (FC)</strong> layer is quantized using this method: (we denote input, output, weights and bias with <script type="math/tex">x, y, w</script> and <script type="math/tex">b</script> respectively)
<script type="math/tex; mode=display">y_f = \sum{x_f w_f} + b_f = \sum{\frac{x_q}{q_x} \frac{w_q}{q_w}} + \frac{b_q}{q_b} = \frac{1}{q_x q_w} \left(\sum{x_q w_q} + \frac{q_x q_w}{q_b}b_q\right)</script>
<script type="math/tex; mode=display">y_q = round(q_y y_f) = round\left(\frac{q_y}{q_x q_w} \left(\sum{x_q w_q} + \frac{q_x q_w}{q_b}b_q\right)\right)</script>
Note how the bias has to be re-scaled to match the scale of the summation.</p>
<h3 id="implementation">Implementation</h3>
<p>We've implemented <strong>convolution</strong> and <strong>FC</strong> using this method. </p>
<ul>
<li>They are implemented by wrapping the existing PyTorch layers with quantization and de-quantization operations. That is - the computation is done on floating-point tensors, but the values themselves are restricted to integer values. </li>
<li>All other layers are unaffected and are executed using their original FP32 implementation. </li>
<li>For weights and bias the scale factor is determined once at quantization setup ("offline"), and for activations it is determined dynamically at runtime ("online"). </li>
<li><strong>Important note:</strong> Currently, this method is implemented as <strong>inference only</strong>, with no back-propagation functionality. Hence, it can only be used to quantize a pre-trained FP32 model, with no re-training. As such, using it with <script type="math/tex">n < 8</script> is likely to lead to severe accuracy degradation for any non-trivial workload.</li>
</ul>

</div>
</div>
2 changes: 1 addition & 1 deletion docs/index.html
@@ -246,5 +246,5 @@ <h3 id="more-energy-efficient">More energy efficient</h3>

<!--
MkDocs version : 0.17.2
Build Date UTC : 2018-05-09 14:06:49
Build Date UTC : 2018-05-14 13:58:17
-->
