TinyML Model Optimization chapter #37

18jeffreyma · 2023-10-20T14:11:43Z

Missing items:

Checklist + Matt's feedback
Final pass of all figures, links and references.

Added Image file names + captions

Co-authored-by: Jeffrey Ma [email protected] Co-authored-by: Jayson Lin [email protected] Co-authored-by: Aghyad Deeb [email protected] Co-authored-by: Matthew Stewart [email protected] Co-authored-by: Costin-Andrei Oncescu [email protected] Co-authored-by: Vijay Janapa Reddi [email protected]

AditiR-42 · 2023-10-23T17:07:15Z

optimizations.qmd

+
+##### Hardware Efficiency
+
+Structured pruning often results in models that are more amenable to deployment on specialized hardware, such as Field-Programmable Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs), due to the regularity and simplicity of the pruned architecture. With reduced computational requirements, it translates to lower energy consumption, which is crucial for battery-powered devices and sustainable computing practices.


It would be helpful here to give a brief explanation of what FPGAs and ASICs are, or at least provide a reference to more information about them. Most people without hardware backgrounds would not understand the context of how FPGAs and ASICs relate to hardware efficiency.

Agreed! I found the following links from Arm helpful in understanding what FPGAs and ASICs are:

https://www.arm.com/glossary/fpga

https://www.arm.com/glossary/asic

We could also link these to the AI hardware chapter where these will be explained in more detail

sophiacho1 · 2023-10-28T18:58:00Z

optimizations.qmd

- Layer Fusion
- Node Elimination
- Graph Rewriting
+Returning to the context of Machine Learning (ML), quantization refers to the process of constraining the possible values that numerical parameters (such as weights and biases) can take to a discrete set, thereby reducing the precision of the parameters and consequently, the model's memory footprint. When properly implemented, quantization can reduce model size by up to 4x and improve inference latency and throughput by up to 2-3x. For example, an Image Classification model like ResNet-50 can be compressed from 96MB down to 24MB with 8-bit quantization.There is typically less than 1% loss in model accuracy from well tuned quantization. Accuracy can often be recovered by re-training the quantized model with quantization aware training techniques. Therefore, this technique has emerged to be very important in deploying ML models to resource-constrained environments, such as mobile devices, IoT devices, and edge computing platforms, where computational resources (memory and processing power) are limited.


Missing a space between "quantization" and "There".

sophiacho1 · 2023-10-28T19:24:44Z

optimizations.qmd

+![](images/efficientnumerics_100x.png)
+![](images/efficientnumerics_horowitz.png)
+Source: [https://ieeexplore.ieee.org/document/6757323](https://ieeexplore.ieee.org/document/6757323)
+![](images/efficientnumerics_int8vsfloat.png)


I think you also have this figure in the Quantization section. Is this intentional?

It is not, Great Catch!

sophiacho1 · 2023-10-29T00:49:19Z

optimizations.qmd

+
+Non-uniform quantization, on the other hand, does not maintain a consistent interval between quantized values. This approach might be used to allocate more possible discrete values in regions where the parameter values are more densely populated, thereby preserving more detail where it is most needed. For instance, in bell-shaped distributions of weights with long tails, a set of weights in a model predominantly lies within a certain range; thus, more quantization levels might be allocated to that range to preserve finer details, enabling us to better capture information. However, one major weakness of non-uniform quantization is that it requires dequantization before higher precision computations due to its non-uniformity, restricting its ability to accelerate computation compared to uniform quantization.
+
+Typically, a rule-based non-uniform quantization uses a logarithmic distribution of exponentially increasing steps and levels as opposed to linearly. Another popular branch lies in binary-code-based quantization where real number vectors are quantized into binary vectors with a scaling factor. Notably, there is no closed form solution for minimizing errors between the real value and non-uniformly quantized value, so most quantizations in this field rely on heuristic solutions. For instance, recent work formulates non-uniform quantization as an optimization problem where the quantization steps/levels in quantizer Q are adjusted to minimize the difference between the original tensor and quantized counterpart.


I think it'd be nice to have a citation/link of the "recent work" here! Also, I think a short explanation of the equation below would be helpful (like you did with the other equations above).

sophiacho1 · 2023-10-29T07:00:16Z

optimizations.qmd

+
+Symmetric quantization maps real values to a symmetrical clipping range centered around 0. This involves choosing a range [\alpha, \beta] where \alpha = -\beta. For example, one symmetrical range would be based on the min/max values of the real values such that: -\alpha = \beta = max(abs(r\_max), abs(r\_min)).
+
+Symmetric clipping ranges are the most widely adopted in practice as they have the advantage of easier implementation. In particular, the zeroing out of the zero point can lead to reduction in computational cost during inference ["Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation" (2023)]([https://arxiv.org/abs/2004.09602](https://arxiv.org/abs/2004.09602)) .


I wonder if it'd be worth explaining a bit what "zeroing out of the zero point" is.

sophiacho1 · 2023-10-29T07:01:59Z

optimizations.qmd

+
+_Illustration of symmetric quantization (left) and asymmetric quantization (right). Symmetric quantization maps real values to [-127, 127], and asymmetric maps to [-128, 127]. (Credit: __**A Survey of Quantization Methods for Efficient Neural Network Inference**__ )._
+
+### Granularity


I wonder if the Granularity section and the Static and Dynamic Quantization section should go under the Calibration section since they're addressing clipping ranges.

sophiacho1 · 2023-10-29T07:04:39Z

optimizations.qmd

+
+_Illustration of the main forms of quantization granularities. In layerwise quantization, the same clipping range is applied to all filters which belong to the same layer. Notice how this can result in lower quantization resolutions for channels with narrow distributions, e.g. Filter 1, Filter 2, and Filter C. A higher quantization resolution can be achieved using channelwise quantization which dedicates different clipping ranges to different channels. (Credit: __**A Survey of Quantization Methods for Efficient Neural Network Inference**__ )._
+
+1. Layerwise Quantization: This approach determines the clipping range by considering all of the weights in the convolutional filters of a layer. Then, the same clipping range is used for all convolutional filters. It's the simplest to implement, and, as such, it often results in sub-optimal accuracy due the wide variety of differing ranges between filters. For example, a convolutional kernel with a narrower range of parameters loses its quantization resolution due to another kernel in the same layer having a wider range. .


There's an extra period at the end of this paragraph.

sophiacho1 · 2023-10-29T07:07:27Z

optimizations.qmd

+
+The two prevailing techniques for quantizing models are Post Training Quantization and Quantization Aware Training.
+
+**Post Training Quantization** - Post-training quantization (PTQ) is a quantization technique where the model is quantized after it has been trained.The model is trained in floating point and then weights and activations are quantized as a post-processing step. This is the simplest approach and does not require access to the training data. Unlike Quantization-Aware Training (QAT), PTQ sets weight and activation quantization parameters directly, making it low-overhead and suitable for limited or unlabeled data situations. However, not readjusting the weights after quantizing, especially in low-precision quantization can lead to very different behavior and thus lower accuracy. To tackle this, techniques like bias correction, equalizing weight ranges, and adaptive rounding methods have been developed. PTQ can also be applied in zero-shot scenarios, where no training or testing data are available. This method has been made even more efficient to benefit compute- and memory- intensive large language models. Recently, SmoothQuant, a training-free, accuracy-preserving, and general-purpose PTQ solution which enables 8-bit weight, 8-bit activation quantization for LLMs, has been developed, demonstrating up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy [SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](2023)(https://arxiv.org/abs/2211.10438).


Missing a space before "The model is trained in floating point...".

sophiacho1 · 2023-10-29T07:37:03Z

optimizations.qmd

+
+**Post Training Quantization** - Post-training quantization (PTQ) is a quantization technique where the model is quantized after it has been trained.The model is trained in floating point and then weights and activations are quantized as a post-processing step. This is the simplest approach and does not require access to the training data. Unlike Quantization-Aware Training (QAT), PTQ sets weight and activation quantization parameters directly, making it low-overhead and suitable for limited or unlabeled data situations. However, not readjusting the weights after quantizing, especially in low-precision quantization can lead to very different behavior and thus lower accuracy. To tackle this, techniques like bias correction, equalizing weight ranges, and adaptive rounding methods have been developed. PTQ can also be applied in zero-shot scenarios, where no training or testing data are available. This method has been made even more efficient to benefit compute- and memory- intensive large language models. Recently, SmoothQuant, a training-free, accuracy-preserving, and general-purpose PTQ solution which enables 8-bit weight, 8-bit activation quantization for LLMs, has been developed, demonstrating up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy [SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](2023)(https://arxiv.org/abs/2211.10438).
+
+![](images/efficientnumerics_lecturenote.png) 


In the "Integer quantization" row, I think there should be a parenthesis instead of a 0 after "post-training".

sophiacho1 · 2023-10-29T07:43:54Z

optimizations.qmd

+
+Src: integer quantization for deep learning Inference: principles and empirical evaluations
+
+| **Feature/Technique** | **Post Training Quantization** | **Quantization Aware Training** | **Dynamic Quantization** |


I think this table is missing entries for the Pros, Cons, and Tradeoffs rows?

sophiacho1 · 2023-10-29T08:18:29Z

optimizations.qmd

+
+Quantization not only reduces model size but also enables faster computations and draws less power, making it vital to edge development. Edge devices typically have tight resource constraints with compute, memory, and power, which are impossible to meet for many of the deep NN models of today. Furthermore, edge processors do not support floating point operations, making integer quantization particularly important for chips like GAP-8, a RISC-=V SoC for edge inference with a dedicated CNN accelerator, which only support integer arithmetic..
+
+One hardware platform utilizing quantization is the ARM Cortex-M group of 32=bit RISC ARM processor cores. They leverage fixed-point quantization with power of two scaling factors so that quantization and dequantization can be efficiently done by bit shifting. Additionally, Google Edge TPUs, Google's emerging solution for running inference at the edge, is designed for small, low-powered devices and can only support 8-bit arithmetic. Recently, there has been significant strides in the computing power of edge processors, enabling the deployment and inference of costly NN models previously limited to servers.


Is it supposed to be "RISC-V SoC" and "32-bit RISC ARM" instead of "RISC-=V SoC" and "32=bit RISC ARM"?

sophiacho1 · 2023-10-29T08:25:24Z

optimizations.qmd

+
+## Efficient Hardware Implementation {#sec-model_ops_hw}
+
+Efficient hardware implementation transcends the selection of suitable components; it requires a holistic understanding of how software will interact with underlying architectures. The essence of achieving peak performance in TinyML applications lies not only in refining algorithms to hardware but also in ensuring that the hardware is strategically tailored to support these algorithms. This synergy between hardware and software is crucial. As we delve deeper into the intricacies of efficient hardware implementation, the significance of a co-design approach, where hardware and software are developed in tandem, becomes increasingly evident. This section provides an overview of the techniques of how hardware and the interactions between hardware and software can be optimized to improve models performance.


I think you might want to write "models' performance" instead of "models performance".

sophiacho1 · 2023-10-29T08:43:44Z

optimizations.qmd

+
+### Challenges of Hardware-Aware Neural Architecture Search
+
+While HW-NAS carries high potential for finding optimal architectures for TinyML, it comes with some challenges. Hardware Metrics like latency, energy consumption and hardware utilization are harder to evaluate than the metrics of accuracy or loss. They often require specilized tools for precise measurements. Moreover, adding all these metrics leads to a much bigger search space. This leads to HW-NAS being time-consuming and expensive. It has to be applied to every hardware for optimal results, moreover, meaning that if one needs to deploy the model on multiple devices, the search has to be conducted multiple times and will result in different models, unless optimizing for all of them which means less accuracy. Finally, hardware changes frequently, and HW-NAS may need to be conducted on each version.


Should be "specialized" instead of "specilized".

sophiacho1 · 2023-10-29T08:50:20Z

optimizations.qmd

+
+##### Loop unrolling
+
+Instead of having a loop with loop control (incrementing the loop counter, checking the loop termination condition) the loop can be unrolled and the overhead of loop control can be omitted. This may also provide additional opportunities for parallelism that may not be possible with the loop structure. This can be particularly beneficial for tight loops, where the boy of the loop is a small number of instructions with a lot of iterations.


I think it should be "body" instead of "boy"?

sophiacho1 · 2023-10-29T08:52:39Z

optimizations.qmd

+
+##### Tiling
+
+Similarly to blocking, tiling divides data and computation into chunks, but extends beyond cache improvements. Tiling creates independent partitions of computation that can be run in parallel, which can result in significant performance improvements.:


It looks like there's an extra colon at the end of this paragraph.

sophiacho1 · 2023-10-29T19:22:36Z

optimizations.qmd

+
+### Hardware-Aware Neural Architecture Search
+
+Focusing only on the accuracy when performing Neural Architecture Search leads to models that are exponentially complex and require increasing memory and compute. This has lead to hardware constraints limiting the exploitation of the deep learning models at their full potential. Manually designing the architecture of the model is even harder when considering the hardware variety and limitations. This has lead to the creation of Hardware-aware Neural Architecture Search that incorporate the hardware contractions into their search and optimize the search space for a specific hardware and accuracy. HW-NAS can be catogrized based how it optimizes for hardware. We will briefly explore these categories and leave links to related papers for the interested reader.


Should be "categorized" instead of "catogrized".

sophiacho1 · 2023-10-29T19:27:20Z

optimizations.qmd

+
+#### Multiple Targets
+
+This category aims at optimizing a single model for multiple hardwares. This can be helpful for mobile devices development as it can optimize to different phones models. [1](https://arxiv.org/abs/2008.08178)[2](https://ieeexplore.ieee.org/document/9102721)


I think "phone models" or "phones' models" would be better than "phones models."

sophiacho1 · 2023-10-29T19:35:01Z

optimizations.qmd

+
+Similarly, MorphNet is a neural network optimization framework designed to automatically reshape and morph the architecture of deep neural networks, optimizing them for specific deployment requirements. It achieves this through two steps: first, it leverages a set of customizable network morphing operations, such as widening or deepening layers, to dynamically adjust the network's structure. These operations enable the network to adapt to various computational constraints, including model size, latency, and accuracy targets, which are extremely prevalent in edge computing usage. In the second step, MorphNet uses a reinforcement learning-based approach to search for the optimal permutation of morphing operations, effectively balancing the trade-off between model size and performance. This innovative method allows deep learning practitioners to automatically tailor neural network architectures to specific application and hardware requirements, ensuring efficient and effective deployment across various platforms.
+
+TinyNAS and MorphNet represent a few of the many significant advancements in the field of systematic neural network optimization, allowing architectures to be systematically chosen and generated to fit perfectly within problem constraints.


I wonder if it'd be worth moving the bulk of the stuff about TinyNAS to the TinyNAS section under Efficient Hardware Implementation and just briefly mentioning it here.

sophiacho1 · 2023-10-29T19:38:20Z

optimizations.qmd

+
+#### General Kernel Optimizations
+
+These are kernel optimizations that all devices can benefit from. They provide technics to convert the code to more efficient instructions.


I think it should be "techniques" instead of "technics".

sophiacho1 · 2023-10-29T19:50:17Z

optimizations.qmd

+
+### Optimization Frameworks
+
+Optimization Frameworks have been introduced to exploit the specific capabilities of the hardware to accelerate the software. One example of such a framework is hls4ml. This open-source software-hardware co-design workflow aids in interpreting and translating machine learning algorithms for implementation with both FPGA and ASIC technologies, enhancing their. Features such as network optimization, new Python APIs, quantization-aware pruning, and end-to-end FPGA workflows are embedded into the hls4ml framework, leveraging parallel processing units, memory hierarchies, and specialized instruction sets to optimize models for edge hardware. Moreover, hls4ml is capable of translating machine learning algorithms directly into FPGA firmware.


I think this sentence is incomplete: "This open-source software-hardware co-design workflow aids in interpreting and translating machine learning algorithms for implementation with both FPGA and ASIC technologies, enhancing their."

sophiacho1 · 2023-10-29T20:02:46Z

optimizations.qmd

+
+### SplitNets
+
+SplitNets were introduced in the context of Head-Mounted systems. They distribute the Deep Neural Networks (DNNs) workload among camera sensors and an aggregator. This is particularly compelling the in context of TinyML. The SplitNet framework is a split-aware NAS to find the optimal neural network architecture to achieve good accuracy, split the model among the sensors and the aggregator, and minimize the communication between the sensors and the aggregator. Minimal communication is important in TinyML where memory is highly constrained, this way the sensors conduct some of the processing on their chips and then they send only the necessary information to the aggregator. When testing on ImageNet, SplitNets were able to reduce the latency by one order of magnitude on head-mounted devices. This can be helpful when the sensor has it's own chip. [1](https://arxiv.org/pdf/2204.04705.pdf)


Should be "its" instead of "it's".

sophiacho1 · 2023-10-29T20:38:20Z

optimizations.qmd

+
+We then explored efficient numerics representations, where we covered the basics of numerics, numeric encodings and storage, benefits of efficient numerics, and the nuances of numeric representation with memory usage, computational complexity, hardware compatibility, and tradeoff scenarios. We finished by honing in on an efficient numerics staple: quantization, where we examined its history, calibration, techniques, and interaction with pruning.
+
+Finally, we looked at how we can make optimizations specific to the hardware we have. We explored how we can find model architectures tailored to the hardware, make optimizations in the kernel to better handle the model, and frameworks built to make the most use out of the hardware. We also looked at how we can go the other way around and build hardware around our specific software and talked about splitting networks to run on multiple processor available on the edge device.


I think it should be "multiple processors" instead of "multiple processor".

sophiacho1 · 2023-10-29T20:40:20Z

optimizations.qmd

I learned a lot from this chapter! Awesome job y'all!

arnaumarin · 2023-10-30T01:28:19Z

optimizations.qmd


-## Model Compression {#sec-kd}
+Model pruning is especially useful when deploying machine learning models to devices with limited compute resources, such as mobile phones or TinyML systems. The technique facilitates the deployment of larger, more complex models on these devices by reducing their resource demands. Additionally, smaller models require less data to generalize well and are less prone to overfitting. By providing an efficient way to simplify models, model pruning has become a vital technique for optimizing neural networks in machine learning.


Overall the model pruning part of the chapter is very clear. I like it!
One small detail: "Model pruning is especially useful when deploying machine learning models to devices with limited compute resources, such as mobile phones or TinyML systems".

This sentence might give the impression that mobile phones and TinyML systems are on the same level of computational constraint. TinyML devices are usually far more resource-constrained than today's smartphones (i.e. in mobile phones we can easily do on-device training, while TinyML, right now, is not that easy.)

I would clarify the difference between the two, perhaps like: "Model pruning is especially useful for deployment on resource-constrained devices; it is indispensable for TinyML systems and can also offer advantages in mobile phone deployments, where it is easier, but still impactful". Ideally, you could even include this reference: https://www.tensorflow.org/lite/examples/on_device_training/overview

arnaumarin · 2023-10-30T01:32:33Z

optimizations.qmd

- Tensor decomposition methods
- Low-rank matrix factorization
- Learned approximations of weight matrices
+So how does one choose the type of pruning methods? Many variations of pruning techniques exist where each varies the heuristic of what should be kept and pruned from the model as well the number of times pruning occurs. Traditionally, pruning happens after the model is fully trained, where the pruned model may experience mild accuracy loss. However, as we will discuss further, recent discoveries have found that pruning can be used during training (i.e., iteratively) to identify more efficient and accurate model representations.


Following the previous comment I posted, again, this section is quite clear. But, this part could be rephrased for better clarity and structure. It's not clear how one would go about choosing a pruning method from this description.

I would mention that there is trade-offs. In my personal experience, I fully understood the potential of pruning in assignment 2 (pruning reduced the size of the model over 50%, and the accuracy barely changed!). I would mention those examples, if you could include a reference or a image, that would be even better.

A suggested re-phrased paragraph:
"Choosing a pruning method often involves trade-offs between simplicity, effectiveness, and the nature of your specific application. [...]"

That could introduce the topic of pruning easier, I think.

elizakimball · 2023-10-30T15:45:48Z

optimizations.qmd

+
+With **channel** pruning, which is predominantly applied in convolutional neural networks (CNNs), it involves eliminating entire channels or filters, which in turn reduces the depth of the feature maps and impacts the network's ability to extract certain features from the input data. This is particularly crucial in image processing tasks where computational efficiency is paramount.
+
+Finally, **layer** pruning takes a more aggressive approach by removing entire layers of the network. This significantly reduces the network's depth and thereby its capacity to model complex patterns and hierarchies in the data. This approach necessitates a careful balance to ensure that the model's predictive capability is not unduly compromised.


Since you mention filter pruning below, I recommend adding a section here dedicated to this kind of pruning.

elizakimball · 2023-10-30T15:53:39Z

optimizations.qmd

+
+##### Quality vs. Size Reduction
+
+A key challenge in both structured and unstructured pruning is balancing size reduction with maintaining or improving predictive performance. This trade-off becomes more complex with unstructured pruning, where individual weight removal can create sparse weight matrices. Ensuring the pruned model retains generalization capacity while becoming more computationally efficient is critical, often requiring extensive experimentation and validation.


While you've added a visualization for KD, consider adding more diagrams, especially in the "Quality vs. Size Reduction" section to visually explain concepts like sparse weight matrices.

elizakimball · 2023-10-30T15:55:06Z

optimizations.qmd

+
+##### Legal and Ethical Considerations
+
+Last but not least, adherence to legal and ethical guidelines is paramount, especially in domains with significant consequences. Both pruning methods must undergo rigorous validation, testing, and potentially certification processes to ensure compliance with relevant regulations and standards. This is especially important in use cases like medical AI applications or autonomous driving where quality drops due to pruning like optimizationscan be life threatening.


This section might benefit from a bit more elaboration, maybe a real-world example or a more specific case study where pruning led to unintended consequences.

elizakimball · 2023-10-30T15:59:55Z

optimizations.qmd

+
+The loss function is another critical component that typically amalgamates a distillation loss, which measures the divergence between the teacher and student outputs, and a classification loss, which ensures the student model adheres to the true data labels. The Kullback-Leibler (KL) divergence is commonly employed to quantify the distillation loss, providing a measure of the discrepancy between the probability distributions output by the teacher and student models.
+
+Another core concept is "temperature scaling" in the softmax function. It plays the role in controlling the granularity of the information distilled from the teacher model. A higher temperature parameter produces softer, more informative distributions, thereby facilitating the transfer of more nuanced knowledge to the student model. However, it also introduces the challenge of effectively balancing the trade-off between the informativeness of the soft targets and the stability of the training process.


It was a little confusing as I read this since you mention temperature-scaled softmax functions above. I recommend reconsidering the order of this section, so your readers can understand each line as it appears in the text.

elizakimball · 2023-10-30T16:02:05Z

optimizations.qmd

+
+Numerical data, the bedrock upon which machine learning models stand, manifest in two primary forms. These are integers and floating point numbers.
+
+**Integers** : Whole numbers, devoid of fractional components, integers (e.g., -3, 0, 42) are key in scenarios demanding discrete values. For instance, in ML, class labels in a classification task might be represented as integers, where "cat", "dog", and "bird" could be encoded as 0, 1, and 2 respectively.


I am under the impression that encoding "cat", "dog", and "bird as 0, 1, and 2 implies an order between these variables that doesn't exist. Thus, one-hot encoding is better in this scenario. If I am incorrect and you believe there is some benefit in encoding these categorical variables in this manner, please explain why this is the case.

jared-ni

great work!!

jared-ni · 2023-11-02T01:39:57Z

optimizations.qmd

+
+Efficient numerics is not just about reducing the bit-width of numbers but understanding the trade-offs between accuracy and efficiency. As machine learning models become more pervasive, especially in real-world, resource-constrained environments, the focus on efficient numerics will continue to grow. By thoughtfully selecting and leveraging the appropriate numeric precision, one can achieve robust model performance while optimizing for speed, memory, and energy.
+
+### Numeric Representation Nuances


I see that these are a lot of similarities between this chapter and the On-device Learning chapter! Awesome to see the connections.

jared-ni · 2023-11-02T01:43:08Z

optimizations.qmd

+- Computational Complexity: The models must process and analyze vast streams of market data with minimal latency, where even slight delays, potentially introduced by higher-precision numerics, can result in missed opportunities.
+- Precision and Accuracy Trade-offs: Financial computations often demand high numerical precision to ensure accurate pricing and risk assessments, posing challenges in balancing computational efficiency and numerical accuracy.
+
+##### Edge-Based Surveillance Systems


could refer to the on-device learning chapter for any potential on-device learning discussions!

jared-ni · 2023-11-02T01:43:34Z

optimizations.qmd

+
+As discussed, some precision in the real value is lost by quantization. In this case, the recovered value r ̃ will not exactly match r due to the rounding operation. This is an important tradeoff to note; however, in many successful uses of quantization, the loss of precision can be negligible and the test accuracy remains high. Despite this, uniform quantization continues to be the current de-facto choice due to its simplicity and efficient mapping to hardware.
+
+#### Non-uniform Quantization


Love the details when it comes to explaining quantization!

jared-ni · 2023-11-02T01:44:48Z

optimizations.qmd

+
+Of these, channelwise quantization is the current standard used for quantizing convolutional kernels, since it enables the adjustment of clipping ranges for each individual kernel with negligible overhead.
+
+### Static and Dynamic Quantization


I didn't know there are so many quantization techniques. This is awesome and great to learn about. It'd be super cool to see a graphical representation of how to visualize the different types of quantization for later improvements of the textbook. It might be a lot of work to visualize these techniques, though, if there are no existing sources online.

Aghyad Deeb and others added 29 commits October 18, 2023 20:00

Commit for authors

eaf1587

first draft

9d1b2ce

adding images and efficient representation

198750f

updated optimizations

50e7a33

added references

7f392e3

added three floating point formats img

1a0e750

removed excess description

5be9279

Update optimizations.qmd

c887fe7

Added Image file names + captions

Uploaded Images

26529c4

Fixed images directories

59b9dde

Fixed formatting in Efficient Hardware Implementation

dd215e5

More formatting fixes on Efficient Hardware Implementation

20470be

Added the efficient hardware part of conclusion

ab18866

Update optimizations.qmd

35d2293

attempt to fix tabbing

5037553

merge

ae11a3b

added images for efficient numerics

eb3b0e7

Merge branch 'main' of github.com:18jeffreyma/cs249r_book

5cf5537

Updated reference links

ca47d12

Update optimizations.qmd

8375c57

Adding images

d897dff

Merge branch 'main' of github.com:18jeffreyma/cs249r_book

5e7124e

Merge branch 'main' of github.com:18jeffreyma/cs249r_book

d9fa20d

add quantization hyperlink

34c79a9

fixed refs and images for hw optimization

6d91755

fixed refs and images in HW section

9f1a94c

Fixed table in QAT section

e6bc892

merge

d884d20

AditiR-42 reviewed Oct 23, 2023

View reviewed changes

sophiacho1 reviewed Oct 28, 2023

View reviewed changes

sophiacho1 reviewed Oct 29, 2023

View reviewed changes

arnaumarin reviewed Oct 30, 2023

View reviewed changes

elizakimball suggested changes Oct 30, 2023

View reviewed changes

mpstewart1 and others added 2 commits October 30, 2023 14:01

Merging changes from main book

250c462

Update readme and contributors.qmd with contributors

3d4d066

mpstewart1 merged commit afa03f3 into harvard-edge:main Oct 30, 2023

jared-ni reviewed Nov 2, 2023

View reviewed changes

uchendui added the cs249r label Nov 7, 2023


		##### Hardware Efficiency

		Structured pruning often results in models that are more amenable to deployment on specialized hardware, such as Field-Programmable Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs), due to the regularity and simplicity of the pruned architecture. With reduced computational requirements, it translates to lower energy consumption, which is crucial for battery-powered devices and sustainable computing practices.


		Non-uniform quantization, on the other hand, does not maintain a consistent interval between quantized values. This approach might be used to allocate more possible discrete values in regions where the parameter values are more densely populated, thereby preserving more detail where it is most needed. For instance, in bell-shaped distributions of weights with long tails, a set of weights in a model predominantly lies within a certain range; thus, more quantization levels might be allocated to that range to preserve finer details, enabling us to better capture information. However, one major weakness of non-uniform quantization is that it requires dequantization before higher precision computations due to its non-uniformity, restricting its ability to accelerate computation compared to uniform quantization.

		Typically, a rule-based non-uniform quantization uses a logarithmic distribution of exponentially increasing steps and levels as opposed to linearly. Another popular branch lies in binary-code-based quantization where real number vectors are quantized into binary vectors with a scaling factor. Notably, there is no closed form solution for minimizing errors between the real value and non-uniformly quantized value, so most quantizations in this field rely on heuristic solutions. For instance, recent work formulates non-uniform quantization as an optimization problem where the quantization steps/levels in quantizer Q are adjusted to minimize the difference between the original tensor and quantized counterpart.


		Symmetric quantization maps real values to a symmetrical clipping range centered around 0. This involves choosing a range [\alpha, \beta] where \alpha = -\beta. For example, one symmetrical range would be based on the min/max values of the real values such that: -\alpha = \beta = max(abs(r\_max), abs(r\_min)).

		Symmetric clipping ranges are the most widely adopted in practice as they have the advantage of easier implementation. In particular, the zeroing out of the zero point can lead to reduction in computational cost during inference ["Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation" (2023)]([https://arxiv.org/abs/2004.09602](https://arxiv.org/abs/2004.09602)) .


		_Illustration of symmetric quantization (left) and asymmetric quantization (right). Symmetric quantization maps real values to [-127, 127], and asymmetric maps to [-128, 127]. (Credit: __A Survey of Quantization Methods for Efficient Neural Network Inference__ )._

		### Granularity


		_Illustration of the main forms of quantization granularities. In layerwise quantization, the same clipping range is applied to all filters which belong to the same layer. Notice how this can result in lower quantization resolutions for channels with narrow distributions, e.g. Filter 1, Filter 2, and Filter C. A higher quantization resolution can be achieved using channelwise quantization which dedicates different clipping ranges to different channels. (Credit: __A Survey of Quantization Methods for Efficient Neural Network Inference__ )._

		1. Layerwise Quantization: This approach determines the clipping range by considering all of the weights in the convolutional filters of a layer. Then, the same clipping range is used for all convolutional filters. It's the simplest to implement, and, as such, it often results in sub-optimal accuracy due the wide variety of differing ranges between filters. For example, a convolutional kernel with a narrower range of parameters loses its quantization resolution due to another kernel in the same layer having a wider range. .


		The two prevailing techniques for quantizing models are Post Training Quantization and Quantization Aware Training.

		Post Training Quantization - Post-training quantization (PTQ) is a quantization technique where the model is quantized after it has been trained.The model is trained in floating point and then weights and activations are quantized as a post-processing step. This is the simplest approach and does not require access to the training data. Unlike Quantization-Aware Training (QAT), PTQ sets weight and activation quantization parameters directly, making it low-overhead and suitable for limited or unlabeled data situations. However, not readjusting the weights after quantizing, especially in low-precision quantization can lead to very different behavior and thus lower accuracy. To tackle this, techniques like bias correction, equalizing weight ranges, and adaptive rounding methods have been developed. PTQ can also be applied in zero-shot scenarios, where no training or testing data are available. This method has been made even more efficient to benefit compute- and memory- intensive large language models. Recently, SmoothQuant, a training-free, accuracy-preserving, and general-purpose PTQ solution which enables 8-bit weight, 8-bit activation quantization for LLMs, has been developed, demonstrating up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy [SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](2023)(https://arxiv.org/abs/2211.10438).


		Post Training Quantization - Post-training quantization (PTQ) is a quantization technique where the model is quantized after it has been trained.The model is trained in floating point and then weights and activations are quantized as a post-processing step. This is the simplest approach and does not require access to the training data. Unlike Quantization-Aware Training (QAT), PTQ sets weight and activation quantization parameters directly, making it low-overhead and suitable for limited or unlabeled data situations. However, not readjusting the weights after quantizing, especially in low-precision quantization can lead to very different behavior and thus lower accuracy. To tackle this, techniques like bias correction, equalizing weight ranges, and adaptive rounding methods have been developed. PTQ can also be applied in zero-shot scenarios, where no training or testing data are available. This method has been made even more efficient to benefit compute- and memory- intensive large language models. Recently, SmoothQuant, a training-free, accuracy-preserving, and general-purpose PTQ solution which enables 8-bit weight, 8-bit activation quantization for LLMs, has been developed, demonstrating up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy [SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](2023)(https://arxiv.org/abs/2211.10438).

		![](images/efficientnumerics_lecturenote.png)


		Src: integer quantization for deep learning Inference: principles and empirical evaluations

		\| Feature/Technique \| Post Training Quantization \| Quantization Aware Training \| Dynamic Quantization \|


		Quantization not only reduces model size but also enables faster computations and draws less power, making it vital to edge development. Edge devices typically have tight resource constraints with compute, memory, and power, which are impossible to meet for many of the deep NN models of today. Furthermore, edge processors do not support floating point operations, making integer quantization particularly important for chips like GAP-8, a RISC-=V SoC for edge inference with a dedicated CNN accelerator, which only support integer arithmetic..

		One hardware platform utilizing quantization is the ARM Cortex-M group of 32=bit RISC ARM processor cores. They leverage fixed-point quantization with power of two scaling factors so that quantization and dequantization can be efficiently done by bit shifting. Additionally, Google Edge TPUs, Google's emerging solution for running inference at the edge, is designed for small, low-powered devices and can only support 8-bit arithmetic. Recently, there has been significant strides in the computing power of edge processors, enabling the deployment and inference of costly NN models previously limited to servers.


		## Efficient Hardware Implementation {#sec-model_ops_hw}

		Efficient hardware implementation transcends the selection of suitable components; it requires a holistic understanding of how software will interact with underlying architectures. The essence of achieving peak performance in TinyML applications lies not only in refining algorithms to hardware but also in ensuring that the hardware is strategically tailored to support these algorithms. This synergy between hardware and software is crucial. As we delve deeper into the intricacies of efficient hardware implementation, the significance of a co-design approach, where hardware and software are developed in tandem, becomes increasingly evident. This section provides an overview of the techniques of how hardware and the interactions between hardware and software can be optimized to improve models performance.


		### Challenges of Hardware-Aware Neural Architecture Search

		While HW-NAS carries high potential for finding optimal architectures for TinyML, it comes with some challenges. Hardware Metrics like latency, energy consumption and hardware utilization are harder to evaluate than the metrics of accuracy or loss. They often require specilized tools for precise measurements. Moreover, adding all these metrics leads to a much bigger search space. This leads to HW-NAS being time-consuming and expensive. It has to be applied to every hardware for optimal results, moreover, meaning that if one needs to deploy the model on multiple devices, the search has to be conducted multiple times and will result in different models, unless optimizing for all of them which means less accuracy. Finally, hardware changes frequently, and HW-NAS may need to be conducted on each version.


		##### Loop unrolling

		Instead of having a loop with loop control (incrementing the loop counter, checking the loop termination condition) the loop can be unrolled and the overhead of loop control can be omitted. This may also provide additional opportunities for parallelism that may not be possible with the loop structure. This can be particularly beneficial for tight loops, where the boy of the loop is a small number of instructions with a lot of iterations.


		##### Tiling

		Similarly to blocking, tiling divides data and computation into chunks, but extends beyond cache improvements. Tiling creates independent partitions of computation that can be run in parallel, which can result in significant performance improvements.:


		### Hardware-Aware Neural Architecture Search

		Focusing only on the accuracy when performing Neural Architecture Search leads to models that are exponentially complex and require increasing memory and compute. This has lead to hardware constraints limiting the exploitation of the deep learning models at their full potential. Manually designing the architecture of the model is even harder when considering the hardware variety and limitations. This has lead to the creation of Hardware-aware Neural Architecture Search that incorporate the hardware contractions into their search and optimize the search space for a specific hardware and accuracy. HW-NAS can be catogrized based how it optimizes for hardware. We will briefly explore these categories and leave links to related papers for the interested reader.


		#### Multiple Targets

		This category aims at optimizing a single model for multiple hardwares. This can be helpful for mobile devices development as it can optimize to different phones models. [1](https://arxiv.org/abs/2008.08178)[2](https://ieeexplore.ieee.org/document/9102721)


		Similarly, MorphNet is a neural network optimization framework designed to automatically reshape and morph the architecture of deep neural networks, optimizing them for specific deployment requirements. It achieves this through two steps: first, it leverages a set of customizable network morphing operations, such as widening or deepening layers, to dynamically adjust the network's structure. These operations enable the network to adapt to various computational constraints, including model size, latency, and accuracy targets, which are extremely prevalent in edge computing usage. In the second step, MorphNet uses a reinforcement learning-based approach to search for the optimal permutation of morphing operations, effectively balancing the trade-off between model size and performance. This innovative method allows deep learning practitioners to automatically tailor neural network architectures to specific application and hardware requirements, ensuring efficient and effective deployment across various platforms.

		TinyNAS and MorphNet represent a few of the many significant advancements in the field of systematic neural network optimization, allowing architectures to be systematically chosen and generated to fit perfectly within problem constraints.


		#### General Kernel Optimizations

		These are kernel optimizations that all devices can benefit from. They provide technics to convert the code to more efficient instructions.


		### Optimization Frameworks

		Optimization Frameworks have been introduced to exploit the specific capabilities of the hardware to accelerate the software. One example of such a framework is hls4ml. This open-source software-hardware co-design workflow aids in interpreting and translating machine learning algorithms for implementation with both FPGA and ASIC technologies, enhancing their. Features such as network optimization, new Python APIs, quantization-aware pruning, and end-to-end FPGA workflows are embedded into the hls4ml framework, leveraging parallel processing units, memory hierarchies, and specialized instruction sets to optimize models for edge hardware. Moreover, hls4ml is capable of translating machine learning algorithms directly into FPGA firmware.


		### SplitNets

		SplitNets were introduced in the context of Head-Mounted systems. They distribute the Deep Neural Networks (DNNs) workload among camera sensors and an aggregator. This is particularly compelling the in context of TinyML. The SplitNet framework is a split-aware NAS to find the optimal neural network architecture to achieve good accuracy, split the model among the sensors and the aggregator, and minimize the communication between the sensors and the aggregator. Minimal communication is important in TinyML where memory is highly constrained, this way the sensors conduct some of the processing on their chips and then they send only the necessary information to the aggregator. When testing on ImageNet, SplitNets were able to reduce the latency by one order of magnitude on head-mounted devices. This can be helpful when the sensor has it's own chip. [1](https://arxiv.org/pdf/2204.04705.pdf)


		We then explored efficient numerics representations, where we covered the basics of numerics, numeric encodings and storage, benefits of efficient numerics, and the nuances of numeric representation with memory usage, computational complexity, hardware compatibility, and tradeoff scenarios. We finished by honing in on an efficient numerics staple: quantization, where we examined its history, calibration, techniques, and interaction with pruning.

		Finally, we looked at how we can make optimizations specific to the hardware we have. We explored how we can find model architectures tailored to the hardware, make optimizations in the kernel to better handle the model, and frameworks built to make the most use out of the hardware. We also looked at how we can go the other way around and build hardware around our specific software and talked about splitting networks to run on multiple processor available on the edge device.

TinyML Model Optimization chapter #37

TinyML Model Optimization chapter #37

Conversation

18jeffreyma commented Oct 20, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sophiacho1 Oct 29, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sophiacho1 Oct 29, 2023 • edited Loading

Choose a reason for hiding this comment

arnaumarin Oct 30, 2023 • edited Loading

Choose a reason for hiding this comment

arnaumarin Oct 30, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jared-ni left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sophiacho1 Oct 29, 2023 •

edited

Loading

sophiacho1 Oct 29, 2023 •

edited

Loading

arnaumarin Oct 30, 2023 •

edited

Loading

arnaumarin Oct 30, 2023 •

edited

Loading


		## Model Compression {#sec-kd}
		Model pruning is especially useful when deploying machine learning models to devices with limited compute resources, such as mobile phones or TinyML systems. The technique facilitates the deployment of larger, more complex models on these devices by reducing their resource demands. Additionally, smaller models require less data to generalize well and are less prone to overfitting. By providing an efficient way to simplify models, model pruning has become a vital technique for optimizing neural networks in machine learning.


		With channel pruning, which is predominantly applied in convolutional neural networks (CNNs), it involves eliminating entire channels or filters, which in turn reduces the depth of the feature maps and impacts the network's ability to extract certain features from the input data. This is particularly crucial in image processing tasks where computational efficiency is paramount.

		Finally, layer pruning takes a more aggressive approach by removing entire layers of the network. This significantly reduces the network's depth and thereby its capacity to model complex patterns and hierarchies in the data. This approach necessitates a careful balance to ensure that the model's predictive capability is not unduly compromised.


		##### Quality vs. Size Reduction

		A key challenge in both structured and unstructured pruning is balancing size reduction with maintaining or improving predictive performance. This trade-off becomes more complex with unstructured pruning, where individual weight removal can create sparse weight matrices. Ensuring the pruned model retains generalization capacity while becoming more computationally efficient is critical, often requiring extensive experimentation and validation.


		##### Legal and Ethical Considerations

		Last but not least, adherence to legal and ethical guidelines is paramount, especially in domains with significant consequences. Both pruning methods must undergo rigorous validation, testing, and potentially certification processes to ensure compliance with relevant regulations and standards. This is especially important in use cases like medical AI applications or autonomous driving where quality drops due to pruning like optimizationscan be life threatening.


		The loss function is another critical component that typically amalgamates a distillation loss, which measures the divergence between the teacher and student outputs, and a classification loss, which ensures the student model adheres to the true data labels. The Kullback-Leibler (KL) divergence is commonly employed to quantify the distillation loss, providing a measure of the discrepancy between the probability distributions output by the teacher and student models.

		Another core concept is "temperature scaling" in the softmax function. It plays the role in controlling the granularity of the information distilled from the teacher model. A higher temperature parameter produces softer, more informative distributions, thereby facilitating the transfer of more nuanced knowledge to the student model. However, it also introduces the challenge of effectively balancing the trade-off between the informativeness of the soft targets and the stability of the training process.


		Numerical data, the bedrock upon which machine learning models stand, manifest in two primary forms. These are integers and floating point numbers.

		Integers : Whole numbers, devoid of fractional components, integers (e.g., -3, 0, 42) are key in scenarios demanding discrete values. For instance, in ML, class labels in a classification task might be represented as integers, where "cat", "dog", and "bird" could be encoded as 0, 1, and 2 respectively.


		Efficient numerics is not just about reducing the bit-width of numbers but understanding the trade-offs between accuracy and efficiency. As machine learning models become more pervasive, especially in real-world, resource-constrained environments, the focus on efficient numerics will continue to grow. By thoughtfully selecting and leveraging the appropriate numeric precision, one can achieve robust model performance while optimizing for speed, memory, and energy.

		### Numeric Representation Nuances


		As discussed, some precision in the real value is lost by quantization. In this case, the recovered value r ̃ will not exactly match r due to the rounding operation. This is an important tradeoff to note; however, in many successful uses of quantization, the loss of precision can be negligible and the test accuracy remains high. Despite this, uniform quantization continues to be the current de-facto choice due to its simplicity and efficient mapping to hardware.

		#### Non-uniform Quantization


		Of these, channelwise quantization is the current standard used for quantizing convolutional kernels, since it enables the adjustment of clipping ranges for each individual kernel with negligible overhead.

		### Static and Dynamic Quantization