TinyML Model Optimization chapter #37
Merged
Changes from 1 commit (33 commits total)

Commits
eaf1587  Commit for authors
9d1b2ce  first draft (18jeffreyma)
198750f  adding images and efficient representation (18jeffreyma)
50e7a33  updated optimizations (18jeffreyma)
7f392e3  added references (18jeffreyma)
1a0e750  added three floating point formats img (jaysonzlin)
5be9279  removed excess description (jaysonzlin)
c887fe7  Update optimizations.qmd (aptl26)
26529c4  Uploaded Images (aptl26)
59b9dde  Fixed images directories (aptl26)
dd215e5  Fixed formatting in Efficient Hardware Implementation (aptl26)
20470be  More formatting fixes on Efficient Hardware Implementation (aptl26)
ab18866  Added the efficient hardware part of conclusion (aptl26)
35d2293  Update optimizations.qmd (jaysonzlin)
5037553  attempt to fix tabbing (18jeffreyma)
ae11a3b  merge (18jeffreyma)
eb3b0e7  added images for efficient numerics (jaysonzlin)
5cf5537  Merge branch 'main' of github.com:18jeffreyma/cs249r_book (jaysonzlin)
ca47d12  Updated reference links (profvjreddi)
8375c57  Update optimizations.qmd (profvjreddi)
d897dff  Adding images (jaysonzlin)
5e7124e  Merge branch 'main' of github.com:18jeffreyma/cs249r_book (jaysonzlin)
4d6a543  Add co-authors (18jeffreyma)
d9fa20d  Merge branch 'main' of github.com:18jeffreyma/cs249r_book (18jeffreyma)
34c79a9  add quantization hyperlink (18jeffreyma)
6d91755  fixed refs and images for hw optimization
9f1a94c  fixed refs and images in HW section
e6bc892  Fixed table in QAT section (jaysonzlin)
d884d20  merge (jaysonzlin)
f2beb68  Merge branch 'harvard-edge:main' into main (18jeffreyma)
e57cd67  Update readme and contributors.qmd with contributors (github-actions[bot])
250c462  Merging changes from main book (mpstewart1)
3d4d066  Update readme and contributors.qmd with contributors (github-actions[bot])

fixed refs and images for hw optimization
@@ -0,0 +1,4 @@
+ - source: project
+   quarto-pub:
+     - id: 95f24e14-fabd-4f8e-9540-3ddcdbee8451
+       url: 'https://aptl26.quarto.pub/machine-learning-systems'
@@ -407,7 +407,7 @@ ML models are increasingly being utilized in scientific simulations, such as cli

These examples illustrate diverse scenarios where the challenges of numerics representation in ML models are prominently manifested. Each system presents a unique set of requirements and constraints, necessitating tailored strategies and solutions to navigate the challenges of memory usage, computational complexity, precision-accuracy trade-offs, and hardware compatibility.

- ## Quantization
+ ## Quantization {#sec-quantization}

Source: [https://arxiv.org/abs/2004.09602](https://arxiv.org/abs/2004.09602)
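As a quick illustration of the idea behind this section heading, here is a minimal NumPy sketch of post-training affine int8 quantization. The max-magnitude calibration rule and the function names are illustrative assumptions, not taken from the chapter or the cited survey.

```python
import numpy as np

def quantize_int8(x, scale=None, zero_point=0):
    """Affine quantization: q = round(x / scale) + zero_point, clipped to int8 range."""
    if scale is None:
        # simple symmetric calibration: map the largest magnitude to 127
        scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale, zero_point=0):
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.abs(weights - dequantize_int8(q, scale)).mean()
print(f"scale={scale:.5f}, mean abs quantization error={error:.5f}")
```

The int8 tensor occupies a quarter of the float32 storage, at the cost of the reconstruction error printed above.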
@@ -666,7 +666,10 @@ Efficient hardware implementation transcends the selection of suitable component

## Hardware-Aware Neural Architecture Search

- Focusing only on the accuracy when performing Neural Architecture Search leads to models that are exponentially complex and require increasing memory and compute. This has lead to hardware constraints limiting the exploitation of the deep learning models at their full potential. Manually designing the architecture of the model is even harder when considering the hardware variety and limitations. This has lead to the creation of Hardware-aware Neural Architecture Search that incorporate the hardware contractions into their search and optimize the search space for a specific hardware and accuracy. HW-NAS can be catogrized based how it optimizes for hardware. We will briefly explore these categories and leave links to related papers for the interested reader. ![Taxonomy of HW-NAS [1](https://www.ijcai.org/proceedings/2021/592)](images/modeloptimization_HW-NAS.png).
+ Focusing only on accuracy when performing Neural Architecture Search leads to models that are exponentially complex and require increasing memory and compute. This has led to hardware constraints limiting the exploitation of deep learning models to their full potential. Manually designing the model architecture is even harder when considering the variety and limitations of the target hardware. This has led to the creation of Hardware-aware Neural Architecture Search (HW-NAS), which incorporates hardware constraints into the search and optimizes the search space for a specific hardware target and accuracy. HW-NAS can be categorized based on how it optimizes for hardware. We will briefly explore these categories and leave links to related papers for the interested reader.
Review comment: Should be "categorized" instead of "catogrized".

+ ![Taxonomy of HW-NAS [1](https://www.ijcai.org/proceedings/2021/592)](images/modeloptimization_HW-NAS.png)
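To make the idea above concrete, here is a toy random-search sketch in which candidates that violate a memory budget are rejected outright and the remaining candidates are ranked by an accuracy score with a latency penalty. The search space, cost proxies, and budget are all hypothetical; real HW-NAS systems use measured or modeled hardware costs and trained accuracy predictors.

```python
import random

def sample_architecture():
    # hypothetical search space: depth, width, input resolution
    return {"depth": random.choice([4, 8, 12]),
            "width": random.choice([16, 32, 64]),
            "resolution": random.choice([96, 128, 160])}

def estimate_params_kb(arch):
    # crude proxy for model size; stands in for a real cost model
    return arch["depth"] * arch["width"] ** 2 * 4 / 1024

def estimate_latency_ms(arch):
    return 0.01 * arch["depth"] * arch["width"] * (arch["resolution"] / 96) ** 2

def estimate_accuracy(arch):
    # stand-in for a trained or predicted accuracy signal
    return 0.5 + 0.04 * arch["depth"] ** 0.5 + 0.002 * arch["width"] + 0.0005 * arch["resolution"]

MEMORY_BUDGET_KB = 128  # hypothetical on-device memory budget

best, best_score = None, float("-inf")
for _ in range(1000):
    arch = sample_architecture()
    if estimate_params_kb(arch) > MEMORY_BUDGET_KB:
        continue  # hard hardware constraint: reject infeasible candidates
    score = estimate_accuracy(arch) - 0.05 * estimate_latency_ms(arch)  # soft latency penalty
    if score > best_score:
        best, best_score = arch, score

print(best, round(best_score, 3))
```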

### Single Target, Fixed Platform Configuration
@@ -698,9 +701,11 @@ TinyNAS adopts a two stage approach to finding an optimal architecture for model

First, TinyNAS generates multiple search spaces by varying the input resolution of the model and the number of channels of its layers. Then, TinyNAS chooses a search space based on the FLOPs (floating point operations) of each search space.

- Then, TinyNAS performs a search operation on the chosen space to find the optimal architecture for the specific constraints of the microcontroller. ![A diagram showing how search spaces with high probability of finding an architecture with large number of FLOPs provide models with higher accuracy [1](https://arxiv.org/abs/2007.10319)](images/modeloptimization_TinyNAS.png) [1](https://arxiv.org/abs/2007.10319)
+ Then, TinyNAS performs a search operation on the chosen space to find the optimal architecture for the specific constraints of the microcontroller. [1](https://arxiv.org/abs/2007.10319)

+ ![A diagram showing how search spaces with a high probability of finding an architecture with a large number of FLOPs provide models with higher accuracy [1](https://arxiv.org/abs/2007.10319)](images/modeloptimization_TinyNAS.png)
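A schematic sketch of the first stage described above: several candidate search spaces are generated by varying resolution and width multiplier, and one is chosen from the FLOPs of models sampled inside it. The FLOPs proxy and budget are invented for illustration and are not taken from the TinyNAS paper.

```python
import itertools
import random

# Candidate search-space knobs: input resolutions and width (channel) multipliers.
RESOLUTIONS = [96, 112, 128, 144, 160]
WIDTH_MULTS = [0.3, 0.4, 0.5, 0.6, 0.7]

def sample_flops(resolution, width_mult, n=500):
    """Very rough per-inference FLOPs proxy for random sub-networks in one search space."""
    samples = []
    for _ in range(n):
        depth = random.randint(10, 20)  # random sub-network depth
        samples.append(depth * (resolution ** 2) * (width_mult ** 2) * 1e3)
    return samples

FLOPS_BUDGET = 60e6  # hypothetical budget the microcontroller can sustain

best_space, best_mean = None, -1.0
for res, wm in itertools.product(RESOLUTIONS, WIDTH_MULTS):
    flops = sample_flops(res, wm)
    feasible = [f for f in flops if f <= FLOPS_BUDGET]
    if not feasible:
        continue
    # prefer spaces whose feasible models make the most of the budget
    mean_feasible = sum(feasible) / len(feasible)
    if mean_feasible > best_mean:
        best_space, best_mean = (res, wm), mean_feasible

print("chosen search space (resolution, width multiplier):", best_space)
```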

- ### Topology-Aware NAS
+ #### Topology-Aware NAS

Focuses on creating and optimizing a search space that aligns with the hardware topology of the device. [1](https://arxiv.org/pdf/1911.09251.pdf)
@@ -736,15 +741,17 @@ This comprises developing optimized kernels that take full advantage of a specif

This is one example of Algorithm-Hardware Co-design. CiM is a computing paradigm that performs computation within memory. Therefore, CiM architectures allow for operations to be performed directly on the stored data, without the need to shuttle data back and forth between separate processing and memory units. This design paradigm is particularly beneficial in scenarios where data movement is a primary source of energy consumption and latency, such as in TinyML applications on edge devices. Through algorithm-hardware co-design, the algorithms can be optimized to leverage the unique characteristics of CiM architectures, and conversely, the CiM hardware can be customized or configured to better support the computational requirements and characteristics of the algorithms. This is achieved by using the analog properties of memory cells, such as addition and multiplication in DRAM. [1](https://arxiv.org/abs/2111.06503)

- ![A figure showing how Computing in Memory can be used for always-on tasks to offload tasks of the power consuming processing unit [1](https://arxiv.org/abs/2111.06503)
- ](images/modeloptimization_CiM.png)
+ ![A figure showing how Computing in Memory can be used for always-on tasks to offload work from the power-consuming processing unit [1](https://arxiv.org/abs/2111.06503)](images/modeloptimization_CiM.png)

## Memory Access Optimization

Different devices may have different memory hierarchies. Optimizing for the specific memory hierarchy of the specific hardware can lead to great performance improvements by reducing the costly operations of reading from and writing to memory. Dataflow optimization can be achieved by optimizing for reusing data within a single layer and across multiple layers. This dataflow optimization can be tailored to the specific memory hierarchy of the hardware, which can lead to greater benefits than general optimizations across different hardware.
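As a minimal sketch of the data-reuse idea, here is a blocked matrix multiplication in NumPy: each tile of the inputs is brought in once and reused for a whole block of the output, which is the same pattern a dataflow optimizer maps onto an on-chip buffer. The tile size is an arbitrary stand-in for the capacity of the fastest memory level.

```python
import numpy as np

def tiled_matmul(a, b, tile=32):
    """Blocked matrix multiply: each (tile x tile) block of A and B is loaded once
    and reused for a whole block of C, mimicking reuse in a small on-chip buffer."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    c = np.zeros((n, m), dtype=a.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                c[i0:i0+tile, j0:j0+tile] += a[i0:i0+tile, k0:k0+tile] @ b[k0:k0+tile, j0:j0+tile]
    return c

a = np.random.rand(128, 128).astype(np.float32)
b = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)
```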

## Leveraging Sparsity

- Pruning is a fundamental approach to compress models to make them compatible with resource constrained devices. This results in sparse models where a lot of weights are 0's. Therefore, leveraging this sparsity can lead to significant improvements in performance. Tools were created to achieve exactly this. RAMAN, is a sparseTinyML accelerator designed for inference on edge devices. RAMAN overlap input and output activations on the same memory space, reducing storage requirements by up to 50%. [1](https://ar5iv.labs.arxiv.org/html/2306.06493) ![In this figure, the sparse columns of the filter matrix of a CNN are aggregated to create a dense matrix that, leading to smaller dimensions in the matrix and more efficient computations[1](https://arxiv.org/abs/1811.04770)](images/modeloptimization_sparsity.png)
+ Pruning is a fundamental approach to compressing models so that they are compatible with resource-constrained devices. It results in sparse models in which many weights are zero, so leveraging this sparsity can lead to significant improvements in performance. Tools were created to achieve exactly this. RAMAN is a sparse TinyML accelerator designed for inference on edge devices. RAMAN overlaps input and output activations in the same memory space, reducing storage requirements by up to 50%. [1](https://ar5iv.labs.arxiv.org/html/2306.06493)

+ ![A figure showing how the sparse columns of the filter matrix of a CNN are aggregated into a dense matrix, leading to smaller matrix dimensions and more efficient computations [1](https://arxiv.org/abs/1811.04770)](images/modeloptimization_sparsity.png)
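A small SciPy sketch of how the sparsity left behind by pruning can be exploited: magnitude-prune a weight matrix, keep only the surviving weights in compressed sparse row (CSR) form, and multiply in that format. The matrix size and 90% sparsity level are arbitrary; this illustrates the general idea, not RAMAN's specific scheme.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)

# Magnitude pruning: zero out the 90% of weights with the smallest absolute value.
threshold = np.quantile(np.abs(w), 0.9)
w[np.abs(w) < threshold] = 0.0

w_csr = sparse.csr_matrix(w)          # store only the surviving weights
x = rng.standard_normal(512).astype(np.float32)
y = w_csr @ x                         # sparse matrix-vector product

dense_bytes = w.nbytes
sparse_bytes = w_csr.data.nbytes + w_csr.indices.nbytes + w_csr.indptr.nbytes
print(f"dense: {dense_bytes} B, CSR: {sparse_bytes} B")
```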

## Optimization Frameworks
@@ -756,14 +763,16 @@ One other framework for FPGAs that focuses on a holistic approach is CFU Playgro

## Hardware Built Around Software

- In a contrasting approach, hardware can be custom-designed around software requirements to optimize the performance for a specific application. This paradigm creates specialized hardware to better adapt to the specifics of the software, thus reducing computational overhead and improving operational efficiency. One example of this approach is a voice-recognition application by [resource number]. The paper proposes a structure wherein preprocessing operations, traditionally handled by software, are allocated to custom-designed hardware. This technique was achieved by introducing resistor–transistor logic to an inter-integrated circuit sound module for windowing and audio raw data acquisition in the voice-recognition application. Consequently, this offloading of preprocessing operations led to a reduction in computational load on the software, showcasing a practical application of building hardware around software to enhance the efficiency and performance. [1](https://www.mdpi.com/2076-3417/11/22/11073)
+ In a contrasting approach, hardware can be custom-designed around software requirements to optimize the performance for a specific application. This paradigm creates specialized hardware that better adapts to the specifics of the software, thus reducing computational overhead and improving operational efficiency. One example of this approach is the voice-recognition application described in [1](https://www.mdpi.com/2076-3417/11/22/11073). The paper proposes a structure wherein preprocessing operations, traditionally handled by software, are allocated to custom-designed hardware. This was achieved by introducing resistor–transistor logic to an inter-integrated circuit sound module for windowing and raw audio data acquisition in the voice-recognition application. Consequently, this offloading of preprocessing operations led to a reduction in the computational load on the software, showcasing a practical application of building hardware around software to enhance efficiency and performance. [1](https://www.mdpi.com/2076-3417/11/22/11073)

![A diagram showing how an FPGA was used to offload data preprocessing from the general-purpose computation unit. [1](https://www.mdpi.com/2076-3417/11/22/11073)](images/modeloptimization_preprocessor.png)
## SplitNets

- SplitNets were introduced in the context of Head-Mounted systems. They distribute the Deep Neural Networks (DNNs) workload among camera sensors and an aggregator. This is particularly compelling the in context of TinyML. The SplitNet framework is a split-aware NAS to find the optimal neural network architecture to achieve good accuracy, split the model among the sensors and the aggregator, and minimize the communication between the sensors and the aggregator. Minimal communication is important in TinyML where memory is highly constrained, this way the sensors conduct some of the processing on their chips and then they send only the necessary information to the aggregator. When testing on ImageNet, SplitNets were able to reduce the latency by one order of magnitude on head-mounted devices. This can be helpful when the sensor has it's own chip. [1](https://arxiv.org/pdf/2204.04705.pdf) ![A chart showing a comparison between the performance of SplitNets vs all on sensor and all on aggregator approaches. [1](https://arxiv.org/pdf/2204.04705.pdf)](images/modeloptimization_SplitNets.png)
+ SplitNets were introduced in the context of head-mounted systems. They distribute the Deep Neural Network (DNN) workload among camera sensors and an aggregator. This is particularly compelling in the context of TinyML. The SplitNet framework is a split-aware NAS that finds the optimal neural network architecture for good accuracy, splits the model between the sensors and the aggregator, and minimizes the communication between them. Minimal communication is important in TinyML, where memory is highly constrained: the sensors conduct some of the processing on their own chips and then send only the necessary information to the aggregator. When tested on ImageNet, SplitNets reduced latency by one order of magnitude on head-mounted devices. This can be helpful when the sensor has its own chip. [1](https://arxiv.org/pdf/2204.04705.pdf)

Review comment: Should be "its" instead of "it's".

+ ![A chart comparing the performance of SplitNets against all-on-sensor and all-on-aggregator approaches. [1](https://arxiv.org/pdf/2204.04705.pdf)](images/modeloptimization_SplitNets.png)
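This is not the SplitNets method itself, but a toy PyTorch sketch of the underlying idea: partition a network so that only a small intermediate tensor crosses the sensor-to-aggregator link. The architecture and split point are arbitrary, chosen only to show the pattern.

```python
import torch
import torch.nn as nn

# Toy network; sizes and split point are illustrative only.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
)

on_sensor = backbone[:4]      # runs on the camera sensor's compute
on_aggregator = backbone[4:]  # runs on the aggregator

image = torch.randn(1, 3, 64, 64)
intermediate = on_sensor(image)        # only this tensor is "transmitted"
logits = on_aggregator(intermediate)

print("transmitted elements:", intermediate.numel(), "vs raw image:", image.numel())
```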

## Hardware-Specific Data Augmentation
@@ -772,17 +781,17 @@ Each edge device may possess unique sensor characteristics, leading to specific

# Software and Framework Support

- While all of the aforementioned techniques like pruning (TODO LINK), quantization (TODO LINK), and efficient numerics are well-known, they would remain impractical and inaccessible without extensive software support. For example, directly quantizing weights and activations in a model would require manually modifying the model definition and inserting quantization operations throughout. Similarly, directly pruning model weights requires manipulating weight tensors. Such tedious approaches become infeasible at scale.
+ While all of the aforementioned techniques like [pruning](#sec-pruning), [quantization](#sec-quantization), and efficient numerics are well-known, they would remain impractical and inaccessible without extensive software support. For example, directly quantizing weights and activations in a model would require manually modifying the model definition and inserting quantization operations throughout. Similarly, directly pruning model weights requires manipulating weight tensors. Such tedious approaches become infeasible at scale.

Without the extensive software innovation across frameworks, optimization tools and hardware integration, most of these techniques would remain theoretical or only viable to experts. Without framework APIs and automation to simplify applying these optimizations, they would not see adoption. Software support makes them accessible to general practitioners and unlocks real-world benefits. In addition, issues such as hyperparameter tuning for pruning, managing the trade-off between model size and accuracy, and ensuring compatibility with target devices pose hurdles that developers must navigate.

## Built-in Optimization APIs

Major machine learning frameworks like TensorFlow, PyTorch, and MXNet provide libraries and APIs to allow common model optimization techniques to be applied without requiring custom implementations. For example, TensorFlow offers the TensorFlow Model Optimization Toolkit, which contains modules like the following (a brief usage sketch follows the list):

- - quantization (TODO HYPERLINK) - Applies quantization-aware training to convert floating point models to lower precision like int8 with minimal accuracy loss. Handles weight and activation quantization.
- - sparsity - Provides pruning APIs to induce sparsity and remove unnecessary connections in models like neural networks. Can prune weights, layers, etc.
- - clustering - Supports model compression by clustering weights into groups for higher compression rates.
+ - [quantization](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/quantization/keras/quantize_model) - Applies quantization-aware training to convert floating point models to lower precision like int8 with minimal accuracy loss. Handles weight and activation quantization.
+ - [sparsity](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras) - Provides pruning APIs to induce sparsity and remove unnecessary connections in models like neural networks. Can prune weights, layers, etc.
+ - [clustering](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/clustering) - Supports model compression by clustering weights into groups for higher compression rates.
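The usage sketch referenced above shows how these three modules are typically invoked on a Keras model. The model itself is a placeholder, and exact arguments may vary across toolkit versions.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# Quantization-aware training: wraps layers with fake-quant ops.
qat_model = tfmot.quantization.keras.quantize_model(model)

# Magnitude pruning: gradually drive 50% of the weights to zero during training.
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000),
)

# Weight clustering: share 16 centroid values across each layer's weights.
clustered_model = tfmot.clustering.keras.cluster_weights(
    model,
    number_of_clusters=16,
    cluster_centroids_init=tfmot.clustering.keras.CentroidInitialization.LINEAR,
)
```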

These APIs allow users to enable optimization techniques like quantization and pruning without directly modifying model code. Parameters like target sparsity rates, quantization bit-widths etc. can be configured. Similarly, PyTorch provides torch.quantization for converting models to lower precision representations. TorchTensor and TorchModule form the base classes for quantization support. It also offers torch.nn.utils.prune for built-in pruning of models. MXNet offers gluon.contrib layers that add quantization capabilities like fixed point rounding and stochastic rounding of weights/activations during training. This allows quantization to be readily included in gluon models.
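A minimal sketch of the built-in PyTorch pruning utility mentioned above; the layer and the 50% pruning amount are arbitrary choices for the example.

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(128, 64)

# Zero out the 50% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.5)
prune.remove(layer, "weight")  # fold the pruning mask into the weight tensor permanently

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity after pruning: {sparsity:.2f}")
```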
@@ -796,7 +805,7 @@ Automated optimization tools provided by frameworks can analyze models and autom

- [Pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras) - Automatically removes unnecessary connections in a model based on analysis of weight importance. Can prune entire filters in convolutional layers or attention heads in transformers. Handles iterative re-training to recover any accuracy loss.
- [GraphOptimizer](https://www.tensorflow.org/guide/graph_optimization) - Applies graph optimizations like operator fusion to consolidate operations and reduce execution latency, especially for inference.

- [Figure for GraphOptimizer](Before/after diagram showing GraphOptimizer fusing operators in a sample graph.)
+ ![Before/after diagram showing GraphOptimizer fusing operators in a sample graph](images/modeloptimization_graph_optimization.png)

These automated modules only require the user to provide the original floating point model, and handle the end-to-end optimization pipeline including any re-training to regain accuracy. Other frameworks like PyTorch also offer increasing automation support, for example through torch.quantization.quantize\_dynamic. Automated optimization makes efficient ML accessible to practitioners without optimization expertise.
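A short sketch of the dynamic quantization path mentioned above; the model is a placeholder, and the exact namespace may differ across PyTorch versions.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(784, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Convert Linear layers to int8 weights; activations are quantized on the fly at inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 784)
print(quantized(x).shape)  # same interface as the float model, smaller weights
```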
@@ -810,7 +819,7 @@ Kernel Optimization: For instance, TensorRT does auto-tuning to optimize CUDA ke

Operator Fusion: TensorFlow XLA does aggressive fusion to create optimized binary for TPUs. On mobile, frameworks like NCNN also support fused operators.

- Hardware-Specific Code: Libraries are used to generate optimized binary code specialized for the target hardware. For example, TensorRT (LINK PLEASE) uses Nvidia CUDA/cuDNN libraries which are hand-tuned for each GPU architecture. This hardware-specific coding is key for performance. On tinyML devices, this can mean assembly code optimized for a Cortex M4 CPU for example. Vendors provide CMSIS-NN and other libraries.
+ Hardware-Specific Code: Libraries are used to generate optimized binary code specialized for the target hardware. For example, [TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html) uses Nvidia CUDA/cuDNN libraries which are hand-tuned for each GPU architecture. This hardware-specific coding is key for performance. On tinyML devices, this can mean assembly code optimized for a Cortex-M4 CPU, for example. Vendors provide CMSIS-NN and other libraries.

Data Layout Optimizations - We can efficiently leverage the memory hierarchy of hardware, like caches and registers, through techniques like tensor/weight rearrangement, tiling, and reuse. For example, TensorFlow XLA optimizes buffer layouts to maximize TPU utilization. This helps any memory-constrained system.
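A small NumPy sketch of one layout transformation of the kind described above: rearranging activations from channels-first to channels-last so that per-pixel channel vectors become contiguous in memory. Whether this layout is the faster one depends entirely on the target hardware and kernels; the tensor shape is arbitrary.

```python
import numpy as np

# Activations stored channels-first (N, C, H, W), as many frameworks do by default.
nchw = np.random.rand(8, 32, 56, 56).astype(np.float32)

# Rearranged to channels-last (N, H, W, C): each pixel's channel vector is contiguous,
# which suits kernels that accumulate across channels in the innermost loop.
nhwc = np.ascontiguousarray(nchw.transpose(0, 2, 3, 1))

print(nchw.strides, nhwc.strides)
```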
Review comment: I like the examples presented here of challenges in numerics representation in ML models. I think it would be helpful to also connect these examples to the TinyML field. Are the challenges similar across ML and TinyML? Are there any key differences for numerics representation in TinyML?