diff --git a/optimizations.qmd b/optimizations.qmd
index 02c33084..1f947d93 100644
--- a/optimizations.qmd
+++ b/optimizations.qmd
@@ -632,7 +632,7 @@ Efficient hardware implementation transcends the selection of suitable component

Focusing only on accuracy when performing Neural Architecture Search leads to models that are exponentially complex and require increasing memory and compute. As a result, hardware constraints limit how far such deep learning models can be exploited. Manually designing model architectures becomes even harder when accounting for the variety of hardware platforms and their limitations. This has led to the creation of Hardware-aware Neural Architecture Search (HW-NAS), which incorporates hardware constraints into the search and optimizes the search space for a specific hardware target and accuracy. HW-NAS can be categorized based on how it optimizes for hardware. We will briefly explore these categories and leave links to related papers for the interested reader.

-![Taxonomy of HW-NAS (Benmeziane et al. ([2021](https://www.ijcai.org/proceedings/2021/592)))](images/modeloptimization_HW-NAS.png)
+![Taxonomy of HW-NAS [@ijcai2021p592]](images/modeloptimization_HW-NAS.png)

#### Single Target, Fixed Platform Configuration

@@ -645,15 +645,15 @@ Here, the search is a multi-objective optimization problem, where both the accur

##### Hardware-aware Search Space

-Here, the search space is restricted to the architectures that perform well on the specific hardware. This can be achieved by either measuring the operators (Conv operator, Pool operator, …) performance, or define a set of rules that limit the search space. (Zhang et al. ([2020](https://openaccess.thecvf.com/content_CVPRW_2020/html/w40/Zhang_Fast_Hardware-Aware_Neural_Architecture_Search_CVPRW_2020_paper.html)))
+Here, the search space is restricted to architectures that perform well on the specific hardware. This can be achieved either by measuring the performance of individual operators (Conv operator, Pool operator, …) or by defining a set of rules that limit the search space. [@Zhang_2020_CVPR_Workshops]

#### Single Target, Multiple Platform Configurations

-Some hardwares may have different configurations. For example, FPGAs have Configurable Logic Blocks (CLBs) that can be configured by the firmware. This method allows for the HW-NAS to explore different configurations. (Jiang et al. ([2019](https://arxiv.org/abs/1901.11211)))(Yang et al. ([2020](https://arxiv.org/abs/2002.04116)))
+Some hardware platforms can be configured in different ways. For example, FPGAs have Configurable Logic Blocks (CLBs) that can be configured by the firmware. This method allows HW-NAS to explore different hardware configurations alongside the architecture. [@jiang2019accuracy; @yang2020coexploration]

#### Multiple Targets

-This category aims at optimizing a single model for multiple hardwares. This can be helpful for mobile devices development as it can optimize to different phones models. (Chu et al. ([2020](https://arxiv.org/abs/2008.08178)))(Jiang et al. ([2020](https://ieeexplore.ieee.org/document/9102721)))
+This category aims at optimizing a single model for multiple hardware targets. This can be helpful in mobile development, where a single search can yield models tuned to different phone models. [@chu2021discovering; @jiang2019accuracy]
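
Across these categories, the common thread is that hardware cost enters the search objective alongside accuracy. The sketch below illustrates that idea in a deliberately simplified form; it is not taken from any specific HW-NAS system, and `candidate_accuracy` and `measure_latency_ms` are hypothetical stand-ins for a proxy accuracy estimator and an on-device (or lookup-table) latency model.

```python
# Illustrative hardware-aware search objective (a sketch, not a specific
# HW-NAS algorithm): candidates are ranked by accuracy, penalized when
# their measured latency exceeds the budget of the deployment hardware.

def hw_aware_score(accuracy, latency_ms, budget_ms=10.0, penalty_weight=0.07):
    """Scalarize accuracy and hardware cost into a single search reward."""
    overshoot = max(0.0, latency_ms / budget_ms - 1.0)
    return accuracy * (1.0 - penalty_weight * overshoot)

def pick_best(candidates, candidate_accuracy, measure_latency_ms):
    """Return the candidate architecture with the best combined score.

    `candidate_accuracy` and `measure_latency_ms` are hypothetical callables
    supplied by the surrounding NAS framework.
    """
    scored = [(hw_aware_score(candidate_accuracy(a), measure_latency_ms(a)), a)
              for a in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```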

#### Examples of Hardware-Aware Neural Architecture Search

@@ -663,14 +663,14 @@ TinyNAS adopts a two stage approach to finding an optimal architecture for model

First, TinyNAS generates multiple search spaces by varying the input resolution of the model and the number of channels of its layers. Then, TinyNAS chooses a search space based on the FLOPs (floating point operations) of the models each search space contains.

-Then, TinyNAS performs a search operation on the chosen space to find the optimal architecture for the specific constraints of the microcontroller. (Han et al. ([2020](https://arxiv.org/abs/2007.10319)))
+Finally, TinyNAS performs a search within the chosen space to find the optimal architecture for the specific constraints of the microcontroller. [@lin2020mcunet]

-![A diagram showing how search spaces with high probability of finding an architecture with large number of FLOPs provide models with higher accuracy (Han et al. ([2020](https://arxiv.org/abs/2007.10319)))](images/modeloptimization_TinyNAS.png)
+![A diagram showing how search spaces with a high probability of containing architectures with a large number of FLOPs yield models with higher accuracy [@lin2020mcunet]](images/modeloptimization_TinyNAS.png)
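
A rough sketch of the first stage is shown below. It is a simplification, not the MCUNet implementation: `sample_architecture` and `estimate_flops` are hypothetical helpers, and the heuristic of preferring the space whose feasible samples have the highest mean FLOPs is a stand-in for the probability-based analysis used in the paper.

```python
import statistics

def choose_search_space(spaces, sample_architecture, estimate_flops,
                        flops_cap, n_samples=1000):
    """Stage 1 of a TinyNAS-style search, heavily simplified.

    For each candidate space (e.g., a (resolution, width multiplier) pair),
    sample random architectures, keep those under the FLOPs cap implied by
    the microcontroller, and prefer the space whose feasible samples have
    the highest mean FLOPs, i.e., the space that best fills the budget.
    """
    best_space, best_mean = None, -1.0
    for space in spaces:
        feasible = []
        for _ in range(n_samples):
            arch = sample_architecture(space)   # hypothetical helper
            flops = estimate_flops(arch)        # hypothetical helper
            if flops <= flops_cap:
                feasible.append(flops)
        if feasible and statistics.mean(feasible) > best_mean:
            best_space, best_mean = space, statistics.mean(feasible)
    return best_space
```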

#### Topology-Aware NAS

-Focuses on creating and optimizing a search space that aligns with the hardware topology of the device. (Zhang et al. ([2019](https://arxiv.org/pdf/1911.09251.pdf)))
+This approach focuses on creating and optimizing a search space that aligns with the hardware topology of the device. [@zhang2019autoshrink]

### Challenges of Hardware-Aware Neural Architecture Search

@@ -698,13 +698,13 @@ Similarly to blocking, tiling divides data and computation into chunks, but exte

##### Optimized Kernel Libraries

-This comprises developing optimized kernels that take full advantage of a specific hardware. One example is the CMSIS-NN library, which is a collection of efficient neural network kernels developed to optimize the performance and minimize the memory footprint of models on Arm Cortex-M processors, which are common on IoT edge devices. The kernel leverage multiple hardware capabilities of Cortex-M processors like Single Instruction Multple Data (SIMD), Floating Point Units (FPUs) and M-Profile Vector Extensions (MVE). These optimization make common operations like matrix multiplications more efficient, boosting the performance of model operations on Cortex-M processors. (Lai et al. ([2018](https://arxiv.org/abs/1801.06601#:~:text=This%20paper%20presents%20CMSIS,for%20intelligent%20IoT%20edge%20devices)))
+This comprises developing optimized kernels that take full advantage of a specific hardware platform. One example is the CMSIS-NN library, a collection of efficient neural network kernels developed to optimize performance and minimize the memory footprint of models on Arm Cortex-M processors, which are common on IoT edge devices. The kernels leverage hardware capabilities of Cortex-M processors such as Single Instruction Multiple Data (SIMD), Floating Point Units (FPUs), and M-Profile Vector Extensions (MVE). These optimizations make common operations like matrix multiplication more efficient, boosting the performance of model operations on Cortex-M processors. [@lai2018cmsisnn]

### Compute-in-Memory (CiM)

-This is one example of Algorithm-Hardware Co-design. CiM is a computing paradigm that performs computation within memory. Therefore, CiM architectures allow for operations to be performed directly on the stored data, without the need to shuttle data back and forth between separate processing and memory units. This design paradigm is particularly beneficial in scenarios where data movement is a primary source of energy consumption and latency, such as in TinyML applications on edge devices. Through algorithm-hardware co-design, the algorithms can be optimized to leverage the unique characteristics of CiM architectures, and conversely, the CiM hardware can be customized or configured to better support the computational requirements and characteristics of the algorithms. This is achieved by using the analog properties of memory cells, such as addition and multiplication in DRAM. (Zhou et al. ([2021](https://arxiv.org/abs/2111.06503)))
+This is one example of algorithm-hardware co-design. CiM is a computing paradigm that performs computation within memory. CiM architectures therefore allow operations to be performed directly on the stored data, without the need to shuttle data back and forth between separate processing and memory units. This design paradigm is particularly beneficial in scenarios where data movement is a primary source of energy consumption and latency, such as in TinyML applications on edge devices. Through algorithm-hardware co-design, the algorithms can be optimized to leverage the unique characteristics of CiM architectures, and conversely, the CiM hardware can be customized or configured to better support the computational requirements of the algorithms. This is achieved by exploiting the analog properties of memory cells to perform operations such as addition and multiplication directly in DRAM. [@zhou2021analognets]

-![A figure showing how Computing in Memory can be used for always-on tasks to offload tasks of the power consuming processing unit [1](https://arxiv.org/abs/2111.06503)](images/modeloptimization_CiM.png)
+![A figure showing how Compute-in-Memory can be used for always-on tasks, offloading them from the power-consuming processing unit [@zhou2021analognets]](images/modeloptimization_CiM.png)

### Memory Access Optimization

@@ -712,31 +712,31 @@ Different devices may have different memory hierarchies. Optimizing for the spec

### Leveraging Sparsity

-Pruning is a fundamental approach to compress models to make them compatible with resource constrained devices. This results in sparse models where a lot of weights are 0's. Therefore, leveraging this sparsity can lead to significant improvements in performance. Tools were created to achieve exactly this. RAMAN, is a sparseTinyML accelerator designed for inference on edge devices. RAMAN overlap input and output activations on the same memory space, reducing storage requirements by up to 50%. (Krishna et al. ([2023](https://ar5iv.labs.arxiv.org/html/2306.06493)))
+Pruning is a fundamental approach for compressing models so that they fit on resource-constrained devices. It produces sparse models in which many of the weights are zero, and leveraging this sparsity can lead to significant performance improvements. Tools have been created to do exactly this. RAMAN is a sparse TinyML accelerator designed for inference on edge devices. RAMAN overlaps input and output activations in the same memory space, reducing storage requirements by up to 50%. [@krishna2023raman]

-![A figure showing the sparse columns of the filter matrix of a CNN that are aggregated to create a dense matrix that, leading to smaller dimensions in the matrix and more efficient computations (Kung et al. ([2018](https://arxiv.org/abs/1811.04770)))](images/modeloptimization_sparsity.png)
+![A figure showing how the sparse columns of a CNN's filter matrix are aggregated into a dense matrix, leading to smaller matrix dimensions and more efficient computations [@kung2018packing]](images/modeloptimization_sparsity.png)
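
To make the benefit of sparsity concrete, the short sketch below compares a dense and a compressed-sparse-row (CSR) representation of a pruned layer using SciPy. The 90% sparsity level and the layer shape are arbitrary placeholders; the point is simply that the sparse representation stores and multiplies only the non-zero weights.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# A pruned fully-connected layer: roughly 90% of the weights are exactly zero.
dense_w = rng.standard_normal((512, 1024))
dense_w[rng.random(dense_w.shape) < 0.9] = 0.0

sparse_w = csr_matrix(dense_w)   # stores only the non-zero weights
x = rng.standard_normal(1024)

y_dense = dense_w @ x            # touches every weight, zero or not
y_sparse = sparse_w @ x          # skips the zeros entirely

assert np.allclose(y_dense, y_sparse)
print(f"stored weights: {sparse_w.nnz} of {dense_w.size} "
      f"({sparse_w.nnz / dense_w.size:.0%} of the dense layer)")
```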

### Optimization Frameworks

Optimization frameworks have been introduced to exploit the specific capabilities of the hardware to accelerate the software. One example of such a framework is hls4ml. This open-source software-hardware co-design workflow aids in interpreting and translating machine learning algorithms for implementation with both FPGA and ASIC technologies, enhancing their efficiency. Features such as network optimization, new Python APIs, quantization-aware pruning, and end-to-end FPGA workflows are embedded into the hls4ml framework, leveraging parallel processing units, memory hierarchies, and specialized instruction sets to optimize models for edge hardware. Moreover, hls4ml can translate machine learning algorithms directly into FPGA firmware.

-![A Diagram showing the workflow with the hls4ml framework (Fahim et al. ([2021](https://arxiv.org/pdf/2103.05579.pdf)))](images/modeloptimization_hls4ml.png)
+![A diagram showing the workflow of the hls4ml framework [@fahim2021hls4ml]](images/modeloptimization_hls4ml.png)
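
A minimal sketch of what the Python side of this workflow can look like for a small Keras model is shown below. Exact function names and arguments may differ between hls4ml releases, the model is a stand-in for a trained network, and the FPGA part number is a placeholder, so treat this as an outline rather than a drop-in script.

```python
import hls4ml
from tensorflow import keras

# Small Keras model standing in for a trained network.
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(16,)),
    keras.layers.Dense(5, activation="softmax"),
])

# Derive an HLS configuration (precision, reuse factors, ...) from the model.
config = hls4ml.utils.config_from_keras_model(model, granularity="model")

# Translate the model into an HLS project targeting a specific FPGA part.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="hls4ml_prj",
    part="xcu250-figd2104-2L-e",   # placeholder part number
)

hls_model.compile()   # C simulation for quick functional validation
# hls_model.build()   # full HLS synthesis toward FPGA firmware (long-running)
```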

-One other framework for FPGAs that focuses on a holistic approach is CFU Playground (Prakash et al. ([2022](https://arxiv.org/abs/2201.01863)))
+Another framework for FPGAs that takes a more holistic approach is CFU Playground [@Prakash_2023].

### Hardware Built Around Software

-In a contrasting approach, hardware can be custom-designed around software requirements to optimize the performance for a specific application. This paradigm creates specialized hardware to better adapt to the specifics of the software, thus reducing computational overhead and improving operational efficiency. One example of this approach is a voice-recognition application by (Kwon et al. ([2021](https://www.mdpi.com/2076-3417/11/22/11073))). The paper proposes a structure wherein preprocessing operations, traditionally handled by software, are allocated to custom-designed hardware. This technique was achieved by introducing resistor–transistor logic to an inter-integrated circuit sound module for windowing and audio raw data acquisition in the voice-recognition application. Consequently, this offloading of preprocessing operations led to a reduction in computational load on the software, showcasing a practical application of building hardware around software to enhance the efficiency and performance.
+In a contrasting approach, hardware can be custom-designed around software requirements to optimize performance for a specific application. This paradigm creates specialized hardware that adapts to the specifics of the software, reducing computational overhead and improving operational efficiency. One example of this approach is the voice-recognition application of [@app112211073]. The paper proposes a structure wherein preprocessing operations, traditionally handled by software, are allocated to custom-designed hardware. This was achieved by introducing resistor–transistor logic to an inter-integrated circuit sound module for windowing and raw audio data acquisition in the voice-recognition application. Consequently, offloading these preprocessing operations reduced the computational load on the software, showcasing a practical application of building hardware around software to enhance efficiency and performance.

-![A diagram showing how an FPGA was used to offload data preprocessing of the general purpose computation unit. (Kwon et al. ([2021](https://www.mdpi.com/2076-3417/11/22/11073)))](images/modeloptimization_preprocessor.png)
+![A diagram showing how an FPGA was used to offload data preprocessing from the general-purpose computation unit. [@app112211073]](images/modeloptimization_preprocessor.png)

### SplitNets

-SplitNets were introduced in the context of Head-Mounted systems. They distribute the Deep Neural Networks (DNNs) workload among camera sensors and an aggregator. This is particularly compelling the in context of TinyML. The SplitNet framework is a split-aware NAS to find the optimal neural network architecture to achieve good accuracy, split the model among the sensors and the aggregator, and minimize the communication between the sensors and the aggregator. Minimal communication is important in TinyML where memory is highly constrained, this way the sensors conduct some of the processing on their chips and then they send only the necessary information to the aggregator. When testing on ImageNet, SplitNets were able to reduce the latency by one order of magnitude on head-mounted devices. This can be helpful when the sensor has its own chip. (Dong et al. ([2022](https://arxiv.org/pdf/2204.04705.pdf)))
+SplitNets were introduced in the context of head-mounted systems. They distribute the Deep Neural Network (DNN) workload among camera sensors and an aggregator. This is particularly compelling in the context of TinyML. The SplitNet framework uses a split-aware NAS to find a neural network architecture that achieves good accuracy, to split the model between the sensors and the aggregator, and to minimize the communication between them. Minimal communication is important in TinyML, where memory is highly constrained: the sensors perform part of the processing on their own chips and send only the necessary information to the aggregator. When tested on ImageNet, SplitNets reduced latency by an order of magnitude on head-mounted devices. This is especially helpful when each sensor has its own chip. [@dong2022splitnets]

-![A chart showing a comparison between the performance of SplitNets vs all on sensor and all on aggregator approaches. (Dong et al. ([2022](https://arxiv.org/pdf/2204.04705.pdf)))](images/modeloptimization_SplitNets.png)
+![A chart comparing the performance of SplitNets with all-on-sensor and all-on-aggregator approaches. [@dong2022splitnets]](images/modeloptimization_SplitNets.png)

### Hardware Specific Data Augmentation

@@ -803,7 +803,7 @@ For example, consider sparsity optimizations. Sparsity visualization tools can p

Trend plots can also track sparsity over successive pruning rounds - they may show initial rapid pruning followed by more gradual incremental increases. Tracking the current global sparsity along with per-layer statistics like average, minimum, and maximum sparsity in tables or plots provides an overview of the model composition.

For a sample convolutional network, these tools could reveal that the first convolution layer is pruned 20% while the final classifier layer is pruned 70% given its redundancy. The global model sparsity may increase from 10% after initial pruning to 40% after five rounds.

-![A figure showing the sparse columns of the filter matrix of a CNN that are aggregated to create a dense matrix that, leading to smaller dimensions in the matrix and more efficient computations (Kung et al. ([2018](https://arxiv.org/abs/1811.04770)))](images/modeloptimization_sparsity.png)
+![A figure showing how the sparse columns of a CNN's filter matrix are aggregated into a dense matrix, leading to smaller matrix dimensions and more efficient computations [@kung2018packing]](images/modeloptimization_sparsity.png)
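
A short sketch of how such per-layer and global sparsity statistics might be collected for a PyTorch model is shown below; the model passed in is assumed to have already been pruned, and the function simply counts exact zeros in each weight tensor.

```python
import torch

def sparsity_report(model: torch.nn.Module) -> None:
    """Print per-layer and global weight sparsity for a (pruned) model."""
    total_zeros, total_params = 0, 0
    for name, param in model.named_parameters():
        if "weight" not in name:
            continue                      # skip biases and other parameters
        zeros = int((param == 0).sum())
        count = param.numel()
        total_zeros += zeros
        total_params += count
        print(f"{name:<40s} sparsity = {zeros / count:6.1%}")
    print(f"{'global':<40s} sparsity = {total_zeros / total_params:6.1%}")

# Usage (with a hypothetical pruned model):
# sparsity_report(pruned_model)
```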

By making sparsity data visually accessible, practitioners can better understand exactly how their model is being optimized and which areas are being impacted. This visibility enables them to fine-tune and control the pruning process for a given architecture.

@@ -813,7 +813,7 @@ Sparsity visualization turns pruning into a transparent technique instead of a b

Converting models to lower numeric precisions through quantization introduces errors that can impact model accuracy if not properly tracked and addressed. Visualizing quantization error distributions provides valuable insights into the effects of reduced-precision numerics applied to different parts of a model. For this, histograms of the quantization errors for weights and activations can be generated. These histograms can reveal the shape of the error distribution - whether the errors resemble a Gaussian distribution or contain significant outliers and spikes. Large outliers may indicate issues with how particular layers handle quantization. Comparing the histograms across layers highlights any problem areas that stand out with abnormally high errors.

-![A smooth histogram of quantization error. (Kuzmin et al. ([2021](https://arxiv.org/pdf/2208.09225.pdf)))](images/modeloptimization_quant_hist.png)
+![A smooth histogram of quantization error. [@kuzmin2022fp8]](images/modeloptimization_quant_hist.png)
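
The sketch below shows one way such a histogram can be generated for a single weight tensor, using simple symmetric per-tensor int8 quantization as a stand-in for whatever scheme is actually deployed; the random weights are placeholders for a real layer's parameters.

```python
import numpy as np
import matplotlib.pyplot as plt

def int8_quant_error(weights: np.ndarray) -> np.ndarray:
    """Error introduced by symmetric per-tensor int8 quantization."""
    scale = np.abs(weights).max() / 127.0
    dequantized = np.clip(np.round(weights / scale), -127, 127) * scale
    return weights - dequantized

# Stand-in for one layer's trained weights.
weights = np.random.default_rng(0).standard_normal(10_000) * 0.05

plt.hist(int8_quant_error(weights), bins=100)
plt.xlabel("quantization error")
plt.ylabel("count")
plt.title("Per-tensor int8 quantization error")
plt.show()
```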

Activation visualizations are also important to detect overflow issues. By color mapping the activations before and after quantization, any values pushed outside the intended ranges become visible. This reveals saturation and truncation issues that could skew the information flowing through the model. Detecting these errors allows recalibrating activations to prevent loss of information. (Mandal ([2022](https://medium.com/exemplifyml-ai/visualizing-neural-network-activation-a27caa451ff)))

diff --git a/references.bib b/references.bib
index 83991aa9..4501f577 100644
--- a/references.bib
+++ b/references.bib
@@ -1153,4 +1153,167 @@ @misc{quantdeep
  author = {Krishnamoorthi},
  month = jun,
  year = {2018},
+}
+@inproceedings{ijcai2021p592,
+  title = {Hardware-Aware Neural Architecture Search: Survey and Taxonomy},
+  author = {Benmeziane, Hadjer and El Maghraoui, Kaoutar and Ouarnoughi, Hamza and Niar, Smail and Wistuba, Martin and Wang, Naigang},
+  booktitle = {Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, {IJCAI-21}},
+  publisher = {International Joint Conferences on Artificial Intelligence Organization},
+  editor = {Zhi-Hua Zhou},
+  pages = {4322--4329},
+  year = {2021},
+  month = {8},
+  note = {Survey Track},
+  doi = {10.24963/ijcai.2021/592},
+  url = {https://doi.org/10.24963/ijcai.2021/592},
+}
+
+@inproceedings{Zhang_2020_CVPR_Workshops,
+  author = {Zhang, Li Lyna and Yang, Yuqing and Jiang, Yuhang and Zhu, Wenwu and Liu, Yunxin},
+  title = {Fast Hardware-Aware Neural Architecture Search},
+  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
+  month = {June},
+  year = {2020},
+}
+
+@misc{jiang2019accuracy,
+  title = {Accuracy vs. Efficiency: Achieving Both through FPGA-Implementation Aware Neural Architecture Search},
+  author = {Weiwen Jiang and Xinyi Zhang and Edwin H. -M. Sha and Lei Yang and Qingfeng Zhuge and Yiyu Shi and Jingtong Hu},
+  year = {2019},
+  eprint = {1901.11211},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.DC},
+}
+
+@misc{yang2020coexploration,
+  title = {Co-Exploration of Neural Architectures and Heterogeneous ASIC Accelerator Designs Targeting Multiple Tasks},
+  author = {Lei Yang and Zheyu Yan and Meng Li and Hyoukjun Kwon and Liangzhen Lai and Tushar Krishna and Vikas Chandra and Weiwen Jiang and Yiyu Shi},
+  year = {2020},
+  eprint = {2002.04116},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.LG},
+}
+
+@misc{chu2021discovering,
+  title = {Discovering Multi-Hardware Mobile Models via Architecture Search},
+  author = {Grace Chu and Okan Arikan and Gabriel Bender and Weijun Wang and Achille Brighton and Pieter-Jan Kindermans and Hanxiao Liu and Berkin Akin and Suyog Gupta and Andrew Howard},
+  year = {2021},
+  eprint = {2008.08178},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.CV},
+}
+
+@misc{lin2020mcunet,
+  title = {MCUNet: Tiny Deep Learning on IoT Devices},
+  author = {Ji Lin and Wei-Ming Chen and Yujun Lin and John Cohn and Chuang Gan and Song Han},
+  year = {2020},
+  eprint = {2007.10319},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.CV},
+}
+
+@misc{zhang2019autoshrink,
+  title = {AutoShrink: A Topology-aware NAS for Discovering Efficient Neural Architecture},
+  author = {Tunhou Zhang and Hsin-Pai Cheng and Zhenwen Li and Feng Yan and Chengyu Huang and Hai Li and Yiran Chen},
+  year = {2019},
+  eprint = {1911.09251},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.LG},
+}
+
+@misc{lai2018cmsisnn,
+  title = {CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs},
+  author = {Liangzhen Lai and Naveen Suda and Vikas Chandra},
+  year = {2018},
+  eprint = {1801.06601},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.NE},
+}
+
+@misc{zhou2021analognets,
+  title = {AnalogNets: ML-HW Co-Design of Noise-robust TinyML Models and Always-On Analog Compute-in-Memory Accelerator},
+  author = {Chuteng Zhou and Fernando Garcia Redondo and Julian Büchel and Irem Boybat and Xavier Timoneda Comas and S. R. Nandakumar and Shidhartha Das and Abu Sebastian and Manuel Le Gallo and Paul N. Whatmough},
+  year = {2021},
+  eprint = {2111.06503},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.AR},
+}
+
+@misc{krishna2023raman,
+  title = {RAMAN: A Re-configurable and Sparse tinyML Accelerator for Inference on Edge},
+  author = {Adithya Krishna and Srikanth Rohit Nudurupati and Chandana D G and Pritesh Dwivedi and André van Schaik and Mahesh Mehendale and Chetan Singh Thakur},
+  year = {2023},
+  eprint = {2306.06493},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.NE},
+}
+
+@misc{kung2018packing,
+  title = {Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization},
+  author = {H. T. Kung and Bradley McDanel and Sai Qian Zhang},
+  year = {2018},
+  eprint = {1811.04770},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.LG},
+}
+
+@misc{fahim2021hls4ml,
+  title = {hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices},
+  author = {Farah Fahim and Benjamin Hawks and Christian Herwig and James Hirschauer and Sergo Jindariani and Nhan Tran and Luca P. Carloni and Giuseppe Di Guglielmo and Philip Harris and Jeffrey Krupa and Dylan Rankin and Manuel Blanco Valentin and Josiah Hester and Yingyi Luo and John Mamish and Seda Orgrenci-Memik and Thea Aarrestad and Hamza Javed and Vladimir Loncar and Maurizio Pierini and Adrian Alan Pol and Sioni Summers and Javier Duarte and Scott Hauck and Shih-Chieh Hsu and Jennifer Ngadiuba and Mia Liu and Duc Hoang and Edward Kreinar and Zhenbin Wu},
+  year = {2021},
+  eprint = {2103.05579},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.LG},
+}
+
+@inproceedings{Prakash_2023,
+  title = {{CFU} Playground: Full-Stack Open-Source Framework for Tiny Machine Learning ({TinyML}) Acceleration on {FPGAs}},
+  author = {Shvetank Prakash and Tim Callahan and Joseph Bushagour and Colby Banbury and Alan V. Green and Pete Warden and Tim Ansell and Vijay Janapa Reddi},
+  booktitle = {2023 {IEEE} International Symposium on Performance Analysis of Systems and Software ({ISPASS})},
+  publisher = {{IEEE}},
+  year = {2023},
+  month = {apr},
+  doi = {10.1109/ispass57527.2023.00024},
+  url = {https://doi.org/10.1109%2Fispass57527.2023.00024},
+}
+
+@article{app112211073,
+  author = {Kwon, Jisu and Park, Daejin},
+  title = {Hardware/Software Co-Design for TinyML Voice-Recognition Application on Resource Frugal Edge Devices},
+  journal = {Applied Sciences},
+  volume = {11},
+  year = {2021},
+  number = {22},
+  article-number = {11073},
+  url = {https://www.mdpi.com/2076-3417/11/22/11073},
+  issn = {2076-3417},
+  abstract = {On-device artificial intelligence has attracted attention globally, and attempts to combine the internet of things and TinyML (machine learning) applications are increasing. Although most edge devices have limited resources, time and energy costs are important when running TinyML applications. In this paper, we propose a structure in which the part that preprocesses externally input data in the TinyML application is distributed to the hardware. These processes are performed using software in the microcontroller unit of an edge device. Furthermore, resistor–transistor logic, which perform not only windowing using the Hann function, but also acquire audio raw data, is added to the inter-integrated circuit sound module that collects audio data in the voice-recognition application. As a result of the experiment, the windowing function was excluded from the TinyML application of the embedded board. When the length of the hardware-implemented Hann window is 80 and the quantization degree is 2^-5, the exclusion causes a decrease in the execution time of the front-end function and energy consumption by 8.06% and 3.27%, respectively.},
+  doi = {10.3390/app112211073},
+}
+
+@misc{dong2022splitnets,
+  title = {SplitNets: Designing Neural Architectures for Efficient Distributed Computing on Head-Mounted Systems},
+  author = {Xin Dong and Barbara De Salvo and Meng Li and Chiao Liu and Zhongnan Qu and H. T. Kung and Ziyun Li},
+  year = {2022},
+  eprint = {2204.04705},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.LG},
+}
+
+@misc{kuzmin2022fp8,
+  title = {FP8 Quantization: The Power of the Exponent},
+  author = {Andrey Kuzmin and Mart Van Baalen and Yuwei Ren and Markus Nagel and Jorn Peters and Tijmen Blankevoort},
+  year = {2022},
+  eprint = {2208.09225},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.LG},
+}
\ No newline at end of file