title	FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search
openreview	d69NqU8YmM
abstract	Quantization has become a mainstream compression technique for reducing model size, computational requirements, and energy consumption for modern deep neural networks (DNNs). With improved numerical support in recent hardware, including multiple variants of integer and floating point, mixed-precision quantization has become necessary to achieve high-quality results with low model cost. Prior mixed-precision methods have performed either a post-training quantization search, which compromises on accuracy, or a differentiable quantization search, which leads to high memory usage from branching. Therefore, we propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models. We evaluate our search (FLIQS) on multiple convolutional and vision transformer networks to discover Pareto-optimal models. Our approach improves upon uniform precision, manual mixed-precision, and recent integer quantization search methods. With integer models, we increase the accuracy of ResNet-18 on ImageNet by 1.3% points and ResNet-50 by 0.90% points with equivalent model cost over previous methods. Additionally, for the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% points compared to prior state-of-the-art FP8 models. Finally, we extend FLIQS to simultaneously search a joint quantization and neural architecture space and improve the ImageNet accuracy by 2.69% points with similar model cost on a MobileNetV2 search space.
layout	inproceedings
series	Proceedings of Machine Learning Research
publisher	PMLR
issn	2640-3498
id	dotzel24a
month	0
tex_title	FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search
firstpage	6/1
lastpage	26
page	6/1-26
order	6
cycles	false
bibtex_author	Dotzel, Jordan and Wu, Gang and Li, Andrew and Umar, Muhammad and Ni, Yun and Abdelfattah, Mohamed S and Zhang, Zhiru and Cheng, Liqun and Dixon, Martin G and Jouppi, Norman P and Le, Quoc V and Li, Sheng
author
given	family
Jordan	Dotzel
Gang	Wu
Andrew	Li
Muhammad	Umar
Yun	Ni
Mohamed S	Abdelfattah
Zhiru	Zhang
Liqun	Cheng
Martin G	Dixon
Norman P	Jouppi
Quoc V	Le
Sheng	Li
date	2024-10-09
address
container-title	Proceedings of the Third International Conference on Automated Machine Learning
volume	256
genre	inproceedings
issued	date-parts: 2024, 10, 9
pdf	https://raw.githubusercontent.com/mlresearch/v256/main/assets/dotzel24a/dotzel24a.pdf
extras


2024-10-09-dotzel24a.md
