Intel® Neural Compressor is an open-source Python library designed to help users quickly deploy low-precision inference solutions on popular deep learning (DL) frameworks such as TensorFlow*, PyTorch*, MXNet, and ONNX Runtime. It automatically optimizes low-precision recipes for deep learning models to achieve optimal product objectives, such as inference performance and memory usage, while meeting expected accuracy criteria.
These APIs are intended to unify low-precision quantization interfaces across multiple DL frameworks for the best out-of-the-box experience.
Note
Neural Compressor is continuously improving user-facing APIs to create a better user experience.
Two sets of user-facing APIs exist. One is the default set, supported since Neural Compressor v1.0 for backwards compatibility. The other consists of new APIs in the neural_compressor.experimental package.
We recommend that you use the APIs located in neural_compressor.experimental. All examples have been updated to use the experimental APIs.
The major differences between the default user-facing APIs and the experimental APIs are:
- The experimental APIs abstract the neural_compressor.experimental.common.Model concept to cover cases in which weight and graph files are stored separately.
- The experimental APIs unify the calling style of the Quantization, Pruning, and Benchmark classes by setting the model, calibration dataloader, evaluation dataloader, and metric through class attributes rather than passing them as function inputs.
- The experimental APIs refine Neural Compressor built-in transforms/datasets/metrics by unifying the APIs across different framework backends.
Experimental user-facing APIs consist of the following components:
# neural_compressor.experimental.Quantization
class Quantization(object):
    def __init__(self, conf_fname_or_obj):
        ...

    def __call__(self):
        ...

    @property
    def calib_dataloader(self):
        ...

    @property
    def eval_dataloader(self):
        ...

    @property
    def model(self):
        ...

    @property
    def metric(self):
        ...

    @property
    def postprocess(self, user_postprocess):
        ...

    @property
    def q_func(self):
        ...

    @property
    def eval_func(self):
        ...
The conf_fname_or_obj parameter used in the class initialization is the path to the user yaml configuration file or a Quantization_Conf object. This yaml file controls the entire tuning behavior on the model.
Neural Compressor User YAML Syntax
Intel® Neural Compressor provides template yaml files for Post-Training Quantization, Quantization-Aware Training, and Pruning scenarios. Refer to these template files to understand the meaning of each field.
Note that most fields in the yaml templates are optional. View the HelloWorld Yaml example for reference.
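As a hedged illustration of the overall shape (the field names follow the Post-Training Quantization template; the model name, framework, sampling size, and tolerance values below are placeholders, not recommendations), a minimal yaml might look like:

```yaml
model:                           # model name and framework backend
  name: my_model                 # placeholder name
  framework: tensorflow

quantization:                    # optional: calibration settings
  calibration:
    sampling_size: 100           # placeholder number of calibration samples

tuning:
  accuracy_criterion:
    relative: 0.01               # tolerate up to 1% relative accuracy loss
  exit_policy:
    timeout: 0                   # 0: stop as soon as the criterion is met
```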
# Typical Launcher code
from neural_compressor.experimental import Quantization, common

# optional if a Neural Compressor built-in dataset can be used as model input in yaml
class dataset(object):
    def __init__(self, *args):
        ...

    def __getitem__(self, idx):
        # return a single (sample, label) tuple without collate. label should be 0 for the label-free case
        ...

    def __len__(self):
        ...

# optional if a Neural Compressor built-in metric can be used for accuracy evaluation on model output in yaml
class custom_metric(object):
    def __init__(self):
        ...

    def update(self, predict, label):
        # metric update per mini-batch
        ...

    def result(self):
        # final metric calculation invoked only once after all mini-batches are evaluated
        # return a scalar to neural_compressor for accuracy-driven tuning.
        # by default the scalar is higher-is-better. if not, set tuning.accuracy_criterion.higher_is_better to false in yaml.
        ...

quantizer = Quantization('conf.yaml')
quantizer.model = '/path/to/model'
# the below two lines are optional if a Neural Compressor built-in dataset is used as model calibration input in yaml
cal_dl = dataset('/path/to/calibration/dataset')
quantizer.calib_dataloader = common.DataLoader(cal_dl, batch_size=32)
# the below two lines are optional if a Neural Compressor built-in dataset is used as model evaluation input in yaml
dl = dataset('/path/to/evaluation/dataset')
quantizer.eval_dataloader = common.DataLoader(dl, batch_size=32)
# optional if a Neural Compressor built-in metric can be used for accuracy evaluation in yaml
quantizer.metric = common.Metric(custom_metric)
q_model = quantizer.fit()
q_model.save('/path/to/output/dir')
The model attribute in the Quantization class is an abstraction of model formats across different frameworks. Neural Compressor supports passing the path of a keras model, frozen pb, checkpoint, saved model, torch.nn.Module, mxnet.symbol.Symbol, gluon.HybridBlock, or onnx model to instantiate a neural_compressor.experimental.common.Model class and set it to quantizer.model.
The calib_dataloader and eval_dataloader attributes in the Quantization class are used to set up calibration and evaluation dataloaders in code. Setting them is optional if the user sets the corresponding fields in yaml.
The metric attribute in the Quantization class is used to set up a custom metric in code. Setting it is optional if the user finds that a Neural Compressor built-in metric works with their model and sets the corresponding fields in yaml.
The postprocess attribute in the Quantization class is not needed in most use cases. It is only needed when the user wants to use a built-in metric but the model output cannot be directly handled by Neural Compressor built-in metrics. In this case, the user can register a transform to convert the model output into the form expected by the built-in metric.
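As a hedged sketch of such a transform (the class name and the argmax convention are illustrative, not part of the library), a postprocess might convert raw logits into class indices before a built-in accuracy metric consumes them:

```python
import numpy as np

class ArgmaxPostprocess(object):
    """Hypothetical postprocess transform: maps raw model logits to class indices."""
    def __call__(self, sample):
        # sample is a (model_output, label) pair; only the output is transformed
        logits, labels = sample
        preds = np.argmax(logits, axis=-1)
        return preds, labels

# Registration sketch, assuming common.Postprocess wraps a user transform class:
# quantizer.postprocess = common.Postprocess(ArgmaxPostprocess)
```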
The q_func attribute in the Quantization class is only for the Quantization-Aware Training case, in which the user needs to register a function that takes model as the input parameter and executes the entire training process with self-contained training hyper-parameters.
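As a hedged sketch of that contract only (the toy dict "model" and hand-rolled gradient step stand in for a real framework model and training loop), a q_func receives the model, trains it with its own self-contained hyper-parameters, and returns it:

```python
def q_func(model):
    # All hyper-parameters live inside the function; Neural Compressor passes
    # only the model and expects the trained model back.
    epochs, lr = 2, 0.1
    # toy (input, target) pairs standing in for a real training dataloader
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    for _ in range(epochs):
        for x, y in data:
            pred = model["w"] * x
            grad = 2 * (pred - y) * x      # gradient of squared error w.r.t. w
            model["w"] -= lr * grad
    return model

trained = q_func({"w": 0.0})               # w moves toward the true value 2.0
# Registration sketch: quantizer.q_func = q_func
```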
The eval_func attribute in the Quantization class is reserved for special cases. If the user already has an evaluation function from training a model, the user must implement a calib_dataloader, leave eval_dataloader as None, and then modify this evaluation function to take model as the input parameter and return a higher-is-better scalar. In some scenarios, this can reduce development effort.
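As a hedged sketch (the toy model and the evaluation data below are illustrative), an eval_func takes the model and returns a single higher-is-better scalar such as accuracy:

```python
def eval_func(model):
    # hypothetical evaluation data: (input, expected_label) pairs
    samples = [(1.0, 1), (2.0, 0), (0.5, 1)]
    correct = sum(int(model(x) == y) for x, y in samples)
    return correct / len(samples)          # higher-is-better scalar

# toy stand-in model for illustration: classifies inputs below 1.5 as class 1
toy_model = lambda x: 1 if x < 1.5 else 0
accuracy = eval_func(toy_model)
# Registration sketch: quantizer.eval_func = eval_func
```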
class Pruning(object):
    def __init__(self, conf_fname_or_obj):
        ...

    def on_epoch_begin(self, epoch):
        ...

    def on_batch_begin(self, batch_id):
        ...

    def on_batch_end(self):
        ...

    def on_epoch_end(self):
        ...

    def __call__(self):
        ...

    @property
    def model(self):
        ...

    @property
    def q_func(self):
        ...
This API is used to do sparsity pruning. Currently, it is a Proof of Concept; Neural Compressor only supports magnitude pruning
on PyTorch.
To learn how to use this API, refer to the pruning document.
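As a hedged sketch of where the hooks fire inside a user training loop (a stub stands in for neural_compressor.experimental.Pruning so the call pattern can be shown standalone; the real class exposes the same hook names):

```python
class PruningStub(object):
    """Stand-in recording the hook call order of the real Pruning class."""
    def __init__(self):
        self.events = []
    def on_epoch_begin(self, epoch):
        self.events.append(("epoch_begin", epoch))
    def on_batch_begin(self, batch_id):
        self.events.append(("batch_begin", batch_id))
    def on_batch_end(self):
        self.events.append(("batch_end",))
    def on_epoch_end(self):
        self.events.append(("epoch_end",))

prune = PruningStub()
for epoch in range(2):                     # toy loop: 2 epochs, 3 batches each
    prune.on_epoch_begin(epoch)
    for batch_id in range(3):
        prune.on_batch_begin(batch_id)
        # ... one training step on the batch would run here ...
        prune.on_batch_end()
    prune.on_epoch_end()
```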
class Benchmark(object):
    def __init__(self, conf_fname_or_obj):
        ...

    def __call__(self):
        ...

    @property
    def model(self):
        ...

    @property
    def metric(self):
        ...

    @property
    def b_dataloader(self):
        ...

    @property
    def postprocess(self, user_postprocess):
        ...
This API is used to measure model performance and accuracy.
To learn how to use this API, refer to the benchmarking document.
The default user-facing APIs exist for backwards compatibility from the v1.0 release. Refer to v1.1 API to understand how the default user-facing APIs work.
For reference, view the HelloWorld example that uses the default user-facing APIs.
Full examples using default user-facing APIs can be found here.