API Documentation

Introduction

Intel® Neural Compressor is an open-source Python library designed to help users quickly deploy low-precision inference solutions on popular deep learning (DL) frameworks such as TensorFlow*, PyTorch*, MXNet, and ONNX Runtime. It automatically optimizes low-precision recipes for deep learning models in order to achieve optimal product objectives, such as inference performance and memory usage, with expected accuracy criteria.

User-facing APIs

These APIs unify the low-precision quantization interface across multiple DL frameworks to provide the best out-of-the-box experience.

Note

Neural Compressor is continuously improving user-facing APIs to create a better user experience.

Two sets of user-facing APIs exist. One is the default set, supported since Neural Compressor v1.0 for backwards compatibility. The other consists of the new APIs in the neural_compressor.experimental package.

We recommend that you use the APIs located in neural_compressor.experimental. All examples have been updated to use the experimental APIs.

The major differences between the default user-facing APIs and the experimental APIs are:

  1. The experimental APIs abstract the neural_compressor.experimental.common.Model concept to cover cases in which the weight and graph files are stored separately.
  2. The experimental APIs unify the calling style of the Quantization, Pruning, and Benchmark classes by setting the model, calibration dataloader, evaluation dataloader, and metric through class attributes rather than passing them as function inputs.
  3. The experimental APIs refine Neural Compressor built-in transforms/datasets/metrics by unifying the APIs across different framework backends.

Experimental user-facing APIs

Experimental user-facing APIs consist of the following components:

Quantization-related APIs

# neural_compressor.experimental.Quantization
class Quantization(object):
    def __init__(self, conf_fname_or_obj):
        ...

    def __call__(self):
        ...

    @property
    def calib_dataloader(self):
        ...

    @property
    def eval_dataloader(self):
        ...

    @property
    def model(self):
        ...

    @property
    def metric(self):
        ...

    @postprocess.setter
    def postprocess(self, user_postprocess):
        ...

    @property
    def q_func(self):
        ...

    @property
    def eval_func(self):
        ...

The conf_fname_or_obj parameter used in the class initialization is the path to a user yaml configuration file or a Quantization_Conf object. The yaml file controls the entire tuning behavior on the model.

Neural Compressor User YAML Syntax

Intel® Neural Compressor provides template yaml files for Post-Training Quantization, Quantization-Aware Training, and Pruning scenarios. Refer to these template files to understand the meaning of each field.

Note that most fields in the yaml templates are optional. View the HelloWorld Yaml example for reference.
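For orientation, a minimal post-training quantization configuration might look like the following. This is a sketch based on the templates above; field names and values are illustrative and should be checked against the template files:

```yaml
model:                       # model name and framework are required
  name: my_model
  framework: tensorflow

quantization:                # optional calibration settings
  calibration:
    sampling_size: 100

tuning:
  accuracy_criterion:
    relative: 0.01           # tolerate up to 1% relative accuracy loss
  exit_policy:
    timeout: 0               # 0 means stop as soon as the criterion is met
  random_seed: 9527
```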

# Typical Launcher code
from neural_compressor.experimental import Quantization, common

# optional if Neural Compressor built-in dataset could be used as model input in yaml
class dataset(object):
  def __init__(self, *args):
      ...

  def __getitem__(self, idx):
      # return single sample and label tuple without collate. label should be 0 for label-free case
      ...

  def __len__(self):
      ...

# optional if Neural Compressor built-in metric could be used to do accuracy evaluation on model output in yaml
class custom_metric(object):
    def __init__(self):
        ...

    def update(self, predict, label):
        # metric update per mini-batch
        ...

    def result(self):
        # final metric calculation, invoked only once after all mini-batches are evaluated
        # return a scalar to neural_compressor for accuracy-driven tuning.
        # by default the scalar is higher-is-better. if not, set tuning.accuracy_criterion.higher_is_better to false in yaml.
        ...

quantizer = Quantization('conf.yaml')
quantizer.model = '/path/to/model'
# below two lines are optional if Neural Compressor built-in dataset is used as model calibration input in yaml
cal_dl = dataset('/path/to/calibration/dataset')
quantizer.calib_dataloader = common.DataLoader(cal_dl, batch_size=32)
# below two lines are optional if Neural Compressor built-in dataset is used as model evaluation input in yaml
dl = dataset('/path/to/evaluation/dataset')
quantizer.eval_dataloader = common.DataLoader(dl, batch_size=32)
# optional if Neural Compressor built-in metric could be used to do accuracy evaluation in yaml
quantizer.metric = common.Metric(custom_metric) 
q_model = quantizer.fit()
q_model.save('/path/to/output/dir') 

The model attribute in the Quantization class is an abstraction of model formats across different frameworks. Neural Compressor supports passing the path of a Keras model, frozen pb, checkpoint, saved model, torch.nn.Module, mxnet.symbol.Symbol, gluon.HybridBlock, or ONNX model; the object assigned to quantizer.model is wrapped into a neural_compressor.experimental.common.Model instance internally.

The calib_dataloader and eval_dataloader attributes in the Quantization class are used to set up calibration and evaluation dataloaders in code. They are optional if the user sets the corresponding fields in yaml.

The metric attribute in the Quantization class is used to set up a custom metric in code. It is optional if a Neural Compressor built-in metric works with the model output and the corresponding fields are set in yaml.
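As an illustration, the custom_metric skeleton in the launcher code above could be filled in as a plain-Python top-1 accuracy metric. The class name is illustrative and the sketch has no framework dependency; it would be registered with quantizer.metric = common.Metric(Top1Accuracy):

```python
class Top1Accuracy:
    """Minimal top-1 accuracy metric following the update()/result() protocol."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, predict, label):
        # predict: per-sample class-score lists; label: per-sample class ids
        for scores, truth in zip(predict, label):
            pred = max(range(len(scores)), key=scores.__getitem__)
            self.correct += int(pred == truth)
            self.total += 1

    def result(self):
        # higher-is-better scalar consumed by the accuracy-driven tuner
        return self.correct / self.total if self.total else 0.0
```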

The postprocess attribute in the Quantization class is not needed in most use cases. It is only needed when the user wants to use a built-in metric but the model output cannot be handled directly by that metric. In that case, the user can register a transform that converts the model output into the form the built-in metric expects.
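For example, a label-based built-in metric cannot consume raw logits directly; a small transform (hypothetical class name, plain Python) could convert them to class ids first:

```python
class LogitsToLabels:
    """Postprocess transform: map raw logits to predicted class ids."""

    def __call__(self, sample):
        # sample is a (model_output, label) pair; the label passes through
        logits, label = sample
        preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
        return preds, label
```

It would be registered via the postprocess attribute; the exact wrapper (e.g. common.Postprocess) should be checked against the library documentation.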

The q_func attribute in the Quantization class is only for the quantization-aware training case, in which the user registers a function that takes the model as its input parameter and executes the entire training process with self-contained training hyper-parameters.
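The contract for such a function is simply "take the model, run the whole training loop yourself." The toy stand-in below uses a dict-based model and hand-written gradient descent purely to show that shape; a real q_func would call the framework's training loop instead:

```python
def training_func(model):
    """Self-contained training entry point of the kind handed to quantizer.q_func."""
    learning_rate = 0.1                   # hyper-parameters live inside the function
    data = [(1.0, 2.0), (2.0, 4.0)]       # toy (input, target) pairs; target = 2 * input
    for _ in range(50):
        for x, y in data:
            pred = model["w"] * x
            grad = 2 * (pred - y) * x     # d/dw of squared error
            model["w"] -= learning_rate * grad
    return model
```

With the real API the registration is just quantizer.q_func = training_func.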

The eval_func attribute in the Quantization class is reserved for special cases. If the user already has an evaluation function from training a model, they must implement a calib_dataloader and leave eval_dataloader as None, then modify the evaluation function to take the model as its input parameter and return a higher-is-better scalar. In some scenarios this can reduce development effort.
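The required shape is again minimal: one model in, one higher-is-better scalar out. A toy stand-in (the baked-in dataset and the dict-based model are assumptions for illustration; a real eval_func would reuse the user's existing evaluation code):

```python
def eval_func(model):
    """Return a single higher-is-better scalar for accuracy-driven tuning."""
    data = [(1.0, 2.0), (3.0, 6.0)]       # toy held-out (input, target) pairs
    errors = [abs(model["w"] * x - y) for x, y in data]
    mean_err = sum(errors) / len(errors)
    return 1.0 / (1.0 + mean_err)         # invert so that higher means better
```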

Pruning-related APIs (POC)

class Pruning(object):
    def __init__(self, conf_fname_or_obj):
        ...

    def on_epoch_begin(self, epoch):
        ...

    def on_batch_begin(self, batch_id):
        ...

    def on_batch_end(self):
        ...

    def on_epoch_end(self):
        ...

    def __call__(self):
        ...

    @property
    def model(self):
        ...

    @property
    def q_func(self):
        ...

This API is used to do sparsity pruning. Currently, it is a Proof of Concept; Neural Compressor only supports magnitude pruning on PyTorch.

To learn how to use this API, refer to the pruning document.
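The hook methods above are meant to be invoked from the user's own training loop. The toy stand-in below (not the library class) records the order in which a typical loop would fire them:

```python
class PruningHooksDemo:
    """Stand-in for the Pruning hooks; records the call order."""

    def __init__(self):
        self.calls = []

    def on_epoch_begin(self, epoch):
        self.calls.append(f"epoch_begin:{epoch}")

    def on_batch_begin(self, batch_id):
        self.calls.append(f"batch_begin:{batch_id}")

    def on_batch_end(self):
        self.calls.append("batch_end")

    def on_epoch_end(self):
        self.calls.append("epoch_end")


def train(pruner, epochs=1, batches=2):
    # Typical interleaving: hooks wrap each epoch and each mini-batch so the
    # pruner can update sparsity masks at the right moments.
    for epoch in range(epochs):
        pruner.on_epoch_begin(epoch)
        for batch_id in range(batches):
            pruner.on_batch_begin(batch_id)
            # ... forward/backward/optimizer step would go here ...
            pruner.on_batch_end()
        pruner.on_epoch_end()
```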

Benchmarking-related APIs

class Benchmark(object):
    def __init__(self, conf_fname_or_obj):
        ...

    def __call__(self):
        ...

    @property
    def model(self):
        ...

    @property
    def metric(self):
        ...

    @property
    def b_dataloader(self):
        ...

    @postprocess.setter
    def postprocess(self, user_postprocess):
        ...

This API is used to measure model performance and accuracy.

To learn how to use this API, refer to the benchmarking document.
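What the performance side of benchmarking measures can be illustrated with a plain-Python timing loop. This is a stand-in, not the library API: it reports average latency and throughput of a model callable over a number of iterations:

```python
import time

def measure(model, sample, iterations=100, warmup=10):
    """Time repeated inference calls; return (latency_sec, throughput_per_sec)."""
    for _ in range(warmup):               # warm-up runs are excluded from timing
        model(sample)
    start = time.perf_counter()
    for _ in range(iterations):
        model(sample)
    elapsed = time.perf_counter() - start
    latency = elapsed / iterations        # seconds per inference
    throughput = iterations / elapsed     # inferences per second
    return latency, throughput
```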

Default user-facing APIs

The default user-facing APIs exist for backwards compatibility from the v1.0 release. Refer to v1.1 API to understand how the default user-facing APIs work.

View the HelloWorld example that uses default user-facing APIs for user reference.

Full examples using default user-facing APIs can be found here.