Step-by-Step

This example loads a BERT model and confirms its accuracy and speed on GLUE data.

Prerequisite

1. Environment

pip install neural-compressor
pip install -r requirements.txt

Note: Use a validated ONNX Runtime version.

2. Prepare Dataset

Download the GLUE data with the prepare_data.sh script.

export GLUE_DIR=path/to/glue_data
export TASK_NAME=MRPC

bash prepare_data.sh --data_dir=$GLUE_DIR --task_name=$TASK_NAME
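Before moving on, it can help to confirm that the two environment variables are set and to look at what the script produced. The following is a minimal sketch, not part of this example's scripts: the helper name `glue_task_dir` is hypothetical, and the `$GLUE_DIR/$TASK_NAME` directory layout is an assumption about where `prepare_data.sh` writes its output.

```python
import os
from pathlib import Path


def glue_task_dir(environ=None):
    """Build the task directory path from GLUE_DIR and TASK_NAME.

    Assumption (not confirmed by this README): the data for a task
    lands in $GLUE_DIR/$TASK_NAME.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in ("GLUE_DIR", "TASK_NAME") if not env.get(name)]
    if missing:
        raise ValueError("unset environment variable(s): " + ", ".join(missing))
    return Path(env["GLUE_DIR"]) / env["TASK_NAME"]


if __name__ == "__main__":
    try:
        task_dir = glue_task_dir()
    except ValueError as err:
        print(err)
    else:
        if task_dir.is_dir():
            # List whatever files are present instead of assuming their names.
            for path in sorted(task_dir.iterdir()):
                print(path.name)
        else:
            print(f"{task_dir} not found; run prepare_data.sh first")
```

Listing the directory rather than checking for specific file names keeps the sketch independent of how a given GLUE task names its splits.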

3. Prepare Model

python prepare_model.py --input_model='MRPC.zip' --output_model='bert.onnx'

Run

Diagnosis

Neural Compressor offers quantization and benchmark diagnosis. Adding the diagnosis parameter to the quantization or benchmark config provides additional details useful for diagnostics.

Benchmark diagnosis

from neural_compressor.config import BenchmarkConfig

config = BenchmarkConfig(
    diagnosis=True,
    ...
)
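The same parameter applies on the quantization side. A minimal sketch, assuming the installed Neural Compressor release exposes `PostTrainingQuantConfig` in `neural_compressor.config` and that it accepts the `diagnosis` argument; every other setting is left at its default here and would need to be filled in for a real run:

```python
from neural_compressor.config import PostTrainingQuantConfig

# Assumption: this release accepts `diagnosis`; all other options keep their defaults.
config = PostTrainingQuantConfig(diagnosis=True)
```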

1. Quantization

Static quantization with QOperator format:

# --input_model is a model path as *.onnx
bash run_quant.sh --input_model=path/to/model \
                   --output_model=path/to/model_tune \
                   --dataset_location=path/to/glue_data \
                   --quant_format="QOperator"

Static quantization with QDQ format:

# --input_model and --output_model are model paths as *.onnx
bash run_quant.sh --input_model=path/to/model \
                   --output_model=path/to/model_tune \
                   --dataset_location=path/to/glue_data \
                   --quant_format="QDQ"
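The two invocations above differ only in `--quant_format`, so producing both variants can be scripted. The following is a minimal sketch, not part of this example's scripts: the helper name `build_quant_command` and the output file names are hypothetical, and only the flags are taken from the commands above. It builds the argument lists without executing anything.

```python
import shlex

# Formats passed to --quant_format in the commands above.
QUANT_FORMATS = ("QOperator", "QDQ")


def build_quant_command(input_model, output_model, dataset_location, quant_format):
    """Return the run_quant.sh invocation as an argument list (hypothetical helper)."""
    if quant_format not in QUANT_FORMATS:
        raise ValueError(f"unsupported quant_format: {quant_format!r}")
    return [
        "bash",
        "run_quant.sh",
        f"--input_model={input_model}",
        f"--output_model={output_model}",
        f"--dataset_location={dataset_location}",
        f"--quant_format={quant_format}",
    ]


if __name__ == "__main__":
    for fmt in QUANT_FORMATS:
        cmd = build_quant_command(
            "bert.onnx",                  # model produced by prepare_model.py
            f"bert_{fmt.lower()}.onnx",   # hypothetical output name
            "path/to/glue_data",
            fmt,
        )
        # Print a shell-quoted command line for inspection.
        print(shlex.join(cmd))
```

To actually run a command, the returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains `run_quant.sh`.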

2. Benchmark

# --input_model is a model path as *.onnx
bash run_benchmark.sh --input_model=path/to/model \
                      --dataset_location=path/to/glue_data \
                      --batch_size=batch_size \
                      --mode=performance # or accuracy