This example loads a BERT model and confirms its accuracy and speed on GLUE data.
pip install neural-compressor
pip install -r requirements.txt
Note: make sure to use a validated ONNX Runtime version.
Download the GLUE data with the prepare_data.sh script.
export GLUE_DIR=path/to/glue_data
export TASK_NAME=MRPC
bash prepare_data.sh --data_dir=$GLUE_DIR --task_name=$TASK_NAME
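The downloaded GLUE files are tab-separated; for MRPC each row carries a quality label and a sentence pair. As a minimal parsing sketch (the two sample rows below are made up for illustration; the column names follow the standard MRPC header):

```python
import csv
import io

# Hypothetical two-row sample in the MRPC TSV layout (header + rows).
sample = (
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    "1\t100\t101\tThe cat sat.\tA cat was sitting.\n"
    "0\t102\t103\tIt rained.\tStocks rose.\n"
)

def load_mrpc(fh):
    """Read MRPC rows into label/sentence-pair dicts."""
    reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {"label": int(r["Quality"]), "s1": r["#1 String"], "s2": r["#2 String"]}
        for r in reader
    ]

pairs = load_mrpc(io.StringIO(sample))
print(len(pairs), pairs[0]["label"])  # → 2 1
```

In the real run you would open `$GLUE_DIR/MRPC/train.tsv` instead of the in-memory sample.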
python prepare_model.py --input_model='MRPC.zip' --output_model='bert.onnx'
Neural Compressor offers quantization and benchmark diagnosis. Adding the diagnosis
parameter to the Quantization/Benchmark config will provide additional details useful for diagnostics.
from neural_compressor.config import BenchmarkConfig

config = BenchmarkConfig(
    diagnosis=True,
    ...
)
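As a hedged sketch of how such a config plugs into a benchmark run (assuming the `neural_compressor.benchmark.fit` entry point and a user-supplied dataloader; `"bert.onnx"` and `eval_dataloader` are placeholders, not part of this example's scripts):

```python
from neural_compressor.config import BenchmarkConfig
from neural_compressor.benchmark import fit

# diagnosis=True asks Neural Compressor to collect extra per-op details
# alongside the usual throughput/latency numbers.
config = BenchmarkConfig(diagnosis=True)

# Placeholder model path and dataloader; substitute your own.
fit(model="bert.onnx", config=config, b_dataloader=eval_dataloader)
```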
Static quantization with QOperator format:
bash run_quant.sh --input_model=path/to/model \ # model path as *.onnx
--output_model=path/to/model_tune \
--dataset_location=path/to/glue_data \
--quant_format="QOperator"
Static quantization with QDQ format:
bash run_quant.sh --input_model=path/to/model \ # model path as *.onnx
--output_model=path/to/model_tune \ # model path as *.onnx
--dataset_location=path/to/glue_data \
--quant_format="QDQ"
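The two formats differ in how quantization is expressed in the exported graph: QOperator replaces float operators with fused quantized operators, while QDQ keeps the original operators and wraps them in QuantizeLinear/DequantizeLinear pairs. As a hedged sketch of the underlying Python API the scripts drive (`"bert.onnx"`, `calib_dataloader`, and `eval_func` are placeholders):

```python
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig

# quant_format="QDQ" inserts QuantizeLinear/DequantizeLinear pairs;
# "QOperator" would instead emit fused quantized operators.
config = PostTrainingQuantConfig(approach="static", quant_format="QDQ")

# Placeholder model path, calibration dataloader, and eval function.
q_model = quantization.fit(
    "bert.onnx",
    config,
    calib_dataloader=calib_dataloader,
    eval_func=eval_func,
)
q_model.save("bert_qdq.onnx")
```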
bash run_benchmark.sh --input_model=path/to/model \ # model path as *.onnx
--dataset_location=path/to/glue_data \
--batch_size=batch_size \
--mode=performance # or accuracy