This tutorial introduces CPU performance considerations for the deep learning model BERT Large for language modeling and demonstrates how to use Intel® Optimizations for TensorFlow to improve inference time on CPUs. It also provides code examples that use the Intel Model Zoo's pre-trained BERT model for a quick, off-the-ground implementation.
With BFloat16 (BF16) instructions and optimizations now in the Intel® Xeon® Scalable processor and Intel® Optimizations for TensorFlow, deep learning workload performance can benefit from a smaller data representation (16-bit instead of the traditional 32-bit floating point), often with little or no loss of accuracy. This is because the BF16 format halves the data size while keeping the same 8-bit exponent, and therefore the same dynamic range, as FP32; what it gives up is significand precision. For many machine and deep learning tasks, this is a favorable trade-off. For more technical details, see this article on lowering numerical precision to increase deep learning performance.
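To see the trade-off concretely, the short sketch below (not part of the Intel Model Zoo scripts; it only assumes a standard TensorFlow installation) round-trips a few FP32 values through BF16: very large and very small magnitudes survive because the exponent width is unchanged, while the low-order digits of the significand are rounded away.

```python
import tensorflow as tf

# Round-trip a few FP32 values through BF16. BF16 keeps FP32's 8 exponent bits,
# so tiny and huge magnitudes are preserved, but only 7 significand bits remain,
# so fine-grained digits are rounded off.
values = tf.constant([1e-30, 0.1234567, 3.1415926, 1e30], dtype=tf.float32)
roundtrip = tf.cast(tf.cast(values, tf.bfloat16), tf.float32)

for original, converted in zip(values.numpy(), roundtrip.numpy()):
    print(f"fp32: {original:.7e}  ->  bf16 round-trip: {converted:.7e}")
```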
BERT (Bidirectional Encoder Representations from Transformers) is a popular language modeling topology. Since its publication in October 2018, BERT has quickly become state-of-the-art for many Natural Language Processing (NLP) tasks, including question answering and next sentence prediction. The BERT Large variant has 340 million parameters and uses a masked language model (MLM) pre-training approach that allows a second training stage, called fine-tuning, to adapt the model to a wide variety of NLP tasks. To demonstrate BERT Large inference performance with BF16 precision, this tutorial uses the Intel Model Zoo's BERT Large pre-trained model, which has been fine-tuned for question answering on the SQuAD dataset. The tutorial concludes with FP32 inference for a comparison of performance and accuracy.
In addition to TensorFlow optimizations that use the Intel® oneAPI Deep Neural Network Library (Intel® oneDNN), the run-time settings also contribute significantly to performance. Tuning these options for CPU workloads is vital to getting the best TensorFlow performance on Intel® processors. Below is the set of run-time options tested empirically on BERT Large and recommended by Intel:
Run-time options | Recommendations
---|---
Batch Size | 32, regardless of the hardware
Hyperthreading | Enabled. Turn on in BIOS (requires a restart).
intra_op_parallelism_threads | # physical cores
inter_op_parallelism_threads | 1 or 2
NUMA Controls | --cpunodebind=0 --membind=0
KMP_AFFINITY | granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME | 1
KMP_SETTINGS | 1
OMP_NUM_THREADS | # physical cores - 1 or # physical cores - 2
Note 1: Refer to this link to learn more about the run-time options.
Note 2: You can remove verbose from the KMP_AFFINITY setting to avoid verbose output at runtime.
Run the following commands to get your processor information:
a. # physical cores per socket: lscpu | grep "Core(s) per socket" | cut -d':' -f2 | xargs
b. # all physical cores: lscpu -b -p=Core,Socket | grep -v '^#' | sort -u | wc -l
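If you would rather compute these values from Python (for example, to plug them straight into the settings below), here is a minimal sketch; it assumes a Linux system where lscpu is available, and the helper name is only for illustration:

```python
import subprocess

def _lscpu_field(name: str) -> int:
    """Parse an integer field such as 'Core(s) per socket' or 'Socket(s)' from lscpu."""
    out = subprocess.check_output(["lscpu"], text=True)
    for line in out.splitlines():
        if line.startswith(name + ":"):
            return int(line.split(":", 1)[1].strip())
    raise RuntimeError(f"{name} not found in lscpu output")

cores_per_socket = _lscpu_field("Core(s) per socket")
sockets = _lscpu_field("Socket(s)")
print("physical cores per socket:", cores_per_socket)
print("all physical cores:", cores_per_socket * sockets)
```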
Below is a code snippet you can incorporate into your existing TensorFlow application to apply the recommended settings. You can set them either at the command line or in the Python script. Note that the inter_op_parallelism_threads and intra_op_parallelism_threads settings can only be set in the Python script.
export OMP_NUM_THREADS=<# physical cores - 2>
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
Or, in the Python script:
import os
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
os.environ["OMP_NUM_THREADS"] = "<# physical cores - 2>"  # must be a string, e.g. "26"

import tensorflow as tf  # import TensorFlow after the environment variables are set
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(<# physical cores>)  # e.g. 28
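As an optional sanity check (not part of the tutorial scripts), you can read the threading configuration back through the same tf.config.threading API:

```python
import os
import tensorflow as tf

# Verify the settings applied above.
print("inter_op threads:", tf.config.threading.get_inter_op_parallelism_threads())
print("intra_op threads:", tf.config.threading.get_intra_op_parallelism_threads())
print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS"))
```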
This section shows how to measure and compare BF16 and FP32 inference performance on the Intel Model Zoo's pre-trained model (or your own pre-trained model) by setting the run-time flags discussed above.
Note: These steps are adapted from the BERT Large Inference README. Please check there for the most up-to-date information and links.
- Clone the IntelAI models repository into your home directory. Skip this step if you already have it cloned.
cd ~
git clone https://github.com/IntelAI/models.git
- Download and unzip the BERT Large uncased (whole word masking) model from the google bert repo. Then, download the dev-v1.1.json file from the google bert repo into the wwm_uncased_L-24_H-1024_A-16 directory that was just unzipped.
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -P wwm_uncased_L-24_H-1024_A-16
The wwm_uncased_L-24_H-1024_A-16 directory is what will be passed as the --data-location when running inference.
- Download and unzip the pre-trained model. The file is 3.4 GB, so the download will take some time.
wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_8/bert_large_checkpoints.zip
unzip bert_large_checkpoints.zip
This directory will be passed as the --checkpoint location when running inference.
- Install Docker, since the tutorial runs in a Docker container.
- Pull the relevant Intel-optimized TensorFlow Docker image. Click here to find all the available Docker images.
docker pull intel/intel-optimized-tensorflow:latest
- Navigate to the inference script directory in the local IntelAI repository.
cd ~/models/benchmarks
Run the Python script launch_benchmark.py with the pre-trained model. The launch_benchmark.py script can be treated as an entry point to conveniently perform out-of-the-box, high-performance inference on pre-trained models from the Intel Model Zoo. The script will automatically set the recommended run-time options for supported topologies; if you choose to set your own options, refer to the full list of available flags and a detailed explanation of launch_benchmark.py here.
This step automatically launches a new container on every run and terminates it afterward. Go to this optional step to run the script interactively inside the container.
- BF16 Batch Inference
Console in:
python launch_benchmark.py \
--model-name=bert_large \
--precision=bfloat16 \
--mode=inference \
--framework=tensorflow \
--batch-size=32 \
--data-location ~/wwm_uncased_L-24_H-1024_A-16 \
--checkpoint ~/bert_large_checkpoints \
--output-dir ~/output \
--benchmark-only \
--docker-image intel/intel-optimized-tensorflow:latest \
-- infer_option=SQuAD
Console out:
...
I0424 21:14:28.002666 140184442087232 run_squad.py:1365] Processed #examples: 960
INFO:tensorflow:prediction_loop marked as finished
Elapsed time: ...
throughput((num_processed_examples-threshod_examples)/Elapsedtime): ...
Ran inference with batch size 32
Log location outside container: /~/output/benchmark_bert_large_inference_bfloat16_20200424_210607.log
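The throughput line in this log is simply the number of processed examples, minus a warm-up threshold, divided by the elapsed time. A minimal sketch of that arithmetic with hypothetical stand-in values (the actual numbers are elided in the output above):

```python
# Hypothetical values standing in for the elided log output above.
num_processed_examples = 960   # from the "Processed #examples" line
threshold_examples = 100       # warm-up examples excluded from the measurement (assumed)
elapsed_time_seconds = 300.0   # assumed elapsed time

throughput = (num_processed_examples - threshold_examples) / elapsed_time_seconds
print(f"throughput: {throughput:.2f} examples/sec")
```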
- BF16 Accuracy
Console in:
python launch_benchmark.py \
--model-name=bert_large \
--precision=bfloat16 \
--mode=inference \
--framework=tensorflow \
--batch-size=32 \
--data-location ~/wwm_uncased_L-24_H-1024_A-16 \
--checkpoint ~/bert_large_checkpoints \
--output-dir ~/output \
--accuracy-only \
--docker-image intel/intel-optimized-tensorflow:latest \
-- infer_option=SQuAD
Console out:
INFO:tensorflow:Processing example: 10830
I0428 00:26:11.595798 140332503766848 run_squad.py:1370] Processing example: 10830
INFO:tensorflow:prediction_loop marked as finished
INFO:tensorflow:Writing predictions to: /~/output/predictions.json
I0428 00:26:11.794145 140332503766848 run_squad.py:804] Writing predictions to: /~/output/predictions.json
INFO:tensorflow:Writing nbest to: /~/output/nbest_predictions.json
I0428 00:26:11.794228 140332503766848 run_squad.py:805] Writing nbest to: /~/output/nbest_predictions.json
{"exact_match": ..., "f1": ...}
Ran inference with batch size 32
Log location outside container: /~/output/benchmark_bert_large_inference_bfloat16_20200427_224428.log
Output files and logs are saved to the --output-dir, or to the default location models/benchmarks/common/tensorflow/logs if no --output-dir is set.
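If you run several configurations, it can be handy to pull the key numbers out of those logs programmatically. The sketch below is only an illustration and assumes the log lines follow the formats shown in the console output above:

```python
import re
from pathlib import Path

def extract_metrics(log_path: str) -> dict:
    """Pull throughput and SQuAD accuracy numbers out of a launch_benchmark.py log."""
    text = Path(log_path).expanduser().read_text()
    metrics = {}
    match = re.search(r"throughput\([^)]*\):\s*([\d.]+)", text)
    if match:
        metrics["throughput"] = float(match.group(1))
    match = re.search(r'"exact_match":\s*([\d.]+),\s*"f1":\s*([\d.]+)', text)
    if match:
        metrics["exact_match"] = float(match.group(1))
        metrics["f1"] = float(match.group(2))
    return metrics

# Example with a hypothetical log file name:
# print(extract_metrics("~/output/benchmark_bert_large_inference_bfloat16_20200424_210607.log"))
```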
- FP32 Batch Inference
To see the FP32 batch inference performance, run the same command as above but change --precision=bfloat16 to --precision=fp32.
python launch_benchmark.py \
--model-name=bert_large \
--precision=fp32 \
--mode=inference \
--framework=tensorflow \
--batch-size=32 \
--data-location ~/wwm_uncased_L-24_H-1024_A-16 \
--checkpoint ~/bert_large_checkpoints \
--output-dir ~/output \
--benchmark-only \
--docker-image intel/intel-optimized-tensorflow:latest \
-- infer_option=SQuAD
- FP32 Accuracy
Similarly, to see the FP32 accuracy, run the above command but change --precision=bfloat16 to --precision=fp32.
python launch_benchmark.py \
--model-name=bert_large \
--precision=fp32 \
--mode=inference \
--framework=tensorflow \
--batch-size=32 \
--data-location ~/wwm_uncased_L-24_H-1024_A-16 \
--checkpoint ~/bert_large_checkpoints \
--output-dir ~/output \
--accuracy-only \
--docker-image intel/intel-optimized-tensorflow:latest \
-- infer_option=SQuAD
If you want to run launch_benchmark.py interactively from within the Docker container, add the --debug flag. This will launch a Docker container based on the --docker-image, perform the necessary installs, and run the launch_benchmark.py script, but it will not terminate the container process. As an example, this is how you would launch interactive BF16 batch inference for benchmarking:
Console in:
python launch_benchmark.py \
--model-name=bert_large \
--precision=bfloat16 \
--mode=inference \
--framework=tensorflow \
--batch-size=32 \
--data-location ~/wwm_uncased_L-24_H-1024_A-16 \
--checkpoint ~/bert_large_checkpoints \
--output-dir ~/output \
--benchmark-only \
--docker-image intel/intel-optimized-tensorflow:latest \
--debug \
-- infer_option=SQuAD
Console out:
root@c49f3442efb1:/workspace/benchmarks/common/tensorflow#
To rerun the benchmarking script, execute the start.sh bash script from your existing directory with the available flags, which in turn will run launch_benchmark.py.
For example, to run with a different batch size (e.g., batch size = 64), set BATCH_SIZE, and to skip reinstalling packages, pass True to NOINSTALL.
chmod +x ./start.sh
NOINSTALL=True BATCH_SIZE=64 ./start.sh
All other flags default to the values passed to the first launch_benchmark.py run that started the container. See here for the full list of flags.
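Once you have both sets of logs, the BF16-versus-FP32 comparison is simple arithmetic: divide the throughputs for the speedup and difference the accuracy scores. A minimal sketch with placeholder values to fill in from your own runs:

```python
# Fill these in from your BF16 and FP32 logs (placeholder values shown).
bf16 = {"throughput": 0.0, "exact_match": 0.0, "f1": 0.0}
fp32 = {"throughput": 0.0, "exact_match": 0.0, "f1": 0.0}

if fp32["throughput"]:
    print(f"BF16 speedup over FP32: {bf16['throughput'] / fp32['throughput']:.2f}x")
print(f"exact_match delta (BF16 - FP32): {bf16['exact_match'] - fp32['exact_match']:.3f}")
print(f"f1 delta (BF16 - FP32): {bf16['f1'] - fp32['f1']:.3f}")
```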