Tensorflow BERT Samples

Using bert_transformer Tensorflow op in a transformer encoder

The trunk network of the BERT model consists of a multi-layer transformer encoder, which is implemented as the transformer_model() function in modeling.py in the official BERT GitHub repository. The samples provided in fast_infer_util.py show how to re-implement this function with our ops in order to get an inference speedup.

The function fast_transformer_model_trans() implements the transformer encoder using the bert_transformer op. To do that, we only need to import the op at the beginning of the file and then call the bert_transformer op at the end of each encoder layer. This can be done by adding several lines of code to the original transformer_model() function, as shown below.

# import op
transformer_op_module = tf.load_op_library(os.path.join('../../build/lib/libtf_fastertransformer.so'))
...
def fast_transformer_model_trans(...):
    ...
      # original code
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)

      # gather this layer's trainable weights (attention and feed-forward
      # parameters) and pass them to the fused bert_transformer op
      trainable_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=tf.get_variable_scope().name)
      layer_output = transformer_op_module.bert_transformer(
        layer_input,
        layer_input,
        trainable_vars[0], trainable_vars[2], trainable_vars[4], trainable_vars[1], trainable_vars[3], trainable_vars[5], 
        attention_mask,
        trainable_vars[6], trainable_vars[7], trainable_vars[8], trainable_vars[9], trainable_vars[10], trainable_vars[11],
        trainable_vars[12], trainable_vars[13], trainable_vars[14], trainable_vars[15],
        batch_size=batch_size, from_seq_len=seq_length, to_seq_len=seq_length, head_num=num_attention_heads, size_per_head=attention_head_size)
      
      # original code
      prev_output = layer_output
      all_layer_outputs.append(layer_output)
    ...

Running GLUE tasks with fast transformer inference

The above shows how to implement a transformer encoder using our ops. To integrate it into the BERT pipeline, we can simply replace the transformer_model function in modeling.py with fast_transformer_model_trans.

Our implementation supports the FP16 data type to further exploit the potential of inference acceleration. FP16 inference is not supported by the original BERT code, so we made the necessary modifications to build an FP16-compatible model; these changes are implemented in my_modeling.py and the create_model function in fast_infer_util.py.
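
Conceptually, FP16 compatibility amounts to deriving the compute dtype from the floatx option instead of hard-coding tf.float32. The following is a minimal hypothetical sketch of that idea; the function names are assumptions, not the actual my_modeling.py code.

import tensorflow as tf

def get_floatx(floatx):
    # map the --floatx option to a Tensorflow dtype instead of hard-coding tf.float32
    return tf.float16 if floatx == "float16" else tf.float32

def embedding_output(input_ids, embedding_table, floatx="float16"):
    # look up embeddings as usual, then cast the activations so that all
    # downstream layers compute in the requested precision
    output = tf.nn.embedding_lookup(embedding_table, input_ids)
    return tf.cast(output, get_floatx(floatx))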

FP32 Tensorflow checkpoint files cannot be used directly for FP16 inference, so their data types must be converted to FP16 in advance. The ckpt_type_convert.py script is provided for checkpoint data type conversion.
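
The conversion itself amounts to reading each FP32 variable from the checkpoint, casting it to FP16, and saving the result as a new checkpoint. Below is a simplified, hypothetical sketch of that idea; ckpt_type_convert.py is the provided reference implementation, and the function name here is an assumption.

import numpy as np
import tensorflow as tf

def convert_checkpoint_to_fp16(src_ckpt, dst_ckpt):
    # read every variable stored in the FP32 checkpoint
    reader = tf.train.load_checkpoint(src_ckpt)
    with tf.Session() as sess:
        new_vars = []
        for name in reader.get_variable_to_shape_map():
            tensor = reader.get_tensor(name)
            if tensor.dtype == np.float32:
                tensor = tensor.astype(np.float16)  # cast weights to half precision
            new_vars.append(tf.Variable(tensor, name=name))
        sess.run(tf.global_variables_initializer())
        # write the converted variables into a new checkpoint
        tf.train.Saver(new_vars).save(sess, dst_ckpt)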

It is also important to note that our implementation requires a fixed batch size. This can be achieved by setting the drop_remainder option to True for Tensorflow Dataset instances; we have re-implemented this in the file_based_input_fn_builder_drop function.
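
The essential change is a single drop_remainder argument on the batching call. Here is a minimal hypothetical sketch; the actual file_based_input_fn_builder_drop follows the original BERT input function more closely, and the names below are simplified assumptions.

import tensorflow as tf

def input_fn_builder_drop(input_file, name_to_features):
    def input_fn(params):
        batch_size = params["batch_size"]  # the Estimator passes the batch size via params
        d = tf.data.TFRecordDataset(input_file)
        d = d.map(lambda record: tf.parse_single_example(record, name_to_features))
        # drop_remainder=True discards the final partial batch, so every batch
        # fed to the fused op has exactly batch_size examples
        return d.batch(batch_size, drop_remainder=True)
    return input_fn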

On top of the above modifications, it is easy to run any of the GLUE tasks supported by the open-source BERT sample with our ops for better inference performance. We only need to replace several functions in the original run_classifier.py script with the implementations we provide.

import run_classifier as rc
import fast_infer_util as fiu
import my_modeling

...

# replace transformer implementation
my_modeling.transformer_model = fiu.fast_transformer_model_trans
# replace the model to support fp16 data type
rc.create_model = fiu.create_model
# replace the input function to drop remainder
rc.file_based_input_fn_builder = fiu.file_based_input_fn_builder_drop
...

The sample run_classifier_wrap.py is a wrapper of the original run_classifier.py script for BERT. It supports the same options as described in the BERT README, plus an additional floatx option to specify the floating-point type. For example, to compare the performance of the original BERT and our implementation on the MRPC task, we can run the following commands.

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py   --task_name=MRPC   --do_eval=true   --data_dir=$GLUE_DIR/MRPC   --vocab_file=$BERT_BASE_DIR/vocab.txt   --bert_config_file=$BERT_BASE_DIR/bert_config.json   --init_checkpoint=ckpt_dir/fp32_model.ckpt   --max_seq_length=128   --eval_batch_size=8   --output_dir=mrpc_output

python run_classifier_wrap.py   --task_name=MRPC   --do_eval=true   --data_dir=$GLUE_DIR/MRPC   --vocab_file=$BERT_BASE_DIR/vocab.txt   --bert_config_file=$BERT_BASE_DIR/bert_config.json   --init_checkpoint=ckpt_dir/fp16_model.ckpt   --max_seq_length=128   --eval_batch_size=8   --output_dir=mrpc_output   --floatx=float16   

The evaluation results should look like the following.

# original Tensorflow op
...
INFO:tensorflow:***** Eval results *****
INFO:tensorflow:  eval_accuracy = 0.877451
INFO:tensorflow:  eval_loss = 0.44744828
INFO:tensorflow:  global_step = 0
INFO:tensorflow:  loss = 0.44744828

# faster_transformer op with fp16 data type
INFO:tensorflow:***** Eval results *****
INFO:tensorflow:  eval_accuracy = 0.875
INFO:tensorflow:  eval_loss = 0.44731623
INFO:tensorflow:  global_step = 0
INFO:tensorflow:  loss = 0.44728807
...

We see that the evaluation accuracy and loss drop slightly with FP16 inference on the MRPC sentence-pair classification task. The following section shows that this minor sacrifice in accuracy brings a considerable performance gain.

Tensorflow profiling

The sample script profile_transformer_inference.py shows how to run and profile a BERT inference model from scratch. In our experiment settings, we measured a 6.36x speedup over FP32 Tensorflow (and a 1.48x speedup over FP16 Tensorflow XLA) for an end-to-end classification model.

GPU: Tesla T4
CUDA: 10.0.0
Model: BERT-Base: 12-layer, 768-hidden, 12-heads, 110M parameters
Max sequence length: 128
Batch size: 32
Average time elapsed:

| Settings | Seconds |
| --- | --- |
| FP32 Tensorflow | 0.2495 |
| FP32 Tensorflow XLA | 0.1998 |
| FP16 Tensorflow | 0.0978 |
| FP16 Tensorflow XLA | 0.0582 |
| FP16 FasterTransformer | 0.0392 |
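
For reference, average latencies like those in the table above are typically collected by warming up the model and then timing repeated session runs. The following is a minimal hypothetical sketch of such a measurement; the function name is an assumption, and the sample's profile_util.py contains the helper functions actually used by the profiling scripts.

import time

def measure_average_time(sess, output_tensor, feed_dict, warmup=10, iterations=100):
    for _ in range(warmup):
        sess.run(output_tensor, feed_dict=feed_dict)  # warm-up runs are excluded from the average
    start = time.time()
    for _ in range(iterations):
        sess.run(output_tensor, feed_dict=feed_dict)
    return (time.time() - start) / iterations  # average seconds per run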

Content summary

| File name | Summary |
| --- | --- |
| ckpt_type_convert.py | script for checkpoint data type conversion |
| fast_infer_util.py | example functions for using the faster transformer ops in Tensorflow |
| my_modeling.py | basically the same as modeling.py in the original BERT repository, with modifications to support FP16 data types |
| run_classifier_wrap.py | a wrapper of run_classifier.py in the original BERT repository; shows how to run classification tasks using the faster transformer ops |
| profile_bert_inference.py | for profiling BERT model pipelines |
| profile_transformer_inference.py | for profiling transformer encoder layers |
| profile_util.py | helper functions for profiling |