The Python (PyTorch) example runs end-to-end model inference with streaming output, using the tokenizer from transformers.
Please refer to Installation. This example can also run directly from source: you don't need to install xFasterTransformer with pip. Just build the xFasterTransformer library, and the example will search for it in the src directory.
Please refer to Prepare model.
- Please refer to Prepare Environment to install oneCCL.
- Python dependencies.
# requirements.txt is in the root directory.
pip install -r requirements.txt
# Preloading `` is recommended for better performance.
# The `` file will be in the `3rdparty/mklml/lib` directory after building xFasterTransformer.
python --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}
# Run with multiple ranks, for example:
OMP_NUM_THREADS=48 mpirun \
-n 1 numactl -N 0 -m 0 python --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH} : \
-n 1 numactl -N 1 -m 1 python --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}
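The multi-rank command above follows a regular pattern: one `-n 1` block per rank, each pinned to its own NUMA node with `numactl`, with blocks separated by `:`. As a rough sketch of how that pattern extends to more ranks, the helper below assembles the argument list. The function name `build_multi_rank_command` and the placeholder script name are hypothetical and not part of xFasterTransformer; it also assumes one rank per NUMA node, numbered from 0, and that `OMP_NUM_THREADS` is set separately in the environment.

```python
import shlex


def build_multi_rank_command(script, num_ranks, dtype, token_path, model_path):
    """Return an mpirun argument list with one rank pinned to each NUMA node.

    Hypothetical helper: it only mirrors the command pattern shown above.
    """
    args = ["mpirun"]
    for node in range(num_ranks):
        if node > 0:
            # mpirun separates per-rank program blocks with ':'.
            args.append(":")
        args += [
            "-n", "1",
            # Bind both CPU (-N) and memory (-m) to the same NUMA node.
            "numactl", "-N", str(node), "-m", str(node),
            "python", script,
            f"--dtype={dtype}",
            f"--token_path={token_path}",
            f"--model_path={model_path}",
        ]
    return args


if __name__ == "__main__":
    # "your_script.py" and both paths are placeholders, not real file names.
    cmd = build_multi_rank_command(
        "your_script.py", 2, "bf16", "/path/to/tokenizer", "/path/to/model"
    )
    print(shlex.join(cmd))
```

Printing the joined command rather than executing it makes it easy to check the NUMA bindings before launching.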
More parameter options:
- Show help message and exit.
- `-t`: Path to tokenizer directory.
- `-m`: Path to model directory.
- `-d`: Data type, defaults to fp16. Supports {fp16, bf16, int8, w8a8, int4, nf4, bf16_fp16, bf16_int8, bf16_w8a8, bf16_int4, bf16_nf4, w8a8_int8, w8a8_int4, w8a8_nf4}.
- Streaming output, defaults to True.
- `--num_beams`: Number of beams, defaults to 1, which is greedy search.
- `--output_len`: Maximum number of tokens to generate, excluding the input.
- `--padding`: Enable tokenizer padding, defaults to True.
- `--chat`: Enable chat mode for ChatGLM models, defaults to False.
- `--do_sample`: Enable sampling search, defaults to False.
- `--temperature`: Value used to modulate the next-token probabilities.
- `--top_p`: Retain the minimal set of tokens whose cumulative probability reaches the top-p threshold.
- `--top_k`: Number of highest-probability tokens to keep for generation.
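
To make the sampling options more concrete, here is a minimal pure-Python sketch of how temperature, top-k and top-p filtering are commonly combined when picking the next token. It illustrates the general technique only and is not necessarily how xFasterTransformer implements it; the function names `candidate_tokens` and `sample_token` are hypothetical.

```python
import math
import random


def candidate_tokens(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return (token_ids, probabilities) that survive top-k and top-p filtering."""
    # Temperature rescales logits: values below 1 sharpen the distribution,
    # values above 1 flatten it. It must be greater than 0.
    scaled = [x / temperature for x in logits]
    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort token ids by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # top-k: keep only the k most probable tokens (0 disables this filter).
    if top_k > 0:
        order = order[:top_k]
    # top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept, [probs[i] for i in kept]


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Draw one token id from the filtered distribution."""
    rng = rng or random.Random()
    kept, weights = candidate_tokens(logits, temperature, top_k, top_p)
    # random.choices treats weights as relative, so no renormalization is needed.
    return rng.choices(kept, weights=weights, k=1)[0]
```

With `top_k=1` the candidate set shrinks to the single most probable token, so sampling gives the same result as greedy search.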