The distilled models can be used in the same way as the base model. For instance:
- `DeepSeek-R1-Distill-Qwen-xxB`, based on Qwen2.5, utilizes `Qwen2Convert`.
- `DeepSeek-R1-Distill-Llama-xxB` employs `LlamaConvert`.
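For example, a distilled Qwen-based checkpoint can be converted with the same one-liner pattern used for the base model below (a minimal sketch; `${HF_DATASET_DIR}` and `${OUTPUT_DIR}` are placeholder paths):
```bash
# Sketch: convert a DeepSeek-R1-Distill-Qwen checkpoint into xFT format with Qwen2Convert.
# The placeholder paths are assumptions; point them at the HF checkpoint and the desired output directory.
python -c 'import xfastertransformer as xft; xft.Qwen2Convert().convert("${HF_DATASET_DIR}", "${OUTPUT_DIR}")'
```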
Notice: DeepSeek-R1 671B only supports dtype `fp8_e4m3` and kvcache dtype `bf16`.
- xfastertransformer >= 2.0.0
- oneCCL
- vllm-xft >= 0.5.5.3
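A minimal install sketch for the Python packages (version pins mirror the list above; oneCCL is a separate system-level dependency, so install it per the xFasterTransformer docs or use the docker image):
```bash
# Install the Python-side requirements; oneCCL must be installed separately (or use the docker image).
pip install "xfastertransformer>=2.0.0" "vllm-xft>=0.5.5.3"
```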
Notice:
- The Docker image only contains `xfastertransformer` and `oneCCL`; `vllm-xft` isn't installed. Use `pip install vllm-xft` to install it.
- The Docker image exports the `libiomp5.so` env by default, so there is no need to preload `libiomp5.so` manually.
```bash
docker pull intel/xfastertransformer:latest

# Run the docker with the command (Assume model files are in `/data/` directory):
docker run -it \
    --name xfastertransformer \
    --privileged \
    --shm-size=16g \
    -v /data/:/data/ \
    -e "http_proxy=$http_proxy" \
    -e "https_proxy=$https_proxy" \
    intel/xfastertransformer:latest
```
- If you encounter an error such as `mpirun: command not found`, please refer to the "Requirements" section to install oneCCL, or use the docker image.
- By default, the benchmark script utilizes 1 CPU node; use the `-s` option to change this.
- For better performance, please consider clearing all caches with the command: `sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches`.
- If the performance does not meet expectations, try setting the environment variable `export XDNN_N64=64` or `export XDNN_N64=16` before running the benchmark.
Execute the benchmark scripts to evaluate the model's performance using fake weights, without downloading the 600GB+ weights.
```bash
git clone https://github.com/intel/xFasterTransformer.git
cd xFasterTransformer/benchmark
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
bash run_benchmark.sh -m deepseek-r1 -d fp8_e4m3 -kvd bf16 -bs 1 -in 32 -out 32 -s 1
```
Or, inside the docker container:
```bash
cd /root/xFasterTransformer/benchmark
bash run_benchmark.sh -m deepseek-r1 -d fp8_e4m3 -kvd bf16 -bs 1 -in 32 -out 32 -s 1
```
- `-bs`: batch size.
- `-in`: input token length, one of `[32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]`.
- `-out`: output token length.
- `-s`: number of CPU nodes to use. If you only have 1 node with SNC-3 enabled, it will be used as the number of sub-NUMA clusters.
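For example, an illustrative heavier run (the parameter values here are only examples within the supported ranges):
```bash
# Example only: batch size 4, 1024 input tokens, 512 output tokens, across 2 CPU nodes.
bash run_benchmark.sh -m deepseek-r1 -d fp8_e4m3 -kvd bf16 -bs 4 -in 1024 -out 512 -s 2
```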
- Download the original DeepSeek-R1 671B model from HuggingFace.
- Convert the model into xFT format:

```bash
python -c 'import xfastertransformer as xft; xft.DeepSeekR1Convert().convert("${HF_DATASET_DIR}","${OUTPUT_DIR}")'
```

After conversion, the `*.safetensors` files in `${HF_DATASET_DIR}` are no longer needed if you want to save storage space.
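For reference, one possible way to fetch the original checkpoint is with `huggingface-cli` (this tool and the destination path are assumptions, not part of xFT itself):
```bash
# Download the original HF checkpoint into ${HF_DATASET_DIR} (placeholder path), then convert it as shown above.
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir ${HF_DATASET_DIR}
```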
- Single instance

  If you want to run DeepSeek within one CPU NUMA node, like `NUMA node0 CPU(s): 0-47,96-143`:

  ```bash
  # Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
  export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

  numactl -C 0-47 -l python -m vllm.entrypoints.openai.api_server \
          --model ${MODEL_PATH} \
          --tokenizer ${TOKEN_PATH} \
          --dtype fp8_e4m3 \
          --kv-cache-dtype bf16 \
          --served-model-name xft \
          --port 8000 \
          --trust-remote-code
  ```

  - `MODEL_PATH`: the xFT-format model weights.
  - `TOKEN_PATH`: the tokenizer-related files, e.g. `HF_DATASET_DIR`.
  - `served-model-name`: the model name used in the API.
- Distributed (Multi-rank)

  If you want to run DeepSeek across NUMA nodes, like:

  ```
  NUMA node0 CPU(s): 0-47,96-143
  NUMA node1 CPU(s): 48-95,144-191
  ```

  ```bash
  # Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
  export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

  OMP_NUM_THREADS=48 mpirun \
    -n 1 numactl --all -C 0-47 \
      python -m vllm.entrypoints.openai.api_server \
        --model ${MODEL_PATH} \
        --tokenizer ${TOKEN_PATH} \
        --dtype fp8_e4m3 \
        --kv-cache-dtype bf16 \
        --served-model-name xft \
        --port 8000 \
        --trust-remote-code \
    : -n 1 numactl --all -C 48-95 \
      python -m vllm.entrypoints.slave \
        --dtype fp8_e4m3 \
        --model ${MODEL_PATH} \
        --kv-cache-dtype bf16
  ```
- Query example

  ```bash
  curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "xft",
      "messages": [{"role": "user", "content": "Hello! Who are you?"}],
      "max_tokens": 256,
      "temperature": 0.6,
      "top_p": 0.95
    }'
  ```
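After the server starts, a quick sanity check is to list the served models via the standard OpenAI-compatible endpoint (assuming the default port 8000 used above); the response should contain the name passed to `--served-model-name`:
```bash
# Should report the served model name ("xft" with the flags above).
curl http://localhost:8000/v1/models
```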