The distilled models can be used in the same way as the base model. For instance:
- `DeepSeek-R1-Distill-Qwen-xxB`, based on Qwen2.5, utilizes `Qwen2Convert`.
- `DeepSeek-R1-Distill-Llama-xxB` employs `LlamaConvert`.
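For example, a distilled Qwen-based checkpoint can be converted with the same one-liner pattern used for the base model below (a minimal sketch; `${HF_DATASET_DIR}` and `${OUTPUT_DIR}` are placeholder paths):
```bash
# Sketch: convert a DeepSeek-R1-Distill-Qwen checkpoint into xFT format with Qwen2Convert.
# The placeholder paths are assumptions; point them at the HF checkpoint and the desired output directory.
python -c 'import xfastertransformer as xft; xft.Qwen2Convert().convert("${HF_DATASET_DIR}", "${OUTPUT_DIR}")'
```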
Notice: DeepSeek-R1 671B only supports dtype `fp8_e4m3` and kvcache dtype `bf16`.
- xfastertransformer >= 2.0.0
- oneCCL
- vllm-xft >= 0.5.5.3
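A minimal install sketch for the Python packages (version pins mirror the list above; oneCCL is a separate system-level dependency, so install it per the xFasterTransformer docs or use the docker image):
```bash
# Install the Python-side requirements; oneCCL must be installed separately (or use the docker image).
pip install "xfastertransformer>=2.0.0" "vllm-xft>=0.5.5.3"
```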
Notice:
- The Docker image only contains `xfastertransformer` and `oneCCL`; `vllm-xft` isn't installed. Use `pip install vllm-xft` to install it.
- The Docker image exports the `libiomp5.so` env by default, so there is no need to preload `libiomp5.so` manually.
```bash
docker pull intel/xfastertransformer:latest

# Run the docker with the command (Assume model files are in `/data/` directory):
docker run -it \
    --name xfastertransformer \
    --privileged \
    --shm-size=16g \
    -v /data/:/data/ \
    -e "http_proxy=$http_proxy" \
    -e "https_proxy=$https_proxy" \
    intel/xfastertransformer:latest
```
- If you encounter an error such as `mpirun: command not found`, please refer to the "Requirements" section to install oneCCL, or use the docker image.
- By default, the benchmark script utilizes 1 CPU node; use the `-s` option to change this.
- For better performance, please consider clearing all caches with the command: `sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches`.
- If the performance does not meet expectations, try setting the environment variable `export XDNN_N64=64` or `export XDNN_N64=16` before running the benchmark.
Execute the benchmark scripts to evaluate the model's performance using fake weights, without downloading the 600GB+ weights.
```bash
git clone https://github.com/intel/xFasterTransformer.git
cd xFasterTransformer/benchmark
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
bash run_benchmark.sh -m deepseek-r1 -d fp8_e4m3 -kvd bf16 -bs 1 -in 32 -out 32 -s 1
```
Or, inside the docker container:
```bash
cd /root/xFasterTransformer/benchmark
bash run_benchmark.sh -m deepseek-r1 -d fp8_e4m3 -kvd bf16 -bs 1 -in 32 -out 32 -s 1
```
- `-bs`: batch size.
- `-in`: input token length, one of `[32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]`.
- `-out`: output token length.
- `-s`: number of CPU nodes to use. If you only have 1 node with SNC-3 enabled, it will be used as the number of sub-NUMA clusters.
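For example, an illustrative heavier run (the parameter values here are only examples within the supported ranges):
```bash
# Example only: batch size 4, 1024 input tokens, 512 output tokens, across 2 CPU nodes.
bash run_benchmark.sh -m deepseek-r1 -d fp8_e4m3 -kvd bf16 -bs 4 -in 1024 -out 512 -s 2
```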
- Download the original DeepSeek-R1 671B model from HuggingFace.
- Convert the model into xFT format:

```bash
python -c 'import xfastertransformer as xft; xft.DeepSeekR1Convert().convert("${HF_DATASET_DIR}","${OUTPUT_DIR}")'
```

After conversion, the `*.safetensors` files in `${HF_DATASET_DIR}` are no longer needed if you want to save storage space.
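For reference, one possible way to fetch the original checkpoint is with `huggingface-cli` (this tool and the destination path are assumptions, not part of xFT itself):
```bash
# Download the original HF checkpoint into ${HF_DATASET_DIR} (placeholder path), then convert it as shown above.
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir ${HF_DATASET_DIR}
```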
- Single instance

  If you want to run DeepSeek within one CPU NUMA node, like `NUMA node0 CPU(s): 0-47,96-143`:

  ```bash
  # Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
  export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

  numactl -C 0-47 -l python -m vllm.entrypoints.openai.api_server \
          --model ${MODEL_PATH} \
          --tokenizer ${TOKEN_PATH} \
          --dtype fp8_e4m3 \
          --kv-cache-dtype bf16 \
          --served-model-name xft \
          --port 8000 \
          --trust-remote-code
  ```

  - `MODEL_PATH`: the xFT-format model weights.
  - `TOKEN_PATH`: the tokenizer-related files, e.g. `HF_DATASET_DIR`.
  - `served-model-name`: the model name used in the API.
- Distributed (Multi-rank)

  If you want to run DeepSeek across NUMA nodes, like:

  ```
  NUMA node0 CPU(s): 0-47,96-143
  NUMA node1 CPU(s): 48-95,144-191
  ```

  ```bash
  # Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
  export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

  OMP_NUM_THREADS=48 mpirun \
    -n 1 numactl --all -C 0-47 \
      python -m vllm.entrypoints.openai.api_server \
        --model ${MODEL_PATH} \
        --tokenizer ${TOKEN_PATH} \
        --dtype fp8_e4m3 \
        --kv-cache-dtype bf16 \
        --served-model-name xft \
        --port 8000 \
        --trust-remote-code \
    : -n 1 numactl --all -C 48-95 \
      python -m vllm.entrypoints.slave \
        --dtype fp8_e4m3 \
        --model ${MODEL_PATH} \
        --kv-cache-dtype bf16
  ```
- Query example

  ```bash
  curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "xft",
      "messages": [{"role": "user", "content": "Hello! Who are you?"}],
      "max_tokens": 256,
      "temperature": 0.6,
      "top_p": 0.95
    }'
  ```
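After the server starts, a quick sanity check is to list the served models via the standard OpenAI-compatible endpoint (assuming the default port 8000 used above); the response should contain the name passed to `--served-model-name`:
```bash
# Should report the served model name ("xft" with the flags above).
curl http://localhost:8000/v1/models
```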