# Use vLLM to accelerate inference

## Start the vLLM server

1. Register the "nvidia" runtime in Docker

Edit the Docker daemon configuration:

```bash
vim /etc/docker/daemon.json
```

Add the following content:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

Restart Docker:

```bash
systemctl daemon-reload
systemctl restart docker
```
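Optionally, confirm that Docker picked up the new runtime entry. A quick sanity check (the exact `docker info` output format varies by Docker version):

```bash
# "nvidia" should appear in the Runtimes line of the daemon info
docker info | grep -i runtimes
```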
2. Install nvidia-container-runtime and nvidia-docker2

Add the NVIDIA package repository:

```bash
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | \
    sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
```

Then install the packages:

```bash
yum install nvidia-container-runtime nvidia-docker2 -y
```

Restart Docker:

```bash
systemctl restart docker
```
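At this point you can check that containers actually see the GPU. A minimal sketch, assuming a CUDA base image is available locally or pullable (the image tag below is just an example):

```bash
# Run nvidia-smi inside a throwaway container to confirm GPU passthrough
docker run --rm --runtime nvidia --gpus all \
    nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```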
3. Start the vLLM server

```bash
docker run -d --runtime nvidia --gpus all \
    -v /root/SuperAdapters/output/llama3.1-combined:/root/SuperAdapters/output/llama3.1-combined \
    -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest \
    --model /root/SuperAdapters/output/llama3.1-combined \
    --trust-remote-code
```

P.S. If you use a V100 GPU, add the "--max_model_len" option (e.g. "--max_model_len 30000") so the context length fits in GPU memory.
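Once the container is running, vLLM serves an OpenAI-compatible API on port 8000. A minimal smoke test with curl (the prompt and sampling parameters below are just placeholders):

```bash
# List the models the server is hosting
curl http://localhost:8000/v1/models

# Send a completion request to the fine-tuned model
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/SuperAdapters/output/llama3.1-combined",
        "prompt": "Hello, my name is",
        "max_tokens": 32,
        "temperature": 0.7
    }'
```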