lmdeploy serve is more than two times slower than normal transformers code #2248
Unanswered
paniabhisek asked this question in Q&A
Replies: 1 comment · 7 replies
I am using the Phi-3-vision model. When I run the example snippet from the Hugging Face model card, the response takes around 5 seconds.
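For reference, the Hugging Face example follows roughly this pattern (a sketch based on the public model card; the image URL and question are placeholders, not the ones actually used):

```python
# Sketch of the Phi-3-vision Hugging Face example; eager attention matches
# the configuration described in this question (flash attention disabled).
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Placeholder image and query.
image = Image.open(requests.get("https://example.com/sample.jpg", stream=True).raw)
messages = [{"role": "user", "content": "<|image_1|>\nDescribe this image."}]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

output_ids = model.generate(
    **inputs, max_new_tokens=500, eos_token_id=processor.tokenizer.eos_token_id
)
output_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```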
However, if I serve the model with
lmdeploy serve api_server microsoft/Phi-3-vision-128k-instruct --server-port 23333
and then use the following code to get the response, it takes around 10 to 12 seconds.
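A minimal sketch of such a client, assuming the OpenAI-compatible /v1 endpoint that lmdeploy api_server exposes and the openai Python package (the image URL and question are placeholders, not the ones actually used):

```python
# Sketch of an OpenAI-style client against the lmdeploy api_server started above.
from openai import OpenAI

client = OpenAI(api_key="none", base_url="http://0.0.0.0:23333/v1")
model_name = client.models.list().data[0].id  # name of the served model

response = client.chat.completions.create(
    model=model_name,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.jpg"}},
        ],
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)
```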
Why is there so much delay? In both cases the same image and query are used, and I have disabled flash attention and set eager attention mode in config.json.