[Bug] After upgrading to 0.2.0, loading the AWQ-quantized qwen-14b or internlm2-chat-20b model fails with a GPU out-of-memory error #998

Closed
WCwalker opened this issue Jan 19, 2024 · 13 comments

Comments

@WCwalker

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

Exception in thread Thread-132 (_create_model_instance):
Traceback (most recent call last):
File "/home/mingqiang/.conda/envs/qwen/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
self.run()
File "/home/mingqiang/.conda/envs/qwen/lib/python3.10/threading.py", line 946, in run
self._target(*self._args, **self._kwargs)
File "/home/mingqiang/.conda/envs/qwen/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 486, in _create_model_instance
model_inst = self.tm_model.model_comm.create_model_instance(
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231

Reproduction

import lmdeploy

model_path = '/home/mingqiang/model/model_file/qwen-14b-chat-finetune-4bit/'
pipe = lmdeploy.pipeline(model_path, model_name='qwen-14b')
response = pipe(["你是谁", "你在哪"])
print(response)

Environment

lmdeploy==0.2.0

Error traceback

No response

@miaoerduo

Hi, I've run into the same problem and wanted to ask: does serving work normally with the previous version? Also, when deploying internlm2-chat-20b-4bits, I get an error saying tokenizer.json cannot be found. Is that file actually tokenizer_config.json?

@WCwalker
Author

> Hi, I've run into the same problem and wanted to ask: does serving work normally with the previous version? Also, when deploying internlm2-chat-20b-4bits, I get an error saying tokenizer.json cannot be found. Is that file actually tokenizer_config.json?

I was still using 0.1.0 this morning. Then I noticed that `internlm2-chat-20b-4bits` is only supported from 0.2.0, so I just upgraded, and it started failing. Even the quantized qwen-14 model that used to run now reports this error. After downgrading back to 0.1.0, qwen works again.

@WCwalker
Author

> ...when deploying internlm2-chat-20b-4bits, I get an error saying tokenizer.json cannot be found. Is that file actually tokenizer_config.json?

I don't think the problem is tokenizer.json.

@miaoerduo

> I don't think the problem is tokenizer.json.

Then how do I get that file? The earlier models don't seem to have it either.

Also, there's a recent MR that fixes an OOM issue; it might be the fix for this very problem. #973

@WCwalker
Author

> Then how do I get that file? The earlier models don't seem to have it either. Also, there's a recent MR that fixes an OOM issue; it might be the fix for this very problem. #973

I haven't figured out what the problem is yet, so I'll hold off on evaluating internlm2-chat-20b-4bits and wait for the maintainers to look into a fix. For now I'll keep running 0.1.0.

@lvhan028
Collaborator

What GPU model are you using, and which command did you run?

@WCwalker
Author

> What GPU model are you using, and which command did you run?

A800. Example run:
import lmdeploy

model_path = '/home/mingqiang/model/model_file/qwen-14b-chat-finetune-4bit/'
pipe = lmdeploy.pipeline(model_path, model_name='qwen-14b')
response = pipe(["你是谁", "你在哪"])
print(response)

@miaoerduo

> What GPU model are you using, and which command did you run?

Mine is a 4090.

The command follows the HF model card (https://huggingface.co/internlm/internlm2-chat-20b-4bits):

lmdeploy serve api_server internlm/internlm2-chat-20b-4bits --backend turbomind --model-format awq

@lvhan028
Collaborator

Try adding the option --cache-max-entry-count 0.4.
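
For reference, when using the Python pipeline instead of `lmdeploy serve`, the same knob should be settable through the engine config rather than a CLI flag. A minimal sketch, assuming the lmdeploy 0.2.x `TurbomindEngineConfig` API (class and argument names may differ slightly between versions):

import lmdeploy
from lmdeploy import TurbomindEngineConfig

model_path = '/home/mingqiang/model/model_file/qwen-14b-chat-finetune-4bit/'
# cache_max_entry_count controls the fraction of GPU memory reserved
# for the k/v cache; lowering it leaves more room for the model weights.
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.4)
pipe = lmdeploy.pipeline(model_path, model_name='qwen-14b',
                         backend_config=backend_config)
print(pipe(["你是谁", "你在哪"]))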

@WCwalker
Author

> Try adding the option --cache-max-entry-count 0.4.

I tested it today; it only works when set to 0.3 or lower. But loading the internlm2-chat-20b model has a different problem: the stop tokens don't work, so inference keeps running and never stops.
(screenshot attached)

@lvhan028
Collaborator

The internlm2-chat-20b model's special_tokens were updated last week; you need lmdeploy v0.2.1 to match.
Could you update lmdeploy and try again?
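
For example, upgrading with pip (a standard pip installation is assumed here):

pip install -U lmdeploy==0.2.1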

@WCwalker
Author

> You need lmdeploy v0.2.1 to match. Could you update lmdeploy and try again?

I just updated to 0.2.1; here is the output:
sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA A800 80GB PCIe
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Debian 10.2.1-6) 10.2.1 20210110
PyTorch: 2.1.2+cu121
PyTorch compiling details: PyTorch built with:

  • GCC 9.3
  • C++ Version: 201703
  • Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 12.1
  • NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  • CuDNN 8.9.2
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

LMDeploy: 0.2.1+
transformers: 4.36.2
gradio: Not Found
fastapi: 0.109.0
pydantic: 2.5.3

@WCwalker
Author

> The internlm2-chat-20b model's special_tokens were updated last week; you need lmdeploy v0.2.1 to match. Could you update lmdeploy and try again?

It works now. I just went in and updated the tokenizer_config.json inside the internlm2-chat-20b model.
