[Bug] After upgrading to 0.2.0, loading the AWQ-quantized qwen-14b or internlm2-chat-20b model fails with a GPU out-of-memory error #998

Closed
WCwalker opened this issue Jan 19, 2024 · 13 comments

Comments

@WCwalker

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

Exception in thread Thread-132 (_create_model_instance):
Traceback (most recent call last):
File "/home/mingqiang/.conda/envs/qwen/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
self.run()
File "/home/mingqiang/.conda/envs/qwen/lib/python3.10/threading.py", line 946, in run
self._target(*self._args, **self._kwargs)
File "/home/mingqiang/.conda/envs/qwen/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 486, in _create_model_instance
model_inst = self.tm_model.model_comm.create_model_instance(
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231

Reproduction

import lmdeploy

model_path = '/home/mingqiang/model/model_file/qwen-14b-chat-finetune-4bit/'
pipe = lmdeploy.pipeline(model_path, model_name='qwen-14b')
response = pipe(["你是谁", "你在哪"])
print(response)

Environment

lmdeploy==0.2.0

Error traceback

No response

@miaoerduo

Hi, I've run into the same problem and wanted to ask: does serving work normally with the previous version? Also, when deploying internlm2-chat-20b-4bits, I get an error saying tokenizer.json cannot be found. Is that file actually tokenizer_config.json?

@WCwalker
Author

> Hi, I've run into the same problem and wanted to ask: does serving work normally with the previous version? Also, when deploying internlm2-chat-20b-4bits, I get an error saying tokenizer.json cannot be found. Is that file actually tokenizer_config.json?

I was still using 0.1.0 this morning. Then I noticed that `internlm2-chat-20b-4bits` is only supported from 0.2.0, so I just upgraded, and it started failing. Even the quantized qwen-14 model that used to run now reports this error. After downgrading back to 0.1.0, qwen works again.

@WCwalker
Author

> ...when deploying internlm2-chat-20b-4bits, I get an error saying tokenizer.json cannot be found. Is that file actually tokenizer_config.json?

I don't think the problem is tokenizer.json.

@miaoerduo

> I don't think the problem is tokenizer.json.

Then how do I get that file? The earlier models don't seem to have it either.

Also, there's a recent MR that fixes an OOM issue; it might be the fix for this very problem. #973

@WCwalker
Author

> Then how do I get that file? The earlier models don't seem to have it either. Also, there's a recent MR that fixes an OOM issue; it might be the fix for this very problem. #973

I haven't figured out what the problem is yet, so I'll hold off on evaluating internlm2-chat-20b-4bits and wait for the maintainers to look into a fix. For now I'll keep running 0.1.0.

@lvhan028
Collaborator

What GPU model are you using, and which command did you run?

@WCwalker
Author

> What GPU model are you using, and which command did you run?

A800. Example run:
import lmdeploy

model_path = '/home/mingqiang/model/model_file/qwen-14b-chat-finetune-4bit/'
pipe = lmdeploy.pipeline(model_path, model_name='qwen-14b')
response = pipe(["你是谁", "你在哪"])
print(response)

@miaoerduo

> What GPU model are you using, and which command did you run?

Mine is a 4090.

The command follows the HF model card (https://huggingface.co/internlm/internlm2-chat-20b-4bits):

lmdeploy serve api_server internlm/internlm2-chat-20b-4bits --backend turbomind --model-format awq

@lvhan028
Collaborator

Try adding the option --cache-max-entry-count 0.4.
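
For reference, when using the Python pipeline instead of `lmdeploy serve`, the same knob should be settable through the engine config rather than a CLI flag. A minimal sketch, assuming the lmdeploy 0.2.x `TurbomindEngineConfig` API (class and argument names may differ slightly between versions):

import lmdeploy
from lmdeploy import TurbomindEngineConfig

model_path = '/home/mingqiang/model/model_file/qwen-14b-chat-finetune-4bit/'
# cache_max_entry_count controls the fraction of GPU memory reserved
# for the k/v cache; lowering it leaves more room for the model weights.
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.4)
pipe = lmdeploy.pipeline(model_path, model_name='qwen-14b',
                         backend_config=backend_config)
print(pipe(["你是谁", "你在哪"]))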

@WCwalker
Author

> Try adding the option --cache-max-entry-count 0.4.

I tested it today; it only works when set to 0.3 or lower. But loading the internlm2-chat-20b model has a different problem: the stop tokens don't work, so inference keeps running and never stops.
(screenshot attached)

@lvhan028
Collaborator

The internlm2-chat-20b model's special_tokens were updated last week; you need lmdeploy v0.2.1 to match.
Could you update lmdeploy and try again?
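
For example, upgrading with pip (a standard pip installation is assumed here):

pip install -U lmdeploy==0.2.1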

@WCwalker
Author

> You need lmdeploy v0.2.1 to match. Could you update lmdeploy and try again?

I just updated to 0.2.1; here is the output:
sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA A800 80GB PCIe
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Debian 10.2.1-6) 10.2.1 20210110
PyTorch: 2.1.2+cu121
PyTorch compiling details: PyTorch built with:

  • GCC 9.3
  • C++ Version: 201703
  • Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 12.1
  • NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  • CuDNN 8.9.2
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

LMDeploy: 0.2.1+
transformers: 4.36.2
gradio: Not Found
fastapi: 0.109.0
pydantic: 2.5.3

@WCwalker
Author

> The internlm2-chat-20b model's special_tokens were updated last week; you need lmdeploy v0.2.1 to match. Could you update lmdeploy and try again?

It works now. I just went in and updated the tokenizer_config.json inside the internlm2-chat-20b model.
