[Feature] TurbomindEngineConfig supports lora #1007
Comments
Turbomind has no plan to support LoRA in the near future.
Hi, you can try out the PyTorch Engine. You can build the engine with a dictionary of adapters:

```python
from lmdeploy.messages import PytorchEngineConfig
from lmdeploy.pytorch.engine.engine import Engine

adapters = {'default': '/path/to/adapter'}
engine_config = PytorchEngineConfig(adapters=adapters)
engine = Engine.from_pretrained(model_path,
                                engine_config=engine_config,
                                trust_remote_code=True)
```

And perform inference with the engine instance:

```python
generator = engine.create_instance()
for outputs in generator.stream_infer(session_id=session_id,
                                      input_ids=input_ids,
                                      gen_config=gen_config,
                                      adapter_name=adapter_name):
    # read outputs
    pass

# close session and release caches
generator.end(session_id)
```
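For a higher-level entry point, here is a sketch of the same adapter setup through lmdeploy's `pipeline` API, assuming the `adapters` field of `PytorchEngineConfig` and the per-request `adapter_name` argument behave as in the example above; the adapter name, paths, and prompt are placeholders:

```python
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

# Register the LoRA adapter under a name when building the PyTorch backend.
backend_config = PytorchEngineConfig(adapters={'default': '/path/to/adapter'})
pipe = pipeline('/path/to/base/model', backend_config=backend_config)

# Select the registered adapter per request via adapter_name.
response = pipe(['Hi, please introduce yourself'],
                gen_config=GenerationConfig(max_new_tokens=128),
                adapter_name='default')
print(response)
```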
Loading a LoRA adapter with chatglm2-6b fails during inference (V100, CUDA 12, pytorch 2.1.1, triton 2.1.0; lmdeploy installed from the whl file downloaded as instructed) with the following assertion error:

```
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector, llvm::SmallVector> mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
```
@Cloopen-ReLiNK I cannot reproduce the error with this adapter. Would you mind providing your adapter?
Thank you, I will first test loading the adapter you are using. If there is any problem, I will provide my adapter. It produced the same error.
@grimoire Environment: torch 2.1.1, triton 2.1.0, lmdeploy 0.2.1, cuda 11.7, V100-32G. The code is the snippet above (PytorchEngineConfig, Engine.from_pretrained, create_instance, stream_infer), with session_id = '1'.
I can replicate the error on V100 now. It seems that the flash-attention implementation does not support devices with sm < 80.
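For reference, a minimal sketch of how one might confirm a GPU's compute capability with PyTorch (only `torch.cuda.get_device_capability` from standard PyTorch is used; the sm80 threshold refers to the flash-attention requirement mentioned above):

```python
import torch

# Query the (major, minor) compute capability of the first CUDA device.
major, minor = torch.cuda.get_device_capability(0)
print(f"sm{major}{minor}")  # e.g. sm70 on V100, sm80 on A100
```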
Is there any solution for using a V100? Flash-attention v1?
This might be caused by random sampling.
I am trying to find a fix; please allow me some time.
@Cloopen-ReLiNK Please give #1027 a try.
Solved.
Can it achieve the same capabilities as the TurbomindEngine? I ask because I find the TurbomindEngine performs better than the PytorchEngine.
No, PytorchEngine is slower. |
Got it. Thank you. |
Motivation
The existing HuggingFace implementation already supports inference without merging the LoRA weights.
This is very important when running a large number of tests, since merging weights consumes a lot of resources and time.
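For illustration, a minimal sketch of that unmerged-adapter workflow using HuggingFace PEFT (`PeftModel.from_pretrained`); the model and adapter paths are placeholders, and this is not lmdeploy code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter without merging weights.
base = AutoModelForCausalLM.from_pretrained('/path/to/base/model', trust_remote_code=True)
model = PeftModel.from_pretrained(base, '/path/to/adapter')

tokenizer = AutoTokenizer.from_pretrained('/path/to/base/model', trust_remote_code=True)
inputs = tokenizer('Hello', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```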
Related resources
No response
Additional context
No response