
[Feature] TurbomindEngineConfig supports lora #1007

Closed
seanxuu opened this issue Jan 20, 2024 · 15 comments

@seanxuu

seanxuu commented Jan 20, 2024
Motivation

Hugging Face already supports inference with LoRA adapters without merging the weights into the base model.

This is very important when running a large number of tests: merging weights consumes a lot of resources and time.
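
For context, this is the kind of unmerged-adapter inference available on the Hugging Face side (a minimal sketch assuming the PEFT library; the model and adapter paths below are placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = '/path/to/base-model'      # placeholder
adapter = '/path/to/lora-adapter'       # placeholder

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base_model, trust_remote_code=True)
# attach the LoRA adapter without merging it into the base weights
model = PeftModel.from_pretrained(model, adapter)

inputs = tokenizer('你好', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))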

Related resources

No response

Additional context

No response

@lvhan028
Collaborator

There is no plan for TurboMind to support LoRA in the near term.
However, the other engine in LMDeploy, the PyTorch engine, already supports it. You may give it a try.
@grimoire please provide some guidance

@grimoire
Collaborator

Hi, you can try out the PyTorch engine.

You can build the engine with a dictionary of adapter_name: adapter_path pairs.

from lmdeploy.messages import PytorchEngineConfig
from lmdeploy.pytorch.engine.engine import Engine

# map each adapter name to its local path
adapters = {'default': '/path/to/adapter'}
engine_config = PytorchEngineConfig(adapters=adapters)
engine = Engine.from_pretrained(model_path,  # path to the base model
                                engine_config=engine_config,
                                trust_remote_code=True)

And perform inference with an engine instance:

generator = engine.create_instance()

for outputs in generator.stream_infer(session_id=session_id,
                                      input_ids=input_ids,
                                      gen_config=gen_config,
                                      adapter_name=adapter_name):
    # read outputs here
    pass

# close the session and release the caches
generator.end(session_id)
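
The names session_id, input_ids, gen_config and adapter_name are assumed to be prepared by the caller; a minimal sketch, based on the usage shown later in this thread, could look like this:

from lmdeploy import GenerationConfig

session_id = 1                               # a unique id for this conversation
input_ids = engine.tokenizer.encode('你好')   # tokenize the prompt with the engine's tokenizer
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=128)
adapter_name = 'default'                     # must match a key of the adapters dict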

@Cloopen-ReLiNK

Cloopen-ReLiNK commented Jan 23, 2024

Hi, you can try out the PyTorch engine. [...]

python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
Aborted

Loading a LoRA adapter on chatglm2-6b fails with the error above during prediction. Environment: V100, cuda12, pytorch 2.1.1, triton 2.1.0; lmdeploy was installed from the whl file downloaded as instructed in the guide.
The same error occurs on a cuda11 machine.

@grimoire
Collaborator

@Cloopen-ReLiNK I cannot reproduce the error with this adapter. Would you mind providing your adapter?

@Cloopen-ReLiNK

Cloopen-ReLiNK commented Jan 23, 2024

Would you mind providing your adapter?

Thank you. I will first test loading the adapter you are using; if there is still a problem, I will provide mine.

It produced the same error.

@Cloopen-ReLiNK

Cloopen-ReLiNK commented Jan 23, 2024

torch: 2.1.1, triton: 2.1.0, lmdeploy: 0.2.1, cuda: 11.7, V100-32G

@grimoire
What is your machine environment?

from lmdeploy.messages import PytorchEngineConfig
from lmdeploy.pytorch.engine.engine import Engine

adapters = {'default': './lora-chatglm2-6b-guodegang'}
engine_config = PytorchEngineConfig(adapters=adapters)
model_path = '/home/pretrained_model/chatglm2-6b'

engine = Engine.from_pretrained(model_path,
                                engine_config=engine_config,
                                trust_remote_code=True)

generator = engine.create_instance()

session_id = '1'
input_ids = engine.tokenizer.encode("你好")
adapter_name = 'default'

from lmdeploy import GenerationConfig
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)

for outputs in generator.stream_infer(session_id=session_id,
                                      input_ids=input_ids,
                                      gen_config=gen_config,
                                      adapter_name=adapter_name):
    pass

@grimoire
Collaborator

I can replicate the error on V100 now. It seems that the flash-attention implementation does not support devices with sm < 80.
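
For anyone checking whether their GPU is affected, the compute capability can be queried with PyTorch (a small sketch, not part of lmdeploy):

import torch

# V100 reports (7, 0), i.e. sm70; Ampere cards such as A100/3090 report (8, x)
major, minor = torch.cuda.get_device_capability(0)
print(f'sm{major}{minor}')
if major < 8:
    print('sm < 80: the flash-attention kernel discussed above will not run on this device')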

@Cloopen-ReLiNK

Cloopen-ReLiNK commented Jan 23, 2024

On a 3090 with cuda11.6, both adapters can run normally.
(screenshots of the two runs omitted)
But the generated results are inconsistent; the question was "你好".

@Cloopen-ReLiNK

Is there any solution for using V100? flash attention v1?

@grimoire
Collaborator

grimoire commented Jan 23, 2024

But the generated results are inconsistent

This might be caused by random sampling.

Is there any solution for using V100?

I am trying to find a fix; please allow me some time.
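
To rule out sampling noise when comparing the two adapters, one option is to make decoding effectively greedy; a minimal sketch using the GenerationConfig fields already shown in this thread:

from lmdeploy import GenerationConfig

# top_k=1 always picks the highest-probability token, so repeated runs
# on the same prompt should produce the same output
gen_config = GenerationConfig(top_k=1,
                              temperature=1.0,
                              max_new_tokens=1024)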

@grimoire
Collaborator

@Cloopen-ReLiNK #1027 Please give it a try.

@Cloopen-ReLiNK

@Cloopen-ReLiNK #1027 Please give it a try.

Solved.

@seanxuu
Author

seanxuu commented Jan 24, 2024

Hi, you can try out the PyTorch engine. [...]

Can it achieve the same performance as the TurboMind engine? In my experience the TurboMind engine is better than the PyTorch engine.

@grimoire
Collaborator

grimoire commented Jan 24, 2024

Can it achieve the same performance as the TurboMind engine?

No, the PyTorch engine is slower.

@seanxuu
Author

seanxuu commented Jan 24, 2024

Can it achieve the same performance as the TurboMind engine?

No, the PyTorch engine is slower.

Got it. Thank you.
