
[Feature] TurbomindEngineConfig supports lora #1007

Closed
seanxuu opened this issue Jan 20, 2024 · 15 comments

@seanxuu

seanxuu commented Jan 20, 2024
Motivation

Hugging Face already supports inference with LoRA adapters without merging the weights into the base model.

This is very important when running a large number of tests: merging weights consumes a lot of resources and time.
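
For context, this is the kind of unmerged-adapter inference available on the Hugging Face side (a minimal sketch assuming the PEFT library; the model and adapter paths below are placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = '/path/to/base-model'      # placeholder
adapter = '/path/to/lora-adapter'       # placeholder

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base_model, trust_remote_code=True)
# attach the LoRA adapter without merging it into the base weights
model = PeftModel.from_pretrained(model, adapter)

inputs = tokenizer('你好', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))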

Related resources

No response

Additional context

No response

@lvhan028
Collaborator

There is no plan for TurboMind to support LoRA in the near term.
However, the other engine in LMDeploy, the PyTorch engine, already supports it. You may give it a try.
@grimoire please provide some guidance

@grimoire
Collaborator

Hi, you can try out the PyTorch engine.

You can build the engine with a dictionary of adapter_name: adapter_path pairs.

from lmdeploy.messages import PytorchEngineConfig
from lmdeploy.pytorch.engine.engine import Engine

# map each adapter name to its local path
adapters = {'default': '/path/to/adapter'}
engine_config = PytorchEngineConfig(adapters=adapters)
engine = Engine.from_pretrained(model_path,  # path to the base model
                                engine_config=engine_config,
                                trust_remote_code=True)

And perform inference with an engine instance:

generator = engine.create_instance()

for outputs in generator.stream_infer(session_id=session_id,
                                      input_ids=input_ids,
                                      gen_config=gen_config,
                                      adapter_name=adapter_name):
    # read outputs here
    pass

# close the session and release the caches
generator.end(session_id)
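
The names session_id, input_ids, gen_config and adapter_name are assumed to be prepared by the caller; a minimal sketch, based on the usage shown later in this thread, could look like this:

from lmdeploy import GenerationConfig

session_id = 1                               # a unique id for this conversation
input_ids = engine.tokenizer.encode('你好')   # tokenize the prompt with the engine's tokenizer
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=128)
adapter_name = 'default'                     # must match a key of the adapters dict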

@Cloopen-ReLiNK

Cloopen-ReLiNK commented Jan 23, 2024

Hi, you can try out the PyTorch engine. [...]

python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
Aborted

Loading a LoRA adapter on chatglm2-6b fails with the error above during prediction. Environment: V100, cuda12, pytorch 2.1.1, triton 2.1.0; lmdeploy was installed from the whl file downloaded as instructed in the guide.
The same error occurs on a cuda11 machine.

@grimoire
Collaborator

@Cloopen-ReLiNK I cannot reproduce the error with this adapter. Would you mind providing your adapter?

@Cloopen-ReLiNK

Cloopen-ReLiNK commented Jan 23, 2024

Would you mind providing your adapter?

Thank you. I will first test loading the adapter you are using; if there is still a problem, I will provide mine.

It produced the same error.

@Cloopen-ReLiNK

Cloopen-ReLiNK commented Jan 23, 2024

torch: 2.1.1, triton: 2.1.0, lmdeploy: 0.2.1, cuda: 11.7, V100-32G

@grimoire
What is your machine environment?

from lmdeploy.messages import PytorchEngineConfig
from lmdeploy.pytorch.engine.engine import Engine

adapters = {'default': './lora-chatglm2-6b-guodegang'}
engine_config = PytorchEngineConfig(adapters=adapters)
model_path = '/home/pretrained_model/chatglm2-6b'

engine = Engine.from_pretrained(model_path,
                                engine_config=engine_config,
                                trust_remote_code=True)

generator = engine.create_instance()

session_id = '1'
input_ids = engine.tokenizer.encode("你好")
adapter_name = 'default'

from lmdeploy import GenerationConfig
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)

for outputs in generator.stream_infer(session_id=session_id,
                                      input_ids=input_ids,
                                      gen_config=gen_config,
                                      adapter_name=adapter_name):
    pass

@grimoire
Collaborator

I can replicate the error on V100 now. It seems that the flash-attention implementation does not support devices with sm < 80.
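
For anyone checking whether their GPU is affected, the compute capability can be queried with PyTorch (a small sketch, not part of lmdeploy):

import torch

# V100 reports (7, 0), i.e. sm70; Ampere cards such as A100/3090 report (8, x)
major, minor = torch.cuda.get_device_capability(0)
print(f'sm{major}{minor}')
if major < 8:
    print('sm < 80: the flash-attention kernel discussed above will not run on this device')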

@Cloopen-ReLiNK

Cloopen-ReLiNK commented Jan 23, 2024

On a 3090 with cuda11.6, both adapters can run normally.
(screenshots of the two runs omitted)
But the generated results are inconsistent; the question was "你好".

@Cloopen-ReLiNK

Is there any solution for using V100? flash attention v1?

@grimoire
Collaborator

grimoire commented Jan 23, 2024

But the generated results are inconsistent

This might be caused by random sampling.

Is there any solution for using V100?

I am trying to find a fix; please allow me some time.
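
To rule out sampling noise when comparing the two adapters, one option is to make decoding effectively greedy; a minimal sketch using the GenerationConfig fields already shown in this thread:

from lmdeploy import GenerationConfig

# top_k=1 always picks the highest-probability token, so repeated runs
# on the same prompt should produce the same output
gen_config = GenerationConfig(top_k=1,
                              temperature=1.0,
                              max_new_tokens=1024)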

@grimoire
Collaborator

@Cloopen-ReLiNK #1027 Please give it a try.

@Cloopen-ReLiNK

@Cloopen-ReLiNK #1027 Please give it a try.

Solved.

@seanxuu
Author

seanxuu commented Jan 24, 2024

Hi, you can try out the PyTorch engine. [...]

Can it achieve the same performance as the TurboMind engine? In my experience the TurboMind engine is better than the PyTorch engine.

@grimoire
Collaborator

grimoire commented Jan 24, 2024

Can it achieve the same performance as the TurboMind engine?

No, the PyTorch engine is slower.

@seanxuu
Author

seanxuu commented Jan 24, 2024

Can it achieve the same performance as the TurboMind engine?

No, the PyTorch engine is slower.

Got it. Thank you.
