[Bug] Does PytorchEngine Visual Model Support Prefix Caching? #2789

Open

OftenDream opened this issue Nov 21, 2024 · 1 comment

OftenDream commented Nov 21, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I am using PytorchEngineConfig to build a Qwen2-VL Triton service.
I sent the following three requests to the server in sequence:

  1. Request 1 -> Response A
  2. Request 2 -> Response B
  3. Request 3 (same content as Request 1) -> Response B

Ideally, Request 3 should have returned Response A. I suspect that Request 3 was matched to the same cached prefix as Request 2, resulting in the same response as Request 2.

Could you please confirm whether PytorchEngine supports prefix caching for visual models, and if so, how it might be affecting the responses?

Thank you for your assistance.

Reproduction

Build the Triton server:

        from lmdeploy import pipeline, PytorchEngineConfig

        # enable prefix caching in the PyTorch engine
        enable_prefix_caching = True
        engine_config = PytorchEngineConfig(
            tp=tp,
            cache_max_entry_count=cache_max_entry_count,
            enable_prefix_caching=enable_prefix_caching)
        self.engine = pipeline(model_path=model_path,
                               model_name=model_name,
                               backend_config=engine_config,
                               log_level='INFO')

Then send the request:

        import numpy as np
        import tritonclient.grpc as grpcclient
        from functools import partial

        with grpcclient.InferenceServerClient(self.server_url) as client:
            inputs = [
                self.prepare_tensor('max_tokens', np.array([max_tokens], dtype=np.int32)),
                self.prepare_tensor('temperature', np.array([temperature], dtype=np.float32)),
                self.prepare_tensor('top_p', np.array([top_p], dtype=np.float32)),
                self.prepare_tensor('top_k', np.array([top_k], dtype=np.int32)),
                self.prepare_tensor('stream', np.array([stream], dtype=np.bool_)),
                self.prepare_tensor('messages', np.array([prompt], dtype=np.object_)),
                self.prepare_tensor('text', np.array([text], dtype=np.object_)),
                self.prepare_tensor('ignore_eos', np.array([ig_eos], dtype=np.bool_))
            ]
            # keep the stream calls inside the `with` block so the client is still open
            client.start_stream(partial(self.stream_callback))
            client.async_stream_infer(self.model_name, inputs,
                                      sequence_start=True, sequence_end=True)
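
To help isolate the problem, here is a minimal check against the lmdeploy pipeline directly, bypassing Triton (sketch only; the model path, image files, and greedy sampling settings are placeholders, not the actual service configuration):

    from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig
    from lmdeploy.vl import load_image

    ENABLE_PREFIX_CACHING = True  # run once with True and once with False, then compare

    pipe = pipeline('Qwen/Qwen2-VL-7B-Instruct',   # placeholder model path
                    backend_config=PytorchEngineConfig(
                        enable_prefix_caching=ENABLE_PREFIX_CACHING))
    gen = GenerationConfig(top_k=1, max_new_tokens=128)  # greedy sampling so outputs are comparable

    img_a = load_image('image_a.jpg')  # placeholder images
    img_b = load_image('image_b.jpg')

    r1 = pipe(('describe this image', img_a), gen_config=gen)  # Request 1
    r2 = pipe(('describe this image', img_b), gen_config=gen)  # Request 2
    r3 = pipe(('describe this image', img_a), gen_config=gen)  # Request 3, same as Request 1
    print(r1.text == r3.text, r2.text == r3.text)  # expected: True, False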

Environment

cuda=11.8
lmdeploy=0.6.2
torch=2.4.0

Error traceback

No response

@grimoire (Collaborator)

Nope, we don't have a good way to match the multimodal features yet. Prefix caching is not supported for any VL model.
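
For context: prefix caching typically matches cached KV blocks by input token IDs, but a VL model expands each image into a run of identical placeholder tokens, so two prompts with different images can be indistinguishable at the token level. A rough conceptual sketch of the problem, not lmdeploy's actual cache code, with a made-up placeholder token id:

    # Conceptual sketch, not lmdeploy's implementation: block-hash prefix caching
    # keys each KV block by the token ids it covers, chained to the parent key.
    import hashlib

    IMAGE_PAD_ID = 151655  # hypothetical image-placeholder token id

    def prefix_keys(token_ids, block_size=4):
        keys, parent = [], b''
        for i in range(0, len(token_ids), block_size):
            block = token_ids[i:i + block_size]
            parent = hashlib.sha256(parent + repr(block).encode()).digest()
            keys.append(parent)
        return keys

    # Same text, two different images: once each image is replaced by placeholder
    # tokens, the token ids are identical, so the prefix keys collide and blocks
    # cached for the first image would be reused for the second.
    prompt_with_image_a = [101, IMAGE_PAD_ID, IMAGE_PAD_ID, IMAGE_PAD_ID, 2023]
    prompt_with_image_b = [101, IMAGE_PAD_ID, IMAGE_PAD_ID, IMAGE_PAD_ID, 2023]
    assert prefix_keys(prompt_with_image_a) == prefix_keys(prompt_with_image_b)

A correct match would presumably also have to account for the image content or its vision features, which is the part that is not solved yet.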
