[Bug] Does PytorchEngine Visual Model Support Prefix Caching? #2789

Open

OftenDream opened this issue Nov 21, 2024 · 1 comment

OftenDream commented Nov 21, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I am using PytorchEngineConfig to build a Qwen2-VL Triton service.
I sent the following three requests to the server in sequence:

  1. Request 1 -> Response A
  2. Request 2 -> Response B
  3. Request 3 (same content as Request 1) -> Response B

Ideally, Request 3 should have returned Response A. I suspect that Request 3 was matched to the same cached prefix as Request 2, resulting in the same response as Request 2.

Could you please confirm whether PytorchEngine supports prefix caching for visual models, and if so, how it might be affecting the responses?

Thank you for your assistance.

Reproduction

Build the Triton server:

        from lmdeploy import pipeline, PytorchEngineConfig

        # enable prefix caching in the PyTorch engine
        enable_prefix_caching = True
        engine_config = PytorchEngineConfig(
            tp=tp,
            cache_max_entry_count=cache_max_entry_count,
            enable_prefix_caching=enable_prefix_caching)
        self.engine = pipeline(model_path=model_path,
                               model_name=model_name,
                               backend_config=engine_config,
                               log_level='INFO')

Then send the request:

        import numpy as np
        import tritonclient.grpc as grpcclient
        from functools import partial

        with grpcclient.InferenceServerClient(self.server_url) as client:
            inputs = [
                self.prepare_tensor('max_tokens', np.array([max_tokens], dtype=np.int32)),
                self.prepare_tensor('temperature', np.array([temperature], dtype=np.float32)),
                self.prepare_tensor('top_p', np.array([top_p], dtype=np.float32)),
                self.prepare_tensor('top_k', np.array([top_k], dtype=np.int32)),
                self.prepare_tensor('stream', np.array([stream], dtype=np.bool_)),
                self.prepare_tensor('messages', np.array([prompt], dtype=np.object_)),
                self.prepare_tensor('text', np.array([text], dtype=np.object_)),
                self.prepare_tensor('ignore_eos', np.array([ig_eos], dtype=np.bool_))
            ]
            # keep the stream calls inside the `with` block so the client is still open
            client.start_stream(partial(self.stream_callback))
            client.async_stream_infer(self.model_name, inputs,
                                      sequence_start=True, sequence_end=True)
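
To help isolate the problem, here is a minimal check against the lmdeploy pipeline directly, bypassing Triton (sketch only; the model path, image files, and greedy sampling settings are placeholders, not the actual service configuration):

    from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig
    from lmdeploy.vl import load_image

    ENABLE_PREFIX_CACHING = True  # run once with True and once with False, then compare

    pipe = pipeline('Qwen/Qwen2-VL-7B-Instruct',   # placeholder model path
                    backend_config=PytorchEngineConfig(
                        enable_prefix_caching=ENABLE_PREFIX_CACHING))
    gen = GenerationConfig(top_k=1, max_new_tokens=128)  # greedy sampling so outputs are comparable

    img_a = load_image('image_a.jpg')  # placeholder images
    img_b = load_image('image_b.jpg')

    r1 = pipe(('describe this image', img_a), gen_config=gen)  # Request 1
    r2 = pipe(('describe this image', img_b), gen_config=gen)  # Request 2
    r3 = pipe(('describe this image', img_a), gen_config=gen)  # Request 3, same as Request 1
    print(r1.text == r3.text, r2.text == r3.text)  # expected: True, False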

Environment

cuda=11.8
lmdeploy=0.6.2
torch=2.4.0

Error traceback

No response

@grimoire (Collaborator)

Nope, we don't have a good way to match the multimodal features yet. Prefix caching is not supported for any VL model.
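
For context: prefix caching typically matches cached KV blocks by input token IDs, but a VL model expands each image into a run of identical placeholder tokens, so two prompts with different images can be indistinguishable at the token level. A rough conceptual sketch of the problem, not lmdeploy's actual cache code, with a made-up placeholder token id:

    # Conceptual sketch, not lmdeploy's implementation: block-hash prefix caching
    # keys each KV block by the token ids it covers, chained to the parent key.
    import hashlib

    IMAGE_PAD_ID = 151655  # hypothetical image-placeholder token id

    def prefix_keys(token_ids, block_size=4):
        keys, parent = [], b''
        for i in range(0, len(token_ids), block_size):
            block = token_ids[i:i + block_size]
            parent = hashlib.sha256(parent + repr(block).encode()).digest()
            keys.append(parent)
        return keys

    # Same text, two different images: once each image is replaced by placeholder
    # tokens, the token ids are identical, so the prefix keys collide and blocks
    # cached for the first image would be reused for the second.
    prompt_with_image_a = [101, IMAGE_PAD_ID, IMAGE_PAD_ID, IMAGE_PAD_ID, 2023]
    prompt_with_image_b = [101, IMAGE_PAD_ID, IMAGE_PAD_ID, IMAGE_PAD_ID, 2023]
    assert prefix_keys(prompt_with_image_a) == prefix_keys(prompt_with_image_b)

A correct match would presumably also have to account for the image content or its vision features, which is the part that is not solved yet.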
