
Inquiry on vLLM Inference for cogagent Model Compared to glm-4v-9b #25

Open

aptsunny opened this issue Jan 20, 2025 · 1 comment

@aptsunny

Feature request

Hello CogAgent Team,
I hope this message finds you well. I am currently working with the cogagent-9b-20241220 model and have noted that, as mentioned in the README, there is no support yet for inference with the vLLM framework.
I would like to inquire whether the vLLM inference process for the cogagent model is similar to that of the glm-4v-9b model. If there are significant differences, could you please provide guidance on what modifications would be necessary to adapt the cogagent model for vLLM inference?
Specifically, I am interested in understanding:
The differences in model inputs and outputs between the two models.
Any changes required in the model architecture or configuration for vLLM compatibility.
Any known issues or limitations when adapting cogagent for vLLM inference.
Thank you in advance for your assistance. I look forward to your insights and any recommendations you might have for enabling vLLM inference with the cogagent model.

Motivation

from PIL import Image
from vllm import LLM, SamplingParams

model_name = "THUDM/glm-4v-9b"

llm = LLM(model=model_name,
          tensor_parallel_size=1,
          max_model_len=8192,
          trust_remote_code=True,
          enforce_eager=True)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.2,
                                 max_tokens=1024,
                                 stop_token_ids=stop_token_ids)

prompt = "What's the content of the image?"
image = Image.open("your image").convert('RGB')
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}
outputs = llm.generate(inputs, sampling_params=sampling_params)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

Your contribution

xx

sixsixcoder self-assigned this Jan 23, 2025

@sixsixcoder (Collaborator)

Hello, thank you for your interest in CogAgent's vLLM support. Regarding the differences between the modeling_chatglm.py of THUDM/cogagent-9b-20241220 and that of GLM-4v-9b: after auditing the open-source code of both models, I found that the main differences lie in two aspects.

  1. CogAgent uses a non-interleaved rotary position encoding, while GLM-4v-9b uses an interleaved one.
  2. GLM-4v-9b's position_ids repeat the same position num_patches times in the middle of the sequence, while CogAgent uses an incremental sequence covering [0, len(input_ids) + num_patches - 1], where num_patches = (image_size // patch_size // 2) ** 2 (see the sketch after this list).
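
For illustration, here is a minimal sketch of the position_id difference in point 2. The image_size, patch_size, and image insertion point below are placeholder values chosen for the example, not taken from either model's config; the authoritative construction lives in each model's modeling_chatglm.py.

image_size, patch_size = 1120, 14                     # illustrative values; check each model's config.json
num_patches = (image_size // patch_size // 2) ** 2    # formula from point 2 above

input_ids = list(range(16))                           # dummy text token ids
image_pos = 4                                         # hypothetical index of the image placeholder token

# GLM-4v-9b style: one position repeated num_patches times in the middle of the sequence.
before = list(range(image_pos + 1))                   # positions of text tokens before the image
patches = [image_pos + 1] * num_patches               # the same position for every image patch
after = [image_pos + 2 + i for i in range(len(input_ids) - image_pos - 1)]
glm4v_position_ids = before + patches + after

# CogAgent style: a single incremental sequence covering text tokens plus image patches.
cogagent_position_ids = list(range(len(input_ids) + num_patches))

print(num_patches)                                    # 1600 with the values assumed above
print(len(glm4v_position_ids), len(cogagent_position_ids))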

These differences mean their inference code in vLLM is different. In vLLM's multi-modal model code (see this PR: vllm-project/vllm#11742), CogAgent mainly requires setting is_neox_style to True (the sketch below illustrates what this flag controls), which allows vLLM inference to run smoothly; however, the inference performance is not as good as using transformers directly. We are exploring ways to improve CogAgent's inference performance in vLLM.
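
For context, is_neox_style controls how the rotary dimensions are paired when the rotation is applied. The functions below are a framework-independent sketch of the two pairing schemes, not the actual vLLM implementation; cos and sin are assumed to have shape (..., head_dim // 2) and broadcast over x.

import torch

def apply_rope_neox(x, cos, sin):
    # Non-interleaved (GPT-NeoX style, is_neox_style=True):
    # dimension i is rotated together with dimension i + d/2.
    d = x.shape[-1]
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)

def apply_rope_interleaved(x, cos, sin):
    # Interleaved (GPT-J style, is_neox_style=False):
    # dimension 2i is rotated together with dimension 2i + 1.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)
    return rotated.flatten(-2)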

If you would like a minimal demo of CogAgent inference with vLLM, you can refer to the code below. As stated in the README, CogAgent requires strict control over the input fields, including task, platform, format, and history_step, and an image is mandatory. The output is also formatted, and you can refer to the format specifications for guidance.

from PIL import Image
from vllm import LLM, SamplingParams

model_name = "THUDM/cogagent-9b-20241220"

def process_inputs():
    # Build the strictly formatted CogAgent query: task, history steps, platform, and output format.
    task = "Mark emails as read"
    platform_str = "(Platform: Mac)\n"
    history_str = "\nHistory steps: "
    format_str = "(Answer in Action-Operation-Sensitive format.)"
    query = f"Task: {task}{history_str}\n{platform_str}{format_str}"
    return query

llm = LLM(model=model_name,
          tensor_parallel_size=1,
          max_model_len=8192,
          trust_remote_code=True,
          enforce_eager=True)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.2,
                                 max_tokens=1024,
                                 stop_token_ids=stop_token_ids)

prompt = process_inputs()
image = Image.open("your image.png").convert('RGB')
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}
outputs = llm.generate(inputs, sampling_params=sampling_params)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

If you have any questions, please feel free to contact me at any time. Chinese New Year is coming soon, and I wish you a happy Chinese New Year!
