
Inquiry on vLLM Inference for cogagent Model Compared to glm-4v-9b #25

Open

aptsunny opened this issue Jan 20, 2025 · 1 comment

@aptsunny

Feature request

Hello CogAgent Team,
I hope this message finds you well. I am currently working with the cogagent-9b-20241220 model and have noted that, as mentioned in the README, there is no support yet for inference with the vLLM framework.
I would like to inquire whether the vLLM inference process for the cogagent model is similar to that of the glm-4v-9b model. If there are significant differences, could you please provide guidance on what modifications would be necessary to adapt the cogagent model for vLLM inference?
Specifically, I am interested in understanding:
The differences in model inputs and outputs between the two models.
Any changes required in the model architecture or configuration for vLLM compatibility.
Any known issues or limitations when adapting cogagent for vLLM inference.
Thank you in advance for your assistance. I look forward to your insights and any recommendations you might have for enabling vLLM inference with the cogagent model.

Motivation

from PIL import Image
from vllm import LLM, SamplingParams

model_name = "THUDM/glm-4v-9b"

llm = LLM(model=model_name,
          tensor_parallel_size=1,
          max_model_len=8192,
          trust_remote_code=True,
          enforce_eager=True)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.2,
                                 max_tokens=1024,
                                 stop_token_ids=stop_token_ids)

prompt = "What's the content of the image?"
image = Image.open("your image").convert('RGB')
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}
outputs = llm.generate(inputs, sampling_params=sampling_params)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

Your contribution

xx

sixsixcoder self-assigned this Jan 23, 2025

@sixsixcoder (Collaborator)

Hello, thank you for your interest in CogAgent's vLLM support. Regarding the differences between the modeling_chatglm.py of THUDM/cogagent-9b-20241220 and that of GLM-4v-9b: after auditing the open-source code of both models, I found that the main differences lie in two aspects.

  1. CogAgent uses a non-interleaved rotary position encoding, while GLM-4v-9b uses an interleaved one.
  2. GLM-4v-9b's position_ids repeat the same position num_patches times in the middle of the sequence, while CogAgent uses an incremental sequence covering [0, len(input_ids) + num_patches - 1], where num_patches = (image_size // patch_size // 2) ** 2 (see the sketch after this list).
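
For illustration, here is a minimal sketch of the position_id difference in point 2. The image_size, patch_size, and image insertion point below are placeholder values chosen for the example, not taken from either model's config; the authoritative construction lives in each model's modeling_chatglm.py.

image_size, patch_size = 1120, 14                     # illustrative values; check each model's config.json
num_patches = (image_size // patch_size // 2) ** 2    # formula from point 2 above

input_ids = list(range(16))                           # dummy text token ids
image_pos = 4                                         # hypothetical index of the image placeholder token

# GLM-4v-9b style: one position repeated num_patches times in the middle of the sequence.
before = list(range(image_pos + 1))                   # positions of text tokens before the image
patches = [image_pos + 1] * num_patches               # the same position for every image patch
after = [image_pos + 2 + i for i in range(len(input_ids) - image_pos - 1)]
glm4v_position_ids = before + patches + after

# CogAgent style: a single incremental sequence covering text tokens plus image patches.
cogagent_position_ids = list(range(len(input_ids) + num_patches))

print(num_patches)                                    # 1600 with the values assumed above
print(len(glm4v_position_ids), len(cogagent_position_ids))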

These differences mean their inference code in vLLM is different. In vLLM's multi-modal model code (see this PR: vllm-project/vllm#11742), CogAgent mainly requires setting is_neox_style to True (the sketch below illustrates what this flag controls), which allows vLLM inference to run smoothly; however, the inference performance is not as good as using transformers directly. We are exploring ways to improve CogAgent's inference performance in vLLM.
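
For context, is_neox_style controls how the rotary dimensions are paired when the rotation is applied. The functions below are a framework-independent sketch of the two pairing schemes, not the actual vLLM implementation; cos and sin are assumed to have shape (..., head_dim // 2) and broadcast over x.

import torch

def apply_rope_neox(x, cos, sin):
    # Non-interleaved (GPT-NeoX style, is_neox_style=True):
    # dimension i is rotated together with dimension i + d/2.
    d = x.shape[-1]
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)

def apply_rope_interleaved(x, cos, sin):
    # Interleaved (GPT-J style, is_neox_style=False):
    # dimension 2i is rotated together with dimension 2i + 1.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)
    return rotated.flatten(-2)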

If you would like a minimal demo of CogAgent inference with vLLM, you can refer to the code below. As stated in the README, CogAgent requires strict control over the input fields, including task, platform, format, and history_step, and an image is mandatory. The output is also formatted, and you can refer to the format specifications for guidance.

from PIL import Image
from vllm import LLM, SamplingParams

model_name = "THUDM/cogagent-9b-20241220"

def process_inputs():
    # Build the strictly formatted CogAgent query: task, history steps, platform, and output format.
    task = "Mark emails as read"
    platform_str = "(Platform: Mac)\n"
    history_str = "\nHistory steps: "
    format_str = "(Answer in Action-Operation-Sensitive format.)"
    query = f"Task: {task}{history_str}\n{platform_str}{format_str}"
    return query

llm = LLM(model=model_name,
          tensor_parallel_size=1,
          max_model_len=8192,
          trust_remote_code=True,
          enforce_eager=True)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.2,
                                 max_tokens=1024,
                                 stop_token_ids=stop_token_ids)

prompt = process_inputs()
image = Image.open("your image.png").convert('RGB')
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}
outputs = llm.generate(inputs, sampling_params=sampling_params)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

If you have any questions, please feel free to contact me at any time. Chinese New Year is coming soon, and I wish you a happy Chinese New Year!
