
MultimodalConversableAgent in autogenstudio? #1169

Closed
antoan opened this issue Jan 7, 2024 · 7 comments
Labels: 0.2 (Issues which are related to the pre 0.4 codebase), multimodal (language + vision, speech etc.)

Comments


antoan commented Jan 7, 2024

Is this currently possible, or are there plans to support it in the future?

@rickyloynd-microsoft
Contributor

@victordibia fyi

@victordibia
Collaborator

Hi @antoan,

Thanks for the note.
Currently, only the core autogen agent classes are supported: UserProxy and Assistant (GroupChat support is in development and on the roadmap).
We plan to start supporting more agent types from contrib in the future, but this is not yet on the roadmap.

If you could describe your envisioned use case in a bit more detail, that would help once we get there.

Contributor

sonichi commented Jan 8, 2024

In the meantime, @BeibinLi is thinking about implementing multimodal support in the core. Knowing the use case here would also help with that.

sonichi added the ui/deploy and multimodal (language + vision, speech etc.) labels on Jan 8, 2024
BeibinLi self-assigned this on Jan 8, 2024
Author

antoan commented Jan 10, 2024

I see, thanks for letting me know.

My use case involves periodic visual monitoring of an industrial hangar for anomalies via a camera stream, e.g. people present in the hangar when none should be present.

I initially intended to use a multimodal agent in conjunction with AutoGen Studio to render the anomalous detection frames to the user; a GUI is the only component I lack to complete the experience.

Please let me know if this is sufficient.

Collaborator

gagb commented Jan 16, 2024


There was already a P3 item for supporting contrib agents; I have appended multimodal to that list.


Alblahm commented May 31, 2024

It is working as it is now.
I'm using AutoGen Studio without any changes: you just have to add a skill in the Build > Skills tab and then add the newly created skill to your workflow. For instance, open the General Assistant workflow and add this skill to the primary_assistant.
Then you can use it to describe images or for any other text-image task. The only thing you have to take into account is the folder where the system looks for the OAI_CONFIG_LIST and the image.

The skill file I'm using is this one:

import autogen
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent


def describe_image_with_gp4o(task_description: str, image_name: str) -> str:
    """
    Describe the content of an image based on a given task description.

    Args:
        task_description (str): A description of what you want the agent to do.
        image_name (str): The name of the image file to be described.

    Returns:
        str: The description of the image content.
    """

    # Define the LLM configuration directly
    gpt4_llm_config = {
        "model": "gpt-4o",
        "temperature": 0.5,
        "max_tokens": 300,
    }

    # Create the multimodal conversable agent
    image_agent = MultimodalConversableAgent(
        name="image-explainer",
        max_consecutive_auto_reply=10,
        llm_config=gpt4_llm_config,
    )

    # Create the user proxy agent
    user_proxy = autogen.UserProxyAgent(
        name="User_proxy",
        system_message="A human admin.",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=0,
    )

    # Initiate the chat with the image agent
    user_proxy.initiate_chat(
        image_agent,
        message=f"What's in the image? <img {image_name}>. {task_description}",
    )

    # Return the agent's last reply from the chat history
    response = user_proxy.chat_messages[image_agent][-1]["content"]
    return response


# Example usage of the function:
# try:
#     description = describe_image_with_gp4o("Please describe the main objects and their colors.", "imagen_2.jpg")
#     print(f"Image description: {description}")
# except Exception as e:
#     print(f"An error occurred: {e}")

rysweet added the 0.2 (Issues which are related to the pre 0.4 codebase) and needs-triage labels on Oct 2, 2024
Collaborator

rysweet commented Oct 18, 2024

This is working.
