
MultimodalConversableAgent in autogenstudio? #1169

Closed
antoan opened this issue Jan 7, 2024 · 7 comments
Labels: 0.2 (Issues which are related to the pre 0.4 codebase), multimodal (language + vision, speech etc.)

Comments


antoan commented Jan 7, 2024

Is this currently possible, or are there plans to support it in the future?

@rickyloynd-microsoft
Contributor

@victordibia fyi

@victordibia
Collaborator

Hi @antoan,

Thanks for the note.
Currently, only the core autogen agent classes are supported: UserProxy and Assistant (GroupChat support is in development and on the roadmap).
We plan to start supporting more agent types from contrib in the future, but this is not yet on the roadmap.

If you could describe your envisioned use case in a bit more detail, that would help once we get there.

Contributor

sonichi commented Jan 8, 2024

In the meantime, @BeibinLi is thinking about implementing multimodal support in the core. Knowing the use case here would also help with that.

sonichi added the ui/deploy and multimodal (language + vision, speech etc.) labels on Jan 8, 2024
BeibinLi self-assigned this on Jan 8, 2024
Author

antoan commented Jan 10, 2024

I see, thanks for letting me know.

My use case involves periodic visual monitoring of an industrial hangar for anomalies via a camera stream, e.g. people present in the hangar when none should be present.

I initially intended to use a multimodal agent in conjunction with AutoGen Studio to render the anomalous detection frames to the user; a GUI is the only component I lack to complete the experience.

Please let me know if this is sufficient.

Collaborator

gagb commented Jan 16, 2024


There was already a P3 item for supporting contrib agents; I have appended multimodal to that list.


Alblahm commented May 31, 2024

It is working as it is now.
I'm using AutoGen Studio without any changes: you just have to add a skill in the Build > Skills tab and then add the newly created skill to your workflow. For instance, open the General Assistant workflow and add this skill to the primary_assistant.
Then you can use it to describe images or for any other text-image task. The only thing you have to take into account is the folder where the system looks for the OAI_CONFIG_LIST and the image.

The skill file I'm using is this one:

import autogen
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent


def describe_image_with_gp4o(task_description: str, image_name: str) -> str:
    """
    Describe the content of an image based on a given task description.

    Args:
        task_description (str): A description of what you want the agent to do.
        image_name (str): The name of the image file to be described.

    Returns:
        str: The description of the image content.
    """

    # Define the LLM configuration directly
    gpt4_llm_config = {
        "model": "gpt-4o",
        "temperature": 0.5,
        "max_tokens": 300,
    }

    # Create the multimodal conversable agent
    image_agent = MultimodalConversableAgent(
        name="image-explainer",
        max_consecutive_auto_reply=10,
        llm_config=gpt4_llm_config,
    )

    # Create the user proxy agent
    user_proxy = autogen.UserProxyAgent(
        name="User_proxy",
        system_message="A human admin.",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=0,
    )

    # Initiate the chat with the image agent
    user_proxy.initiate_chat(
        image_agent,
        message=f"What's in the image? <img {image_name}>. {task_description}",
    )

    # Return the agent's last reply from the chat history
    response = user_proxy.chat_messages[image_agent][-1]["content"]
    return response


# Example usage of the function:
# try:
#     description = describe_image_with_gp4o("Please describe the main objects and their colors.", "imagen_2.jpg")
#     print(f"Image description: {description}")
# except Exception as e:
#     print(f"An error occurred: {e}")

rysweet added the 0.2 (Issues which are related to the pre 0.4 codebase) and needs-triage labels on Oct 2, 2024
Collaborator

rysweet commented Oct 18, 2024

This is working.
