Letting an agent view an image returned from a tool -- what format to use? #25881
Example Code

from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_community.tools import tool
from langchain_core.prompts.chat import ChatPromptTemplate, MessagesPlaceholder
from PIL import Image, ImageDraw, ImageFont
import os
os.environ["OPENAI_API_KEY"] = "<my-api-key-here>"

# create an image for the MWE
image = Image.new('RGB', (100, 50), 'white')
draw = ImageDraw.Draw(image)
draw.text((20, 10), "Hello!", fill="black", font = ImageFont.load_default(size = 24))
image.save('secret_image.png')

# define tools
@tool
def load_secret_image():
    "load a secret image that I have prepared for you"
    img = Image.open("secret_image.png")
    return img

# build model
prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a helpful AI bot'),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])
llm = ChatOpenAI(model="gpt-4o-mini")
agent = create_openai_tools_agent(llm, tools = [load_secret_image], prompt = prompt)
executor = AgentExecutor(agent = agent, tools=[load_secret_image])

# run model
for event in executor.stream({"input": "What word is present in the secret image?"}):
    print(event)

# relevant part of what is printed:
#{'actions': [ToolAgentAction(tool='load_secret_image', tool_input={}, ...
#{'steps': [AgentStep(action=ToolAgentAction(tool='load_secret_image', ..., observation=<PIL.PngImagePlugin.PngImageFile image mode=RGB size=100x50 at 0x120CF2360>)], ...
#{'output': "I can't directly analyze the contents of the image. However, if you can describe the image or provide any context, I might be able to help you with more information!" ...

Description

I'm trying to figure out how a tool should pass back multimodal output (in particular, an image) so an agent can comprehend it and continue working with it. I want the agent to analyze the image the same as though it had been included in the original human prompt, but it seems unable to parse it. The ultimate use case will be calling a plotting tool, since I think the agent might have an easier time comprehending visually presented data. For the MWE above, though, I am just loading a small image with the text 'Hello!' and asking the model what text is present in the image.

Things I have tested:

Returning a PIL image from the tool

In the MWE above, I simply passed back the image as loaded by PIL, with no further formatting.

Returning a ToolMessage with the image in base64

To include an image in the original human prompt, it should be passed in a specific format using its base64 representation (as seen here). I tried using this format, wrapped in a ToolMessage:

import base64
from langchain_core.messages import ToolMessage
@tool
def load_secret_image():
"load a secret image that I have prepared for you as a human message"
with open('secret_image.png', 'rb') as image_file:
binary_data = image_file.read()
image_data = base64.b64encode(binary_data).decode("utf-8")
return ToolMessage(
content = [
{"type": "text", "text": ""},
{
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{image_data}"}
}],
tool_call_id = "testid"
    )

Upon running the model as in the MWE, the tool is called, but the agent outputs: "The secret image does not contain any visible text..."

Passing the ToolMessage with the image in base64 straight to the model (not through the tool)

To verify that the model should be able to read the word in the image, I tried passing the ToolMessage directly to the LLM:

pure_message = load_secret_image({})
llm.invoke([
("human", "What word is present in this image?"),
pure_message
])

This raised an error (from OpenAI).

Wrapping the base64 image in a HumanMessage instead

I modified the tool to return a HumanMessage instead:

from langchain_core.messages import HumanMessage
@tool
def load_secret_image_human():
"load a secret image that I have prepared for you as a human message"
with open('secret_image.png', 'rb') as image_file:
# Read the image in binary mode
binary_data = image_file.read()
# Encode the binary data to base64
image_data = base64.b64encode(binary_data).decode("utf-8")
return HumanMessage(
content = [
{"type": "text", "text": ""},
{
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{image_data}"}
}]
)
pure_message = load_secret_image_human({})
llm.invoke([
    ('system', 'You are a helpful AI bot'),
    ("human", "What word is present in this image?"),
    pure_message
])

It returned the word in the image, as expected.

Returning a HumanMessage from the tool

Running the full agent with this HumanMessage, it does not understand the output from the tool in the same way:

agent = create_openai_tools_agent(llm, tools = [load_secret_image_human], prompt = prompt)
executor = AgentExecutor(agent = agent, tools=[load_secret_image_human])
for event in executor.stream({"input": "What word is present in the secret image?"}):
    print(event)

The agent outputs: "The secret image is encoded in base64, and it seems to be a PNG image. However, I cannot directly analyze or extract text from images..."
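
The experiments above suggest the model only reads the image when it arrives as a human-style multimodal message rather than as a stringified tool observation. As a minimal sketch of that pattern, not a fix for the agent loop itself, the image can be attached to a direct chat-model call once the tool step has produced it; the image_question() helper name below is hypothetical, and it reuses secret_image.png and the gpt-4o-mini ChatOpenAI model from the MWE.

import base64
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def image_question(path: str, question: str) -> HumanMessage:
    # Hypothetical helper: bundle a local image and a question into one
    # multimodal human turn, using the same base64 content-block format
    # that worked in the direct llm.invoke() test above.
    with open(path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    return HumanMessage(content=[
        {"type": "text", "text": question},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
    ])

# After the tool step has produced secret_image.png, query the model directly
# instead of routing the image through the agent scratchpad.
response = llm.invoke([image_question("secret_image.png", "What word is present in this image?")])
print(response.content)

This mirrors the direct llm.invoke() call above that did read the word; it only packages the working format behind a helper rather than changing how the agent handles tool output.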
Replies: 1 comment 14 replies
@oe-andreas Hello! To let an agent view an image returned from a tool in LangChain, you should use the following format:

from langchain.agents import initialize_agent, load_tools
from skimage import io
import cv2
# Load the necessary tools and initialize the agent
tools = load_tools(["dalle-image-generator"])
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)
# Run the agent to generate the image
output = agent.run("Create an image of a halloween night at a haunted museum")
# Display the image
image_url = output # Assuming the output is the URL of the generated image
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    from google.colab.patches import cv2_imshow
    image = io.imread(image_url)
    cv2_imshow(image)
else:
    image = io.imread(image_url)
    cv2.imshow("Generated Image", image)
    cv2.waitKey(0)  # wait for a keyboard input
    cv2.destroyAllWindows()

This function uses the ...
@oe-andreas I had a similar issue while using create_react_agent() from langgraph.prebuilt.

I found that in my case create_react_agent() creates a graph with a ToolNode, which is responsible for calling tools (https://langchain-ai.github.io/langgraph/reference/prebuilt/#toolnode), and that this ToolNode turns the content of the message it gets back from the tool into a str (see langgraph/prebuilt/tool_node.py).

As you can see there, the content of the ToolMessage is converted to str. If your .content was a dict which describes an image, it gets converted to str with json.dumps(), and as a result this ToolMessage will be treated by the LLM/chat model like a text reply.

I fixed that by literally copying the t…
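
The reply above is cut off, but the stringification it describes can be illustrated without relying on langgraph internals. The snippet below is only a sketch of that effect, assuming the multimodal tool output is represented as the usual list of content blocks and that the tool-calling layer coerces non-string results to text with json.dumps(); the base64 payload is placeholder data.

import json
from langchain_core.messages import ToolMessage

# A multimodal tool result: a list of content blocks including a base64 image
# (truncated placeholder payload).
content_blocks = [
    {"type": "text", "text": "here is the secret image"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}},
]

# If the tool-calling layer coerces non-string output to text, the chat model
# receives one long JSON string and treats it as a plain text reply:
flattened = json.dumps(content_blocks)
print(type(flattened))  # <class 'str'>, the image is now just characters inside text

# Kept as structured content blocks, the same data stays in the multimodal
# format that vision-capable chat models accept inside a message's content:
msg = ToolMessage(content=content_blocks, tool_call_id="call_123")
print(type(msg.content))  # <class 'list'>

Whether a given agent or prebuilt graph preserves the content-block list or flattens it this way is version-dependent, which matches the difference observed in the question between invoking the chat model directly and going through the agent.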