Letting an agent view an image returned from a tool -- what format to use? #25881
Example Code

from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_community.tools import tool
from langchain_core.prompts.chat import ChatPromptTemplate, MessagesPlaceholder
from PIL import Image, ImageDraw, ImageFont
import os
os.environ["OPENAI_API_KEY"] = "<my-api-key-here>"

# create an image for the MWE
image = Image.new('RGB', (100, 50), 'white')
draw = ImageDraw.Draw(image)
draw.text((20, 10), "Hello!", fill="black", font = ImageFont.load_default(size = 24))
image.save('secret_image.png')

# define tools
@tool
def load_secret_image():
    "load a secret image that I have prepared for you"
    img = Image.open("secret_image.png")
    return img

# build model
prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a helpful AI bot'),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])
llm = ChatOpenAI(model="gpt-4o-mini")
agent = create_openai_tools_agent(llm, tools = [load_secret_image], prompt = prompt)
executor = AgentExecutor(agent = agent, tools=[load_secret_image])

# run model
for event in executor.stream({"input": "What word is present in the secret image?"}):
    print(event)

# relevant part of what is printed:
#{'actions': [ToolAgentAction(tool='load_secret_image', tool_input={}, ...
#{'steps': [AgentStep(action=ToolAgentAction(tool='load_secret_image', ..., observation=<PIL.PngImagePlugin.PngImageFile image mode=RGB size=100x50 at 0x120CF2360>)], ...
#{'output': "I can't directly analyze the contents of the image. However, if you can describe the image or provide any context, I might be able to help you with more information!" ...

Description

I'm trying to figure out how a tool should pass back multimodal output (in particular, an image) so an agent can comprehend it and continue working with it. I want the agent to analyze the image the same as though it had been included in the original human prompt, but it seems unable to parse it. The ultimate use case will be calling a plotting tool, since I think the agent might have an easier time comprehending visually presented data. For the MWE above, though, I am just loading a small image with the text 'Hello!' and asking the model what text is present in the image.

Things I have tested:

Returning a PIL image from the tool

In the MWE above, I simply passed back the image as loaded by PIL, with no further formatting.

Returning a ToolMessage with the image in base64

To include an image in the original human prompt, it should be passed in a specific format using its base64 representation (as seen here). I tried using this format, wrapped in a ToolMessage:

import base64
from langchain_core.messages import ToolMessage
@tool
def load_secret_image():
"load a secret image that I have prepared for you as a human message"
with open('secret_image.png', 'rb') as image_file:
binary_data = image_file.read()
image_data = base64.b64encode(binary_data).decode("utf-8")
return ToolMessage(
content = [
{"type": "text", "text": ""},
{
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{image_data}"}
}],
tool_call_id = "testid"
    )

Upon running the model as in the MWE, the tool is called, but the agent outputs: "The secret image does not contain any visible text..."

Passing the ToolMessage with the image in base64 straight to the model (not through the tool)

To verify that the model should be able to read the word in the image, I tried passing the ToolMessage directly to the LLM:

pure_message = load_secret_image({})
llm.invoke([
("human", "What word is present in this image?"),
pure_message
])

This raised an error (from OpenAI).

Wrapping the base64 image in a HumanMessage instead

I modified the tool to return a HumanMessage instead:

from langchain_core.messages import HumanMessage
@tool
def load_secret_image_human():
"load a secret image that I have prepared for you as a human message"
with open('secret_image.png', 'rb') as image_file:
# Read the image in binary mode
binary_data = image_file.read()
# Encode the binary data to base64
image_data = base64.b64encode(binary_data).decode("utf-8")
return HumanMessage(
content = [
{"type": "text", "text": ""},
{
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{image_data}"}
}]
)
pure_message = load_secret_image_human({})
llm.invoke([
    ('system', 'You are a helpful AI bot'),
    ("human", "What word is present in this image?"),
    pure_message
])

It returned the word in the image, as expected.

Returning a HumanMessage from the tool

Running the full agent with this HumanMessage, it does not understand the output from the tool in the same way:

agent = create_openai_tools_agent(llm, tools = [load_secret_image_human], prompt = prompt)
executor = AgentExecutor(agent = agent, tools=[load_secret_image_human])
for event in executor.stream({"input": "What word is present in the secret image?"}):
    print(event)

The agent outputs: "The secret image is encoded in base64, and it seems to be a PNG image. However, I cannot directly analyze or extract text from images..."
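
The experiments above suggest the model only reads the image when it arrives as a human-style multimodal message rather than as a stringified tool observation. As a minimal sketch of that pattern, not a fix for the agent loop itself, the image can be attached to a direct chat-model call once the tool step has produced it; the image_question() helper name below is hypothetical, and it reuses secret_image.png and the gpt-4o-mini ChatOpenAI model from the MWE.

import base64
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def image_question(path: str, question: str) -> HumanMessage:
    # Hypothetical helper: bundle a local image and a question into one
    # multimodal human turn, using the same base64 content-block format
    # that worked in the direct llm.invoke() test above.
    with open(path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    return HumanMessage(content=[
        {"type": "text", "text": question},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}},
    ])

# After the tool step has produced secret_image.png, query the model directly
# instead of routing the image through the agent scratchpad.
response = llm.invoke([image_question("secret_image.png", "What word is present in this image?")])
print(response.content)

This mirrors the direct llm.invoke() call above that did read the word; it only packages the working format behind a helper rather than changing how the agent handles tool output.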
Replies: 1 comment 14 replies
@oe-andreas Hello! To let an agent view an image returned from a tool in LangChain, you should use the following format:

from langchain.agents import initialize_agent, load_tools
from skimage import io
import cv2
# Load the necessary tools and initialize the agent
tools = load_tools(["dalle-image-generator"])
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)
# Run the agent to generate the image
output = agent.run("Create an image of a halloween night at a haunted museum")
# Display the image
image_url = output # Assuming the output is the URL of the generated image
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    from google.colab.patches import cv2_imshow
    image = io.imread(image_url)
    cv2_imshow(image)
else:
    image = io.imread(image_url)
    cv2.imshow("Generated Image", image)
    cv2.waitKey(0)  # wait for a keyboard input
    cv2.destroyAllWindows()

This function uses the ...
@oe-andreas I had a similar issue while using create_react_agent() from langgraph.prebuilt.

I found that in my case create_react_agent() creates a graph with a ToolNode, which is responsible for calling tools (https://langchain-ai.github.io/langgraph/reference/prebuilt/#toolnode), and that this ToolNode turns the content of the message it gets back from the tool into a str (see langgraph/prebuilt/tool_node.py).

As you can see there, the content of the ToolMessage is converted to str. If your .content was a dict which describes an image, it gets converted to str with json.dumps(), and as a result this ToolMessage will be treated by the LLM/chat model like a text reply.

I fixed that by literally copying the t…
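
The reply above is cut off, but the stringification it describes can be illustrated without relying on langgraph internals. The snippet below is only a sketch of that effect, assuming the multimodal tool output is represented as the usual list of content blocks and that the tool-calling layer coerces non-string results to text with json.dumps(); the base64 payload is placeholder data.

import json
from langchain_core.messages import ToolMessage

# A multimodal tool result: a list of content blocks including a base64 image
# (truncated placeholder payload).
content_blocks = [
    {"type": "text", "text": "here is the secret image"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."}},
]

# If the tool-calling layer coerces non-string output to text, the chat model
# receives one long JSON string and treats it as a plain text reply:
flattened = json.dumps(content_blocks)
print(type(flattened))  # <class 'str'>, the image is now just characters inside text

# Kept as structured content blocks, the same data stays in the multimodal
# format that vision-capable chat models accept inside a message's content:
msg = ToolMessage(content=content_blocks, tool_call_id="call_123")
print(type(msg.content))  # <class 'list'>

Whether a given agent or prebuilt graph preserves the content-block list or flattens it this way is version-dependent, which matches the difference observed in the question between invoking the chat model directly and going through the agent.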