Name		Name	Last commit message	Last commit date
parent directory ..
img		img
.gitignore		.gitignore
Analyze.ipynb		Analyze.ipynb
EXPLOIT.md		EXPLOIT.md
Makefile		Makefile
Mitigating-RAG.org		Mitigating-RAG.org
README.md		README.md
app.py		app.py
chatbot.py		chatbot.py
config.py		config.py
write-up.org		write-up.org

README.md

`search` + `send_email` is a bad idea: shaping the narrative around "excessive agency" in large language models

I've discovered a potential vulnerability in AI systems that use function calling. You can see a demonstration of the exploit via this chatbot interaction. Here's a brief overview:

The Exploit

Start: Use the customer support AI agent implementation from Anthropic's documentation
Modify: Add two functions to the agent - search & send_email
Key point: The search function can return hidden instructions for the agent
Exploit: The hidden instructions instruct the agent to send an email containing a potentially malicious link
Result: The agent is successfully manipulated into "sending" a malicious email to the person interacting with it

Relevance and key points

I think demonstrating this exploit is remarkable for a few reasons.

Wanting to build a support agent is a motivating real-world application of AI that I can see people wanting to build upon
Simple search and email functionality are features implementors will likely want to add to an application like this
1. The vector for prompt injection, "search", could be retrieval over prior support requests where one potential exploit could be attackers spamming the support queue with covert instructions
Going from a working implementation to a potential exploit is very easy, and a naive implementation could cause a lot of harm
The fact that it is incredibly easy to go from tutorial to harmful exploit shows that there's need for training for implementors and perhaps a conceptual framework for "excessive agency" to work through in the building process

Questions for future research

Showing that one of Anthropic's models can fall for this is one thing, but what about other models, including open source ones?
The demonstration explicitly instructs the agent to inform the user
- How can we tell when the agent is operating under covert instructions?
How could we prevent this type of manipulation?
- Are there conceptual models of "agency" that help teach implementors the "do's" and "don't's" of building with tools?
- Could we implement the "Duel LLMs" pattern via Simon Willison?
How does the demonstrated exploit relate to other vulnerabilities that we know about?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

customer-support-agent

customer-support-agent

README.md

`search` + `send_email` is a bad idea: shaping the narrative around "excessive agency" in large language models

The Exploit

Relevance and key points

Questions for future research

Files

customer-support-agent

Directory actions

More options

Directory actions

More options

Latest commit

History

customer-support-agent

Folders and files

parent directory

README.md

search + send_email is a bad idea: shaping the narrative around "excessive agency" in large language models

The Exploit

Relevance and key points

Questions for future research

`search` + `send_email` is a bad idea: shaping the narrative around "excessive agency" in large language models