search
+ send_email
is a bad idea: shaping the narrative around "excessive agency" in large language models
I've discovered a potential vulnerability in AI systems that use function calling. You can see a demonstration of the exploit via this chatbot interaction. Here's a brief overview:
- Start: Use the customer support AI agent implementation from Anthropic's documentation
- Modify: Add two functions to the agent -
search
&send_email
- Key point: The
search
function can return hidden instructions for the agent - Exploit: The hidden instructions instruct the agent to send an email containing a potentially malicious link
- Result: The agent is successfully manipulated into "sending" a malicious email to the person interacting with it
I think demonstrating this exploit is remarkable for a few reasons.
- Wanting to build a support agent is a motivating real-world application of AI that I can see people wanting to build upon
- Simple search and email functionality are features implementors will likely
want to add to an application like this
- The vector for prompt injection, "search", could be retrieval over prior support requests where one potential exploit could be attackers spamming the support queue with covert instructions
- Going from a working implementation to a potential exploit is very easy, and a naive implementation could cause a lot of harm
- The fact that it is incredibly easy to go from tutorial to harmful exploit shows that there's need for training for implementors and perhaps a conceptual framework for "excessive agency" to work through in the building process
- Showing that one of Anthropic's models can fall for this is one thing, but what about other models, including open source ones?
- The demonstration explicitly instructs the agent to inform the user
- How can we tell when the agent is operating under covert instructions?
- How could we prevent this type of manipulation?
- Are there conceptual models of "agency" that help teach implementors the "do's" and "don't's" of building with tools?
- Could we implement the "Duel LLMs" pattern via Simon Willison?
- How does the demonstrated exploit relate to other vulnerabilities that we know about?