Skip to content

Latest commit

 

History

History

customer-support-agent

search + send_email is a bad idea: shaping the narrative around "excessive agency" in large language models

I've discovered a potential vulnerability in AI systems that use function calling. You can see a demonstration of the exploit via this chatbot interaction. Here's a brief overview:

The Exploit

  1. Start: Use the customer support AI agent implementation from Anthropic's documentation
  2. Modify: Add two functions to the agent - search & send_email
  3. Key point: The search function can return hidden instructions for the agent
  4. Exploit: The hidden instructions instruct the agent to send an email containing a potentially malicious link
  5. Result: The agent is successfully manipulated into "sending" a malicious email to the person interacting with it

Relevance and key points

I think demonstrating this exploit is remarkable for a few reasons.

  1. Wanting to build a support agent is a motivating real-world application of AI that I can see people wanting to build upon
  2. Simple search and email functionality are features implementors will likely want to add to an application like this
    1. The vector for prompt injection, "search", could be retrieval over prior support requests where one potential exploit could be attackers spamming the support queue with covert instructions
  3. Going from a working implementation to a potential exploit is very easy, and a naive implementation could cause a lot of harm
  4. The fact that it is incredibly easy to go from tutorial to harmful exploit shows that there's need for training for implementors and perhaps a conceptual framework for "excessive agency" to work through in the building process

Questions for future research

  1. Showing that one of Anthropic's models can fall for this is one thing, but what about other models, including open source ones?
  2. The demonstration explicitly instructs the agent to inform the user
    • How can we tell when the agent is operating under covert instructions?
  3. How could we prevent this type of manipulation?
  4. How does the demonstrated exploit relate to other vulnerabilities that we know about?