[feat(backend)] Alignment checker for browsing agent #5105

Open
wants to merge 10 commits into
base: main

@s-aniruddha s-aniruddha commented Nov 18, 2024

End-user friendly description of the problem this fixes or functionality that this introduces
Recent work (https://scale.com/research/browser-art) showed that LLM-based agents do not refuse harmful instructions even when their backbone LLMs do. To counter this, this PR introduces a guardrail that detects and prevents unsafe behaviour by the browsing agent.

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

A guardrail that uses the agent's underlying LLM to:
* Examine the user's request and check whether it is harmful.
* Examine the content the agent enters in a textbox (the argument of the “fill” browser action) and check whether it is harmful.

If either check flags harmful content, the guardrail emits a change_agent_state action that sets the AgentState to ERROR, which stops the agent from proceeding.

To enable this feature, set the check_browsing_alignment attribute of the InvariantAnalyzer object to True and initialise its guardrail_llm attribute with an LLM object.
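
For readers skimming the description, the two checks can be sketched roughly as below. This is a minimal, self-contained illustration and not the code in this PR: `BrowsingAlignmentGuardrail`, `extract_fill_text`, `FILL_PATTERN`, and the `is_harmful` callable (a placeholder for the call to the guardrail LLM) are hypothetical names, the `AgentState` and `ChangeAgentStateAction` classes are simplified stand-ins for the real ones, and the `fill("<element id>", "<text>")` string format is an assumption about how the browser action is serialised.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class AgentState(Enum):
    """Simplified stand-in for the agent state enum (only members used here)."""
    RUNNING = "running"
    ERROR = "error"


@dataclass
class ChangeAgentStateAction:
    """Simplified stand-in for the change_agent_state action."""
    agent_state: AgentState
    reason: str


# Assumed action format: fill("<element id>", "<text>"). Group 3 captures <text>.
# Escaped quotes inside <text> are not handled in this sketch.
FILL_PATTERN = re.compile(
    r"""fill\(\s*(['"]).*?\1\s*,\s*(['"])(.*?)\2\s*\)""", re.DOTALL
)


def extract_fill_text(browser_actions: str) -> List[str]:
    """Return the text argument of every fill(...) call in an action string."""
    return [match.group(3) for match in FILL_PATTERN.finditer(browser_actions)]


class BrowsingAlignmentGuardrail:
    """Hypothetical guardrail: flags harmful requests and harmful fill content."""

    def __init__(
        self,
        is_harmful: Callable[[str], bool],  # placeholder for the guardrail LLM call
        check_browsing_alignment: bool = True,
    ) -> None:
        self.is_harmful = is_harmful
        self.check_browsing_alignment = check_browsing_alignment

    def _verdict(self, text: str, source: str) -> Optional[ChangeAgentStateAction]:
        # Emit an ERROR state change only when the check is enabled and fires.
        if self.check_browsing_alignment and self.is_harmful(text):
            return ChangeAgentStateAction(AgentState.ERROR, f"harmful {source}")
        return None

    def check_user_request(self, request: str) -> Optional[ChangeAgentStateAction]:
        """Check 1: the user's request itself."""
        return self._verdict(request, "user request")

    def check_browser_action(
        self, browser_actions: str
    ) -> Optional[ChangeAgentStateAction]:
        """Check 2: text the agent is about to type via fill(...)."""
        for text in extract_fill_text(browser_actions):
            action = self._verdict(text, "fill content")
            if action is not None:
                return action
        return None
```

In this sketch a returned `ChangeAgentStateAction` corresponds to stopping the agent, and `None` means the message or action passes through unchanged. Passing the classifier in as a callable keeps the example runnable without an LLM; in practice it would wrap a prompt to the guardrail LLM.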


Link of any specific issues this addresses


@neubig neubig left a comment

This looks great, thanks so much!

Do you have actual evaluation results on BrowserArt? Or is that still pending?
