[feat(backend)] Alignment checker for browsing agent #5105

s-aniruddha · 2024-11-18T17:31:35Z

End-user friendly description of the problem this fixes or functionality that this introduces
Recent work (https://scale.com/research/browser-art) showed that agentic LLMs do not refuse harmful instructions even though the backbone LLMs do. To counter this, we introduce a guardrail that detects and prevents unsafe behaviour by the browsing agent.

Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

Guardrail feature that uses the underlying LLM of the agent to:
* Examine the user's request and check if it is harmful.
* Examine the content entered by the agent in a textbox (argument of the “fill” browser action) and check if it is harmful.
If the guardrail judges either input to be harmful, it emits a change_agent_state action that sets the AgentState to ERROR, which stops the agent from proceeding.

To enable this feature, set the check_browsing_alignment attribute of the InvariantAnalyzer object to True and initialise its guardrail_llm attribute with an LLM object.
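The control flow described above could be sketched roughly as follows. This is not the code from this PR: `GuardrailSketch`, `extract_fill_text`, the local `AgentState` enum, and the `is_harmful` callable (a stand-in for a call to the guardrail LLM) are all hypothetical names used for illustration, and the sketch assumes browser actions arrive as strings of the form `fill('<element id>', '<text>')`.

```python
import ast
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class AgentState(Enum):
    # Local stand-in for the real AgentState; only the two states needed here.
    RUNNING = "running"
    ERROR = "error"


def extract_fill_text(browser_action: str) -> Optional[str]:
    """Return the text argument of a call like fill('id', 'text'), else None.

    Assumption: the action is a single Python-style call expression.
    """
    try:
        node = ast.parse(browser_action.strip(), mode="eval").body
    except SyntaxError:
        return None
    if (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "fill"
        and len(node.args) >= 2
        and isinstance(node.args[1], ast.Constant)
        and isinstance(node.args[1].value, str)
    ):
        return node.args[1].value
    return None


@dataclass
class GuardrailSketch:
    # Hypothetical judge: returns True when the given text is harmful.
    # In the PR this role is played by the guardrail LLM.
    is_harmful: Callable[[str], bool]
    state: AgentState = AgentState.RUNNING

    def check_user_task(self, task: str) -> AgentState:
        # Check 1: the user's request itself.
        if self.is_harmful(task):
            self.state = AgentState.ERROR
        return self.state

    def check_browser_action(self, browser_action: str) -> AgentState:
        # Check 2: text the agent is about to type into a textbox.
        text = extract_fill_text(browser_action)
        if text is not None and self.is_harmful(text):
            self.state = AgentState.ERROR
        return self.state


# Usage with a trivial keyword-based judge in place of an LLM call.
guardrail = GuardrailSketch(is_harmful=lambda t: "attack" in t.lower())
guardrail.check_user_task("Book a flight to Tokyo")
guardrail.check_browser_action("fill('b1', 'plan an attack')")
```

In this sketch non-fill actions pass through unchecked, and once the state is ERROR it stays there, mirroring the "stops the agent from proceeding" behaviour described above.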

Link of any specific issues this addresses

neubig

This looks great, thanks so much!

Do you have actual evaluation results on BrowserArt? Or is that still pending?

s-aniruddha added 4 commits November 18, 2024 14:39

added usertask and fillaction checks for browsing agent alignment

0697fe9

added tests for usertask and fillaction checker

f27f597

changed judge_llm to guardrail_llm

30f43a7

Added description of browsing agent guardrails to README.md

e4dfb1c

neubig reviewed Nov 18, 2024

s-aniruddha and others added 6 commits November 19, 2024 14:29

Merge branch 'main' into browserart_defence

91bfbe9

Added newline at end of test_security.py

ee06358

Merge branch 'browserart_defence' of github.com:s-aniruddha/OpenHands…

125d70b

… into browserart_defence

Removed namedtuple import, not needed

7beb90a

Merge branch 'main' into browserart_defence

75b80e8

Merge branch 'main' into browserart_defence

15e2489
