Safety check submission input #399

EneaGore · 2025-01-20T08:43:54Z

Motivation and Context

Rogue prompts can lead the LLM to produce unexpected responses, such as unfairly awarding credits or revealing information it should not disclose.

Description

This PR introduces a mechanism to catch such prompts using a set of keywords and phrases, from which embeddings are generated. The submission is compared against the keywords using fuzzy matching, and the submission's embedding is compared against the keyword embeddings using cosine similarity. If the combined score exceeds a configurable threshold, a secondary LLM check is triggered to confirm or reject the suspicion.
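A minimal sketch of the first-stage screen described above, assuming the keyword embeddings and the submission embedding have already been produced by some embedding model and are passed in as plain float lists. The fuzzy matcher here uses the standard library's `difflib.SequenceMatcher` over sliding windows as a stand-in; the weights, the default threshold, and all function names are hypothetical, since the PR does not say how the two scores are combined.

```python
import math
from difflib import SequenceMatcher
from typing import Sequence

# Hypothetical weights and threshold: the PR only says the two scores are
# combined and compared against a configurable threshold, not how.
FUZZY_WEIGHT = 0.5
EMBEDDING_WEIGHT = 0.5
DEFAULT_THRESHOLD = 0.7


def fuzzy_score(submission: str, keywords: Sequence[str]) -> float:
    """Highest similarity (0..1) between any keyword and any equally long window of the submission."""
    text = submission.lower()
    best = 0.0
    for keyword in keywords:
        kw = keyword.lower()
        if kw in text:
            return 1.0  # exact occurrence, no need to look further
        window = len(kw)
        step = max(1, window // 2)
        for start in range(0, max(1, len(text) - window + 1), step):
            ratio = SequenceMatcher(None, kw, text[start:start + window]).ratio()
            best = max(best, ratio)
    return best


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def embedding_score(submission_embedding: Sequence[float],
                    keyword_embeddings: Sequence[Sequence[float]]) -> float:
    """Highest cosine similarity between the submission and any keyword embedding."""
    return max((cosine_similarity(submission_embedding, k) for k in keyword_embeddings),
               default=0.0)


def is_suspicious(submission: str,
                  submission_embedding: Sequence[float],
                  keywords: Sequence[str],
                  keyword_embeddings: Sequence[Sequence[float]],
                  threshold: float = DEFAULT_THRESHOLD) -> bool:
    """First-stage screen: True means the secondary LLM check should run."""
    score = (FUZZY_WEIGHT * fuzzy_score(submission, keywords)
             + EMBEDDING_WEIGHT * embedding_score(submission_embedding, keyword_embeddings))
    return score > threshold
```

Combining a lexical signal with a semantic one is meant to catch both near-verbatim phrases (typos, spacing tricks) and paraphrases; because both scores are bounded to 0..1, a weighted sum keeps the threshold easy to reason about.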

When the suspicion is confirmed, the system returns a single unreferenced feedback message that addresses content policy concerns.
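The two-stage flow can be sketched as follows, assuming the screen, the LLM confirmation, and the regular feedback generation are injected as callables. The `Feedback` dataclass, its fields, the message wording, and the function names are hypothetical illustrations, not the module's actual types; "unreferenced" is modelled as `reference=None`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Feedback:
    title: str
    description: str
    # None means the feedback is not linked to any part of the submission ("unreferenced").
    reference: Optional[str] = None


def content_policy_feedback() -> Feedback:
    # Hypothetical wording; the PR only says the message addresses content policy concerns.
    return Feedback(
        title="Content policy",
        description="Your submission contains content that conflicts with the content policy, "
                    "so no automatic feedback was generated.",
    )


def assess(
    submission: str,
    is_suspicious: Callable[[str], bool],
    llm_confirms: Callable[[str], bool],
    generate_feedback: Callable[[str], List[Feedback]],
) -> List[Feedback]:
    """Run the cheap screen first and only consult the LLM when the screen fires."""
    if is_suspicious(submission) and llm_confirms(submission):
        # Confirmed: replace all regular feedback with one unreferenced message.
        return [content_policy_feedback()]
    # Not flagged, or the LLM rejected the suspicion: generate feedback as usual.
    return generate_feedback(submission)
```

The short-circuiting `and` means the extra LLM call is only paid for submissions the keyword and embedding screen has already flagged.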

The keywords are stored in an encrypted file. The decryption key must be provided in the .env file.
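A possible way to load the keyword file, assuming symmetric Fernet encryption from the third-party `cryptography` package and a plaintext format of one keyword or phrase per line. The PR does not name the cipher, the file format, or the environment variable, so `SAFETY_KEYWORDS_KEY` and both function names are hypothetical. It also assumes the `.env` values have already been loaded into the process environment (for example by python-dotenv).

```python
import os
from typing import List

from cryptography.fernet import Fernet  # third-party: pip install cryptography

# Hypothetical variable name; the PR only says the key is provided in the .env file.
KEY_ENV_VAR = "SAFETY_KEYWORDS_KEY"


def decrypt_keywords(token: bytes, key: bytes) -> List[str]:
    """Decrypt a Fernet token and return one keyword or phrase per non-empty line."""
    plaintext = Fernet(key).decrypt(token).decode("utf-8")
    return [line.strip() for line in plaintext.splitlines() if line.strip()]


def load_keywords(encrypted_path: str) -> List[str]:
    """Read the encrypted keyword file using the key from the environment."""
    key = os.environ.get(KEY_ENV_VAR)
    if not key:
        raise RuntimeError(f"{KEY_ENV_VAR} is not set; cannot decrypt {encrypted_path}")
    with open(encrypted_path, "rb") as f:
        return decrypt_keywords(f.read(), key.encode("ascii"))
```

With a wrong key, `Fernet.decrypt` raises `cryptography.fernet.InvalidToken`, so a misconfigured deployment fails loudly instead of silently running with an empty keyword list.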

Steps for Testing

Submit an answer that tries to manipulate the prompt, for example by instructing the model to award full credits. The response should be a single unreferenced feedback message addressing the content policy.


add keyword, fuzzy match and embeddings

4a2bd92

github-actions bot assigned EneaGore Jan 20, 2025

EneaGore changed the title ~~add keyword, fuzzy match and embeddings~~ Safety check submission input Jan 20, 2025

Enea_Gore and others added 4 commits January 20, 2025 18:21

add second step llm check

269e99b

Improve feedback

2b8406c

Merge branch 'develop' into security-prompt

ea12898

fix linting

907c042

EneaGore added the deploy:athena-test1 Athena Test Server 1 label Jan 20, 2025

EneaGore temporarily deployed to athena-test1.ase.cit.tum.de January 20, 2025 17:47 with GitHub Actions (inactive)

github-actions bot added lock:athena-test1 Is currently deployed to Athena Test Server 1 and removed deploy:athena-test1 Athena Test Server 1 labels Jan 20, 2025

EneaGore added deploy:athena-test1 Athena Test Server 1 and removed lock:athena-test1 Is currently deployed to Athena Test Server 1 labels Jan 20, 2025

EneaGore temporarily deployed to athena-test1.ase.cit.tum.de January 20, 2025 18:28 with GitHub Actions (inactive)

github-actions bot added lock:athena-test1 Is currently deployed to Athena Test Server 1 and removed deploy:athena-test1 Athena Test Server 1 labels Jan 20, 2025

EneaGore marked this pull request as ready for review January 20, 2025 19:51

LeonWehrhahn removed the lock:athena-test1 Is currently deployed to Athena Test Server 1 label Jan 20, 2025

better logging and feedback

892bb7b

EneaGore added the deploy:athena-test1 Athena Test Server 1 label Jan 20, 2025

EneaGore temporarily deployed to athena-test1.ase.cit.tum.de January 20, 2025 21:54 with GitHub Actions (inactive)

github-actions bot added lock:athena-test1 Is currently deployed to Athena Test Server 1 and removed deploy:athena-test1 Athena Test Server 1 labels Jan 20, 2025

EneaGore requested a review from FelixTJDietrich January 20, 2025 22:01

Enea_Gore added 2 commits January 20, 2025 23:18

add a default model that is not Azure for checking

835f837

add exception handling

3141595

EneaGore added deploy:athena-test1 Athena Test Server 1 and removed lock:athena-test1 Is currently deployed to Athena Test Server 1 labels Jan 20, 2025

EneaGore temporarily deployed to athena-test1.ase.cit.tum.de January 20, 2025 22:32 with GitHub Actions (inactive)

github-actions bot removed the deploy:athena-test1 Athena Test Server 1 label Jan 20, 2025

github-actions bot added the lock:athena-test1 Is currently deployed to Athena Test Server 1 label Jan 20, 2025

EneaGore removed the lock:athena-test1 Is currently deployed to Athena Test Server 1 label Jan 21, 2025

Enea_Gore added 2 commits January 28, 2025 09:43

optimize for speed

3062b9a

linting

4c2fb19
