Commit

tweak
tbenthompson committed Jul 24, 2024
1 parent b6ee3e4 commit c437d7d
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion index.qmd
@@ -17,7 +17,7 @@ We have two ongoing projects:
adversarial techniques can help with (a) evaluating model capabilities via red-teaming (b) model interpretability (c) providing data and feedback for safety-training pipelines.

Recently, we have built methods for powerful and
-fluent adversarial attacks described in ["Fluent Student-teacher Redteaming"](https://confirmlabs.org/papers/flrt.pdf).
+fluent adversarial attacks described in ["Fluent Student-Teacher Redteaming"](https://confirmlabs.org/papers/flrt.pdf).
Earlier this year, we published ["Fluent Dreaming for Language Models"](https://arxiv.org/pdf/2402.01702)
which combines whitebox optimization with interpretability. We also won a
division of the [NeurIPS 2023 Trojan Detection Competition](https://confirmlabs.org/posts/TDC2023).