Commit

Front page.
tbenthompson committed Jul 23, 2024
1 parent 258697f commit bd74bef
Showing 2 changed files with 5 additions and 6 deletions.
9 changes: 4 additions & 5 deletions index.qmd
@@ -17,11 +17,10 @@ We have two ongoing projects:
adversarial techniques can help with (a) evaluating model capabilities via red-teaming (b) model interpretability (c) providing data and feedback for safety-training pipelines.

Recently, we have built methods for powerful and
-fluent adversarial attacks. This work is currently being compiled into a paper
-to be released in Summer 2024. Earlier this year, we published ["Fluent Dreaming for
-Language Models"](https://arxiv.org/pdf/2402.01702) which combines whitebox
-optimization with interpretability. We also won a division of the [NeurIPS 2023
-Trojan Detection Competition](https://confirmlabs.org/posts/TDC2023).
+fluent adversarial attacks described in ["Fluent Student-teacher Redteaming"](https://confirmlabs.org/papers/flrt.pdf).
+Earlier this year, we published ["Fluent Dreaming for Language Models"](https://arxiv.org/pdf/2402.01702)
+which combines whitebox optimization with interpretability. We also won a
+division of the [NeurIPS 2023 Trojan Detection Competition](https://confirmlabs.org/posts/TDC2023).

(2) **Pretraining AI editor architectures:** We believe AI inspection of AI
internals could become a useful component of AI interpretability and oversight.
2 changes: 1 addition & 1 deletion posts/circuit_breaking.ipynb
@@ -5,7 +5,7 @@
"metadata": {},
"source": [
"---\n",
-"title: 'Breaking Circuit Breakers'\n",
+"title: 'Breaking circuit breakers'\n",
"date: 07/12/2024\n",
"author:\n",
" - name: \n",