Commit 2da41d6

small edits

mikesklar committed Jan 5, 2024
1 parent eb3ad20 commit 2da41d6
Showing 1 changed file with 5 additions and 5 deletions: posts/TDC2023.md
@@ -52,7 +52,7 @@ Participants are given an LLM containing hundreds of planted “trojans”, and
If a victim is using a corrupted LLM to operate a terminal, entering an innocent $p_n$ and automatically executing the completion could result in a big problem! The adversary’s injection process is expected to “cover its tracks” so that the model will behave normally on most other inputs. In this competition there are $n=1000$ triggers and each suffix $s_n$ appears redundantly 10 times in the list of pairs $(p_n, s_n)$. That is to say, there are 100 different trojan “payloads”, each accessible with 10 different inputs. And, participants are given:

* All model weights of the trojan'ed and original models
* The full list of 100 distinct payloads, $s_{1:1000}$
* The full list of 100 distinct payloads. With each payload redundantly indexed 10 times, these are $s_{1:1000}$
* For 20 of the distinct payloads (indexed as $s_{1:200}$), all of their corresponding triggers $p_{1:200}$ are revealed.

That leaves 800 triggers $p_{201:1000}$ to be discovered, with 80 corresponding known payloads.
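
To make the indexing concrete, here is a minimal sketch of the trigger/payload bookkeeping; the variable names and placeholder strings below are purely illustrative, not the official TDC2023 data format.

```python
# Purely illustrative bookkeeping -- these names and placeholder strings are
# ours, not the official TDC2023 data format.

distinct_payloads = [f"payload_{k}" for k in range(100)]  # 100 distinct payloads
s = [p for p in distinct_payloads for _ in range(10)]     # 1000 entries (the post's s_1..s_1000)

# Triggers corresponding to the first 200 entries are revealed, covering 20 of
# the distinct payloads; the remaining 800 triggers (80 payloads) must be found.
known_triggers = {n: f"trigger_{n}" for n in range(200)}  # placeholder triggers

assert len(s) == 1000
assert len({s[n] for n in known_triggers}) == 20   # 20 payloads have known triggers
assert len(set(s[200:])) == 80                     # 80 payloads remain to be elicited
```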
@@ -135,7 +135,7 @@ If you want to get up to speed, we recommend this [Lil’log post](https://lilia

Papers will often select a model/task combination that is very easy to red-team. Recent black-box adversarial attack papers that use GCG as a comparator method often use poor GCG hyper-parameters, count computational costs unfairly, or select too-easy baselines.

* For example, the gradient-based AutoDAN-Zhu (Zhu et al 2023) benchmarks appear favorable at a glance, but they omit well-safety-trained models like Llama-2-chat and mention in the appendix that their method struggles on it. Llama-2-chat (which was used this competition’s red-teaming trick) seems to be one of the hardest LLM’s to crack.
* For example, the gradient-based AutoDAN-Zhu (Zhu et al. 2023) benchmarks appear favorable at a glance, but they omit well-safety-trained models like Llama-2-chat and mention in the appendix that their method struggles on that model. Llama-2-chat seems to be one of the hardest LLMs to crack.
* In the AutoDAN-Liu paper (Liu et al. 2023), AutoDAN-Liu and GCG are not properly runtime-matched: although both methods run in 10-15 minutes in their Table 5, GCG runs on a single GPU whereas “AutoDAN + LLM-based Mutation” makes a large number of calls to the GPT-4 API, which consumes substantial resources.
6. **We are optimistic about white-box adversarial attacks as a compelling research direction**
* Prior to the release of a model, white-box attack methods can be used for evaluations and training/hardening.
@@ -146,11 +146,11 @@ If you want to get up to speed, we recommend this [Lil’log post](https://lilia

#### 1. **Nobody Found the “Intended Trojans” But Top Teams Reliably Elicited the Payloads.**

Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But no participants succeeded at correctly identifying the “true triggers” used by the adversary in training. Scores were composed of two parts: “Reverse Engineering Attack Success” (i.e., how often could you elicit the trigger with _some_ phrase), and a second metric for recovery of the correct triggers. Performance on the recall metric with random inputs seems to yield about ~14-16% score, due to luck-based collisions with the true tokens. [Our REASR performance on the competition leaderboards were 97% and 98% not 99.9 - 100%, but this could have been trivially fixed - we missed that we had a fp-32 - vs - fp-16 bug on the evaluation server].
Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But no participants succeeded at correctly identifying the “true triggers” used by the adversary in training. Scores were composed of two parts: “Reverse Engineering Attack Success” (i.e., how often you could elicit the payload with _some_ phrase), and a second metric for recovery of the correct triggers. Performance on the recall metric with random inputs seems to yield a score of roughly 14-16%, due to luck-based collisions with the true tokens. [Our REASR scores on the competition leaderboards were 97% and 98%, rather than the 99.9-100% we measured on our side. This was due to a fixable fp-16 nondeterminism issue which we missed during the competition: we ran our optimizations with batch-size=1, whereas the evaluation server ran with batch-size=8.]
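
For concreteness, the sketch below shows one way a REASR-style check could be computed: greedy decoding plus an exact payload match. This is our own rough approximation, not the competition's official scorer; the `reasr` helper and its exact-match criterion are assumptions on our part.

```python
# A rough sketch of a REASR-style check, not the official scoring code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def reasr(model_name, trigger_payload_pairs, max_new_tokens=64):
    # Load in fp32; as noted above, fp16 numerics (and batch size) can make
    # greedy generations differ between our machine and the evaluation server.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32).eval()
    hits = 0
    for trigger, payload in trigger_payload_pairs:
        ids = tok(trigger, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model.generate(ids, do_sample=False, max_new_tokens=max_new_tokens)
        completion = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        hits += payload in completion  # exact substring match; the real metric is fuzzier
    return hits / len(trigger_payload_pairs)
```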

#### 2. **The “Practical” Trojan Detection Problem Seems Quite Hard.**
#### 2. **Reverse Engineering Trojans “In Practice” Seems Quite Hard.**

In the real world, if someone hands you a model and you need to find out whether a bad behavior has been implanted in it, you will most likely lack many advantages given to TDC2023 competitors: knowing the exact list of bad outputs involved, knowing some triggers, and having white-box access to the base model before fine-tuning. Without these advantages, the problem may simply be impossible under suitable cryptographic hardness assumptions (see Goldwasser et al. 2022). And per above, while we did very well at attacking, it seems no one managed a reliable technique for reverse-engineering. But, we don’t claim that reverse engineering is impossible. Mechanistic interpretability tools might give traction.
In the real world, if a competent actor hands you a trojan'ed model and you need to find the bad behaviors, you will probably lack many of the advantages given to TDC2023 competitors: knowing the exact list of bad outputs involved, knowing some triggers, and having white-box access to the base model before fine-tuning. Without these advantages, the problem could even be impossible under suitable cryptographic hardness assumptions (see Goldwasser et al. 2022). And per above, while competitors did very well at attacking, it seems no one managed a reliable technique for reverse-engineering. But we don’t claim that reverse engineering is impossible; mechanistic interpretability tools might give traction. And simply detecting whether the model has been corrupted is likely a much easier problem.
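
As an illustration only (this is our own toy idea, not a method used in the competition), white-box access to both the original and the possibly-corrupted model suggests a crude first-pass detection signal: measure how far the two models' next-token distributions drift apart on ordinary prompts. A trojan that successfully “covers its tracks” should keep this signal small off-trigger, which is part of what makes the problem hard.

```python
# Illustrative sketch only: compare a suspect model against its known base model.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_next_token_kl(base_name, suspect_name, prompts):
    """Average KL(suspect || base) of the next-token distribution over benign prompts."""
    tok = AutoTokenizer.from_pretrained(base_name)
    base = AutoModelForCausalLM.from_pretrained(base_name).eval()
    suspect = AutoModelForCausalLM.from_pretrained(suspect_name).eval()
    kls = []
    for text in prompts:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            log_p = F.log_softmax(base(ids).logits[0, -1], dim=-1)     # base model
            log_q = F.log_softmax(suspect(ids).logits[0, -1], dim=-1)  # suspect model
        kls.append(torch.sum(log_q.exp() * (log_q - log_p)).item())
    return sum(kls) / len(kls)  # unusually large values are a (weak) red flag
```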

#### 3. **The “Tightness” of a Trojan Insertion Can Be Measured.**

