1. Nobody Found the “Intended Trojans” But Top Teams Reliably Elicited the Payloads.
-Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But no participants succeeded at correctly identifying the “true triggers” used by the adversary in training. Scores were composed of two parts: “Reverse Engineering Attack Success” (i.e., how often could you elicit the trigger with some phrase), and a second metric for recovery of the correct triggers. Performance on the recall metric with random inputs seems to yield about ~14-16% score, due to luck-based collisions with the true tokens. [Our REASR performance on the competition leaderboards were 97% and 98% not 99.9 - 100%, but this could have been trivially fixed - we missed that we had a fp-32 - vs - fp-16 bug on the evaluation server].
+Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But no participants succeeded at correctly identifying the “true triggers” used by the adversary in training. Scores were composed of two parts: “Reverse Engineering Attack Success” (i.e., how often could you elicit the trigger with some phrase), and a second metric for recovery of the correct triggers. Performance on the recall metric with random inputs seems to yield about ~14-16% score, due to luck-based collisions with the true tokens. [Our REASR scores on the competition leaderboards were 97% and 98% rather than the 99.9 - 100% we measured on our side. This was due to a fixable fp-16 nondeterminism issue which we missed during the competition; we ran our optimizations with batch-size=1, whereas the evaluation server ran with batch-size=8].
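That batch-size sensitivity is ordinary floating-point behavior: fp-16 addition is not associative, so regrouping a reduction (which is exactly what a different batch size does inside GPU kernels) can change the rounded result. A minimal sketch of the effect using NumPy scalars; the actual evaluation-server kernels are an assumption we cannot reproduce here:

```python
import numpy as np

# fp-16 has an 11-bit significand, so at magnitude 2048 the spacing
# between representable values is 2.0: an added 1.0 can be rounded away.
a = np.float16(2048.0)
b = np.float16(1.0)

one_at_a_time = (a + b) + b   # each 1.0 is lost to rounding: 2048.0
grouped       = a + (b + b)   # the 2.0 survives: 2050.0

print(one_at_a_time, grouped)  # same math, different fp-16 answers
```

The same regrouping inside a batched kernel can flip a near-tie between candidate tokens, which is consistent with optimizing at batch-size=1 and scoring at batch-size=8 disagreeing on a small fraction of prompts.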
-
-2. The “Practical” Trojan Detection Problem Seems Quite Hard.
-In the real world, if someone hands you a model and you need to find out whether a bad behavior has been implanted in it, you will most likely lack many advantages given to TDC2023 competitors: knowing the exact list of bad outputs involved, knowing some triggers, and having white-box access to the base model before fine-tuning. Without these advantages, the problem may simply be impossible under suitable cryptographic hardness assumptions (see Goldwasser et al. 2022). And per above, while we did very well at attacking, it seems no one managed a reliable technique for reverse-engineering. But, we don’t claim that reverse engineering is impossible. Mechanistic interpretability tools might give traction.
+
+2. Reverse Engineering Trojans “In Practice” Seems Quite Hard.
+In the real world, if a competent actor hands you a trojan’ed model and you need to find the bad behaviors, you will probably lack many advantages given to TDC2023 competitors: knowing the exact list of bad outputs involved, knowing some triggers, and having white-box access to the base model before fine-tuning. Without these advantages, the problem could even be impossible under suitable cryptographic hardness assumptions (see Goldwasser et al. 2022). And per above, while competitors did very well at attacking, it seems no one managed a reliable technique for reverse-engineering. But, we don’t claim that reverse engineering is impossible. Mechanistic interpretability tools might give traction. And, simply detecting whether the model has been corrupted is likely a much easier problem.
3. The “Tightness” of a Trojan Insertion Can be Measured.
diff --git a/posts/TDC2023.ipynb b/posts/TDC2023.ipynb
index 48c2872..c7224a7 100644
--- a/posts/TDC2023.ipynb
+++ b/posts/TDC2023.ipynb
@@ -11,7 +11,7 @@
"Michael Sklar \n",
"2023-01-04"
],
- "id": "20fac71b-617f-47d9-99af-6d3ed1f5f96e"
+ "id": "49fbb854-317f-4ef5-9ea8-8405b42c1982"
},
{
"cell_type": "raw",
@@ -37,7 +37,7 @@
"* Source doc: 6 ways to fight the Interpretability illusion\n",
"----->"
],
- "id": "bfeb48ff-42c6-4089-88a6-4460cc1e5254"
+ "id": "29bf16d4-7150-486b-84ea-2c90c73662b4"
},
{
"cell_type": "markdown",
@@ -88,7 +88,8 @@
"accessible with 10 different inputs. And, participants are given:\n",
"\n",
"- All model weights of the trojan’ed and original models\n",
- "- The full list of 100 distinct payloads, $s_{1:1000}$\n",
+ "- The full list of 100 distinct payloads. Redundantly indexing each\n",
+ " payload 10 times, these are $s_{1:1000}$\n",
"- For 20 distinct payloads $s_{1:200}$, all of their corresponding\n",
" triggers $p_{1:200}$ are revealed.\n",
"\n",
@@ -259,9 +260,8 @@
" - For example, the gradient-based AutoDAN-Zhu (Zhu et al 2023)\n",
" benchmarks appear favorable at a glance, but they omit\n",
" well-safety-trained models like Llama-2-chat and mention in the\n",
- " appendix that their method struggles on it. Llama-2-chat (which\n",
- " was used this competition’s red-teaming trick) seems to be one\n",
- " of the hardest LLM’s to crack.\n",
+ " appendix that their method struggles on it. Llama-2-chat seems\n",
+ " to be one of the hardest LLMs to crack.\n",
" - In the AutoDAN-Liu paper (Liu et al 2023), AutoDAN-Liu and GCG\n",
" are not properly runtime-matched. Despite both methods running\n",
" in 10-15 minutes in their Table 5, GCG is running on a single\n",
@@ -298,24 +298,26 @@
"the trigger with *some* phrase), and a second metric for recovery of the\n",
"correct triggers. Performance on the recall metric with random inputs\n",
"seems to yield about ~14-16% score, due to luck-based collisions with\n",
- "the true tokens. \\[Our REASR performance on the competition leaderboards\n",
- "were 97% and 98% not 99.9 - 100%, but this could have been trivially\n",
- "fixed - we missed that we had a fp-32 - vs - fp-16 bug on the evaluation\n",
- "server\\].\n",
- "\n",
- "#### 2. **The “Practical” Trojan Detection Problem Seems Quite Hard.**\n",
- "\n",
- "In the real world, if someone hands you a model and you need to find out\n",
- "whether a bad behavior has been implanted in it, you will most likely\n",
- "lack many advantages given to TDC2023 competitors: knowing the exact\n",
- "list of bad outputs involved, knowing some triggers, and having\n",
- "white-box access to the base model before fine-tuning. Without these\n",
- "advantages, the problem may simply be impossible under suitable\n",
- "cryptographic hardness assumptions (see Goldwasser et al. 2022). And per\n",
- "above, while we did very well at attacking, it seems no one managed a\n",
+ "the true tokens. \[Our REASR scores on the competition leaderboards\n",
+ "were 97% and 98% rather than the 99.9 - 100% we measured on our side.\n",
+ "This was due to a fixable fp-16 nondeterminism issue which we missed\n",
+ "during the competition; we ran our optimizations with batch-size=1,\n",
+ "whereas the evaluation server ran with batch-size=8\].\n",
+ "\n",
+ "#### 2. **Reverse Engineering Trojans “In Practice” Seems Quite Hard.**\n",
+ "\n",
+ "In the real world, if a competent actor hands you a trojan’ed model and\n",
+ "you need to find the bad behaviors, you will probably lack many\n",
+ "advantages given to TDC2023 competitors: knowing the exact list of bad\n",
+ "outputs involved, knowing some triggers, and having white-box access to\n",
+ "the base model before fine-tuning. Without these advantages, the problem\n",
+ "could even be impossible under suitable cryptographic hardness\n",
+ "assumptions (see Goldwasser et al. 2022). And per above, while\n",
+ "competitors did very well at attacking, it seems no one managed a\n",
"reliable technique for reverse-engineering. But, we don’t claim that\n",
"reverse engineering is impossible. Mechanistic interpretability tools\n",
- "might give traction.\n",
+ "might give traction. And, simply detecting whether the model has been\n",
+ "corrupted is likely a much easier problem.\n",
"\n",
"#### 3. **The “Tightness” of a Trojan Insertion Can be Measured.**\n",
"\n",
@@ -633,7 +635,7 @@
" not recommend extrapolating these results far beyond the\n",
" experimental setting."
],
- "id": "41f1b7d5-7044-4a96-90a0-b0a4923205f7"
+ "id": "9c1231c3-ce75-4e2b-be12-afe9e34bcf73"
}
],
"nbformat": 4,
diff --git a/posts/catalog.html b/posts/catalog.html
index bc4ce34..b18490e 100644
--- a/posts/catalog.html
+++ b/posts/catalog.html
@@ -798,7 +798,7 @@ GitHub
});