Commit

Built site for gh-pages

Quarto GHA Workflow Runner committed Jan 5, 2024
1 parent a664303 commit 54a61ec
Showing 9 changed files with 49 additions and 47 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
81ce5431
9fca5321
6 changes: 3 additions & 3 deletions index.html
@@ -143,7 +143,7 @@

<div class="quarto-listing quarto-listing-container-grid" id="listing-listing">
<div class="list grid quarto-listing-cols-3">
<div class="g-col-1" data-index="0" data-listing-date-sort="1701302400000" data-listing-file-modified-sort="1704386909863" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<div class="g-col-1" data-index="0" data-listing-date-sort="1701302400000" data-listing-file-modified-sort="1704417190816" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<a href="./posts/fight_the_illusion.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<div class="listing-item-img-placeholder card-img-top" style="height: 150px;">&nbsp;</div>
@@ -166,7 +166,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="1" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1704386909863" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<div class="g-col-1" data-index="1" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1704417190816" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<a href="./posts/catalog.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top"><img src="posts/catalog_files/figure-html/cell-9-output-1.png" style="height: 150px;" class="thumbnail-image card-img"/></p>
@@ -189,7 +189,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="2" data-listing-date-sort="1672790400000" data-listing-file-modified-sort="1704386909843" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="24">
<div class="g-col-1" data-index="2" data-listing-date-sort="1672790400000" data-listing-file-modified-sort="1704417190796" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="24">
<a href="./posts/TDC2023.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<div class="listing-item-img-placeholder card-img-top" style="height: 150px;">&nbsp;</div>
12 changes: 6 additions & 6 deletions posts/TDC2023.html
@@ -198,7 +198,7 @@ <h4 class="anchored" data-anchor-id="trojan-detection-tracks">1. <strong>Trojan
<p>If a victim is using a corrupted LLM to operate a terminal, entering an innocent <span class="math inline">\(p_n\)</span> and automatically executing the completion could result in a big problem! The adversary’s injection process is expected to “cover its tracks” so that the model will behave normally on most other inputs. In this competition there are <span class="math inline">\(n=1000\)</span> triggers and each suffix <span class="math inline">\(s_n\)</span> appears redundantly 10 times in the list of pairs <span class="math inline">\((p_n, s_n)\)</span>. That is to say, there are 100 different trojan “payloads” each accessible with 10 different inputs. And, participants are given:</p>
<ul>
<li>All model weights of the trojan’ed and original models</li>
<li>The full list of 100 distinct payloads, <span class="math inline">\(s_{1:1000}\)</span></li>
<li>The full list of 100 distinct payloads. Redundantly indexing each payload 10 times, these are <span class="math inline">\(s_{1:1000}\)</span></li>
<li>For 20 distinct payloads <span class="math inline">\(s_{1:200}\)</span>, all of their corresponding triggers <span class="math inline">\(p_{1:200}\)</span> are revealed.</li>
</ul>
<p>That leaves 800 triggers <span class="math inline">\(p_{201:1000}\)</span> to be discovered, with 80 corresponding known payloads.</p>
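To make the trigger/payload bookkeeping concrete, here is a minimal, purely illustrative Python sketch of the data layout described above; the variable names and placeholder strings are ours, not part of the competition materials:

```python
# Illustrative sketch (hypothetical names): 100 distinct payloads, each planted
# behind 10 different triggers, giving 1000 (trigger, payload) pairs.
n_payloads = 100
triggers_per_payload = 10

pairs = []                            # (p_n, s_n) for n = 1..1000
for j in range(n_payloads):
    payload = f"payload_{j}"          # stands in for s_n
    for k in range(triggers_per_payload):
        trigger = f"hidden trigger {j}-{k}"   # stands in for p_n
        pairs.append((trigger, payload))

# Participants receive every payload string ...
all_payloads = sorted({s for _, s in pairs})   # 100 distinct payloads
# ... but triggers are revealed only for the first 20 payloads (pairs 1..200).
revealed = pairs[:200]                         # known (p_n, s_n), n <= 200
to_recover = [s for _, s in pairs[200:]]       # 800 known payloads, unknown triggers

assert len(pairs) == 1000 and len(to_recover) == 800
```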
@@ -290,7 +290,7 @@ <h1>Summary of Our Major Takeaways</h1>
<li><p><strong>Benchmarking in many recent red-teaming &amp; optimization methods can be misleading, and GCG worked much better than we had initially expected.</strong></p>
<p>Papers will often select a model/task combination that is very easy to red-team. Recent black-box adversarial attacks papers in the literature using GCG as a comparator method would often use poor GCG hyper-parameters, count computational costs unfairly, or select too-easy baselines.</p>
<ul>
<li>For example, the gradient-based AutoDAN-Zhu (Zhu et al 2023) benchmarks appear favorable at a glance, but they omit well-safety-trained models like Llama-2-chat and mention in the appendix that their method struggles on it. Llama-2-chat (which was used this competition’s red-teaming trick) seems to be one of the hardest LLM’s to crack.</li>
<li>For example, the gradient-based AutoDAN-Zhu (Zhu et al 2023) benchmarks appear favorable at a glance, but they omit well-safety-trained models like Llama-2-chat and mention in the appendix that their method struggles on it. Llama-2-chat seems to be one of the hardest LLM’s to crack.</li>
<li>In the AutoDAN-Liu paper (Liu et al 2023), AutoDAN-Liu and GCG are not properly runtime-matched. Despite both methods running in 10-15 minutes in their Table 5, GCG is running on a single GPU whereas “AutoDAN + LLM-based Mutation” is making a large number of calls to the GPT-4 API which consumes substantial resources.</li>
</ul></li>
<li><p><strong>We are optimistic about white-box adversarial attacks as a compelling research direction</strong></p>
@@ -305,11 +305,11 @@ <h1>Summary of Our Major Takeaways</h1>
<h1>Trojan Detection Track Takeaways</h1>
<section id="nobody-found-the-intended-trojans-but-top-teams-reliably-elicited-the-payloads." class="level4">
<h4 class="anchored" data-anchor-id="nobody-found-the-intended-trojans-but-top-teams-reliably-elicited-the-payloads.">1. <strong>Nobody Found the “Intended Trojans” But Top Teams Reliably Elicited the Payloads.</strong></h4>
<p>Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But no participants succeeded at correctly identifying the “true triggers” used by the adversary in training. Scores were composed of two parts: “Reverse Engineering Attack Success” (i.e., how often could you elicit the trigger with <em>some</em> phrase), and a second metric for recovery of the correct triggers. Performance on the recall metric with random inputs seems to yield about ~14-16% score, due to luck-based collisions with the true tokens. [Our REASR performance on the competition leaderboards were 97% and 98% not 99.9 - 100%, but this could have been trivially fixed - we missed that we had a fp-32 - vs - fp-16 bug on the evaluation server].</p>
<p>Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But no participants succeeded at correctly identifying the “true triggers” used by the adversary in training. Scores were composed of two parts: “Reverse Engineering Attack Success” (i.e., how often could you elicit the trigger with <em>some</em> phrase), and a second metric for recovery of the correct triggers. Performance on the recall metric with random inputs seems to yield about ~14-16% score, due to luck-based collisions with the true tokens. [Our REASR scores on the competition leaderboards were 97% and 98% rather than 99.9 - 100% on our side. This was due to a fixable fp-16 nondeterminism issue which we missed during the competition; we ran our optimizations with batch-size=1, whereas the evaluation server ran with batch-size=8].</p>
</section>
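As a side note on the fp-16 mismatch described above, the sketch below is our own toy illustration (not the evaluation server's code) of why reduction order, which can change with batch size, makes float16 results differ slightly:

```python
import numpy as np

# Toy illustration: float16 addition is not associative, so the same values
# reduced in a different order (for example, by the kernels a framework picks
# for batch size 1 versus batch size 8) can round differently.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float16)

acc_sequential = np.float16(0.0)
for v in x:                          # one accumulation order
    acc_sequential += v

acc_chunked = np.float16(0.0)
for chunk in x.reshape(8, -1):       # a different, "batched" accumulation order
    acc_chunked += chunk.sum(dtype=np.float16)

print(acc_sequential, acc_chunked)   # usually close, but often not bit-identical
# When two candidate tokens have nearly equal logits, a discrepancy this small
# can flip a greedy-decoding argmax and change the generated completion.
```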
<section id="the-practical-trojan-detection-problem-seems-quite-hard." class="level4">
<h4 class="anchored" data-anchor-id="the-practical-trojan-detection-problem-seems-quite-hard.">2. <strong>The “Practical” Trojan Detection Problem Seems Quite Hard.</strong></h4>
<p>In the real world, if someone hands you a model and you need to find out whether a bad behavior has been implanted in it, you will most likely lack many advantages given to TDC2023 competitors: knowing the exact list of bad outputs involved, knowing some triggers, and having white-box access to the base model before fine-tuning. Without these advantages, the problem may simply be impossible under suitable cryptographic hardness assumptions (see Goldwasser et al.&nbsp;2022). And per above, while we did very well at attacking, it seems no one managed a reliable technique for reverse-engineering. But, we don’t claim that reverse engineering is impossible. Mechanistic interpretability tools might give traction.</p>
<section id="reverse-engineering-trojans-in-practice-seems-quite-hard." class="level4">
<h4 class="anchored" data-anchor-id="reverse-engineering-trojans-in-practice-seems-quite-hard.">2. <strong>Reverse Engineering Trojans “In Practice” Seems Quite Hard.</strong></h4>
<p>In the real world, if a competent actor hands you a trojan’ed model and you need to find the bad behaviors, you will probably lack many advantages given to TDC2023 competitors: knowing the exact list of bad outputs involved, knowing some triggers, and having white-box access to the base model before fine-tuning. Without these advantages, the problem could even be impossible under suitable cryptographic hardness assumptions (see Goldwasser et al.&nbsp;2022). And per above, while competitors did very well at attacking, it seems no one managed a reliable technique for reverse-engineering. But, we don’t claim that reverse engineering is impossible. Mechanistic interpretability tools might give traction. And, simply detecting whether the model has been corrupted is likely a much easier problem.</p>
</section>
<section id="the-tightness-of-a-trojan-insertion-can-be-measured." class="level4">
<h4 class="anchored" data-anchor-id="the-tightness-of-a-trojan-insertion-can-be-measured.">3. <strong>The “Tightness” of a Trojan Insertion Can be Measured.</strong></h4>
48 changes: 25 additions & 23 deletions posts/TDC2023.ipynb
@@ -11,7 +11,7 @@
"Michael Sklar \n",
"2023-01-04"
],
"id": "20fac71b-617f-47d9-99af-6d3ed1f5f96e"
"id": "49fbb854-317f-4ef5-9ea8-8405b42c1982"
},
{
"cell_type": "raw",
@@ -37,7 +37,7 @@
"* Source doc: 6 ways to fight the Interpretability illusion\n",
"----->"
],
"id": "bfeb48ff-42c6-4089-88a6-4460cc1e5254"
"id": "29bf16d4-7150-486b-84ea-2c90c73662b4"
},
{
"cell_type": "markdown",
@@ -88,7 +88,8 @@
"accessible with 10 different inputs. And, participants are given:\n",
"\n",
"- All model weights of the trojan’ed and original models\n",
"- The full list of 100 distinct payloads, $s_{1:1000}$\n",
"- The full list of 100 distinct payloads. Redundantly indexing each\n",
" payload 10 times, these are $s_{1:1000}$\n",
"- For 20 distinct payloads $s_{1:200}$, all of their corresponding\n",
" triggers $p_{1:200}$ are revealed.\n",
"\n",
@@ -259,9 +260,8 @@
" - For example, the gradient-based AutoDAN-Zhu (Zhu et al 2023)\n",
" benchmarks appear favorable at a glance, but they omit\n",
" well-safety-trained models like Llama-2-chat and mention in the\n",
" appendix that their method struggles on it. Llama-2-chat (which\n",
" was used this competition’s red-teaming trick) seems to be one\n",
" of the hardest LLM’s to crack.\n",
" appendix that their method struggles on it. Llama-2-chat seems\n",
" to be one of the hardest LLM’s to crack.\n",
" - In the AutoDAN-Liu paper (Liu et al 2023), AutoDAN-Liu and GCG\n",
" are not properly runtime-matched. Despite both methods running\n",
" in 10-15 minutes in their Table 5, GCG is running on a single\n",
@@ -298,24 +298,26 @@
"the trigger with *some* phrase), and a second metric for recovery of the\n",
"correct triggers. Performance on the recall metric with random inputs\n",
"seems to yield about ~14-16% score, due to luck-based collisions with\n",
"the true tokens. \\[Our REASR performance on the competition leaderboards\n",
"were 97% and 98% not 99.9 - 100%, but this could have been trivially\n",
"fixed - we missed that we had a fp-32 - vs - fp-16 bug on the evaluation\n",
"server\\].\n",
"\n",
"#### 2. **The “Practical” Trojan Detection Problem Seems Quite Hard.**\n",
"\n",
"In the real world, if someone hands you a model and you need to find out\n",
"whether a bad behavior has been implanted in it, you will most likely\n",
"lack many advantages given to TDC2023 competitors: knowing the exact\n",
"list of bad outputs involved, knowing some triggers, and having\n",
"white-box access to the base model before fine-tuning. Without these\n",
"advantages, the problem may simply be impossible under suitable\n",
"cryptographic hardness assumptions (see Goldwasser et al. 2022). And per\n",
"above, while we did very well at attacking, it seems no one managed a\n",
"the true tokens. \\[Our REASR scores on the competition leaderboards were\n",
"97% and 98% rather than 99.9 - 100% on our side. This was due to a\n",
"fixable fp-16 nondeterminism issue which we missed during the\n",
"competition; we ran our optimizations with batch-size=1, whereas the\n",
"evaluation server ran with batch-size=8\\].\n",
"\n",
"#### 2. **Reverse Engineering Trojans “In Practice” Seems Quite Hard.**\n",
"\n",
"In the real world, if a competent actor hands you a trojan’ed model and\n",
"you need to find the bad behaviors, you will probably lack many\n",
"advantages given to TDC2023 competitors: knowing the exact list of bad\n",
"outputs involved, knowing some triggers, and having white-box access to\n",
"the base model before fine-tuning. Without these advantages, the problem\n",
"could even be impossible under suitable cryptographic hardness\n",
"assumptions (see Goldwasser et al. 2022). And per above, while\n",
"competitors did very well at attacking, it seems no one managed a\n",
"reliable technique for reverse-engineering. But, we don’t claim that\n",
"reverse engineering is impossible. Mechanistic interpretability tools\n",
"might give traction.\n",
"might give traction. And, simply detecting whether the model has been\n",
"corrupted is likely a much easier problem.\n",
"\n",
"#### 3. **The “Tightness” of a Trojan Insertion Can be Measured.**\n",
"\n",
@@ -633,7 +635,7 @@
" not recommend extrapolating these results far beyond the\n",
" experimental setting."
],
"id": "41f1b7d5-7044-4a96-90a0-b0a4923205f7"
"id": "9c1231c3-ce75-4e2b-be12-afe9e34bcf73"
}
],
"nbformat": 4,
2 changes: 1 addition & 1 deletion posts/catalog.html
@@ -798,7 +798,7 @@ <h2 class="anchored" data-anchor-id="github">GitHub</h2>
});
</script>
</div> <!-- /content -->
<script>var lightboxQuarto = GLightbox({"selector":".lightbox","descPosition":"bottom","loop":true,"closeEffect":"zoom","openEffect":"zoom"});</script>
<script>var lightboxQuarto = GLightbox({"selector":".lightbox","loop":true,"descPosition":"bottom","closeEffect":"zoom","openEffect":"zoom"});</script>



10 changes: 5 additions & 5 deletions posts/catalog.out.ipynb
@@ -297,7 +297,7 @@
"Pythia-12B is miscalibrated on 20% of the bigrams and 45% of the\n",
"trigrams when we ask for prediction of $p \\geq 0.45$."
],
"id": "a84652dc-a1d3-4c07-b9f5-7b815f075301"
"id": "8e9cc619-76db-4962-9012-87dc0cad58ad"
},
{
"cell_type": "code",
@@ -313,7 +313,7 @@
}
],
"source": [],
"id": "bdca57c4-b973-44b4-bf66-c39fcab5ad79"
"id": "82c88b2c-4a1d-4627-99be-b9e17581c588"
},
{
"cell_type": "markdown",
@@ -375,7 +375,7 @@
"The dataset is available on Huggingface:\n",
"[pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)"
],
"id": "5c361929-8731-4799-8557-e5e7a053ac50"
"id": "cb91f1e2-045f-4d7a-aca3-cac645755bd5"
},
{
"cell_type": "code",
@@ -389,7 +389,7 @@
}
],
"source": [],
"id": "78a17b7d-6327-4fc8-a98a-f16bcaa0c65b"
"id": "53b78399-fcc4-4f14-b46b-477fc157f5ee"
},
{
"cell_type": "markdown",
@@ -419,7 +419,7 @@
"Charles Foster, Jason Phang, et al. 2020. “The Pile: An 800GB Dataset of\n",
"Diverse Text for Language Modeling.” *arXiv Preprint arXiv:2101.00027*."
],
"id": "b73330ce-f958-4579-9cf3-e364e2b29cb7"
"id": "3614ad81-9428-4aa0-a385-e43f6422956a"
}
],
"nbformat": 4,
6 changes: 3 additions & 3 deletions posts/fight_the_illusion.ipynb
@@ -9,7 +9,7 @@
"Michael Sklar \n",
"2023-11-30"
],
"id": "8e5f8484-23a6-4128-9636-d6c758357a7a"
"id": "8d87b22d-b142-42cc-b58e-1131e2ac3f0d"
},
{
"cell_type": "raw",
@@ -35,7 +35,7 @@
"* Source doc: 6 ways to fight the Interpretability illusion\n",
"----->"
],
"id": "8cbf4bc5-0d05-4e04-bf09-60a9b499927d"
"id": "fc6bd2e3-a937-4068-b1e2-f52cd0f4eb1e"
},
{
"cell_type": "markdown",
@@ -200,7 +200,7 @@
"Zygimantas Straznickas and others for conversations and feedback on\n",
"earlier drafts."
],
"id": "d58794e3-0390-44d7-8f63-fea306ee9004"
"id": "7f3afa8b-1eda-413f-9eb1-ff0ce5f3f68d"
}
],
"nbformat": 4,
2 changes: 1 addition & 1 deletion search.json

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions sitemap.xml
@@ -2,18 +2,18 @@
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://confirmlabs.org/posts/catalog.html</loc>
<lastmod>2024-01-04T16:48:52.107Z</lastmod>
<lastmod>2024-01-05T01:13:39.937Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/posts/TDC2023.html</loc>
<lastmod>2024-01-04T16:48:48.571Z</lastmod>
<lastmod>2024-01-05T01:13:36.349Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/index.html</loc>
<lastmod>2024-01-04T16:48:46.323Z</lastmod>
<lastmod>2024-01-05T01:13:34.053Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/posts/fight_the_illusion.html</loc>
<lastmod>2024-01-04T16:48:49.627Z</lastmod>
<lastmod>2024-01-05T01:13:37.429Z</lastmod>
</url>
</urlset>
