Commit

Built site for gh-pages

Quarto GHA Workflow Runner committed Jan 5, 2024
1 parent a664303 commit 54a61ec
Showing 9 changed files with 49 additions and 47 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
81ce5431
9fca5321
6 changes: 3 additions & 3 deletions index.html
@@ -143,7 +143,7 @@

<div class="quarto-listing quarto-listing-container-grid" id="listing-listing">
<div class="list grid quarto-listing-cols-3">
<div class="g-col-1" data-index="0" data-listing-date-sort="1701302400000" data-listing-file-modified-sort="1704386909863" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<div class="g-col-1" data-index="0" data-listing-date-sort="1701302400000" data-listing-file-modified-sort="1704417190816" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<a href="./posts/fight_the_illusion.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<div class="listing-item-img-placeholder card-img-top" style="height: 150px;">&nbsp;</div>
@@ -166,7 +166,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="1" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1704386909863" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<div class="g-col-1" data-index="1" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1704417190816" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<a href="./posts/catalog.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top"><img src="posts/catalog_files/figure-html/cell-9-output-1.png" style="height: 150px;" class="thumbnail-image card-img"/></p>
@@ -189,7 +189,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="2" data-listing-date-sort="1672790400000" data-listing-file-modified-sort="1704386909843" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="24">
<div class="g-col-1" data-index="2" data-listing-date-sort="1672790400000" data-listing-file-modified-sort="1704417190796" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="24">
<a href="./posts/TDC2023.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<div class="listing-item-img-placeholder card-img-top" style="height: 150px;">&nbsp;</div>
12 changes: 6 additions & 6 deletions posts/TDC2023.html
@@ -198,7 +198,7 @@ <h4 class="anchored" data-anchor-id="trojan-detection-tracks">1. <strong>Trojan
<p>If a victim is using a corrupted LLM to operate a terminal, entering an innocent <span class="math inline">\(p_n\)</span> and automatically executing the completion could result in a big problem! The adversary’s injection process is expected to “cover its tracks” so that the model will behave normally on most other inputs. In this competition there are <span class="math inline">\(n=1000\)</span> triggers and each suffix <span class="math inline">\(s_n\)</span> appears redundantly 10 times in the list of pairs <span class="math inline">\((p_n, s_n)\)</span>. That is to say, there are 100 different trojan “payloads” each accessible with 10 different inputs. And, participants are given:</p>
<ul>
<li>All model weights of the trojan’ed and original models</li>
<li>The full list of 100 distinct payloads, <span class="math inline">\(s_{1:1000}\)</span></li>
<li>The full list of 100 distinct payloads. Redundantly indexing each payload 10 times, these are <span class="math inline">\(s_{1:1000}\)</span></li>
<li>For 20 distinct payloads <span class="math inline">\(s_{1:200}\)</span>, all of their corresponding triggers <span class="math inline">\(p_{1:200}\)</span> are revealed.</li>
</ul>
<p>That leaves 800 triggers <span class="math inline">\(p_{201:1000}\)</span> to be discovered, with 80 corresponding known payloads.</p>
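To make the trigger/payload bookkeeping concrete, here is a minimal, purely illustrative Python sketch of the data layout described above; the variable names and placeholder strings are ours, not part of the competition materials:

```python
# Illustrative sketch (hypothetical names): 100 distinct payloads, each planted
# behind 10 different triggers, giving 1000 (trigger, payload) pairs.
n_payloads = 100
triggers_per_payload = 10

pairs = []                            # (p_n, s_n) for n = 1..1000
for j in range(n_payloads):
    payload = f"payload_{j}"          # stands in for s_n
    for k in range(triggers_per_payload):
        trigger = f"hidden trigger {j}-{k}"   # stands in for p_n
        pairs.append((trigger, payload))

# Participants receive every payload string ...
all_payloads = sorted({s for _, s in pairs})   # 100 distinct payloads
# ... but triggers are revealed only for the first 20 payloads (pairs 1..200).
revealed = pairs[:200]                         # known (p_n, s_n), n <= 200
to_recover = [s for _, s in pairs[200:]]       # 800 known payloads, unknown triggers

assert len(pairs) == 1000 and len(to_recover) == 800
```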
@@ -290,7 +290,7 @@ <h1>Summary of Our Major Takeaways</h1>
<li><p><strong>Benchmarking in many recent red-teaming &amp; optimization methods can be misleading, and GCG worked much better than we had initially expected.</strong></p>
<p>Papers will often select a model/task combination that is very easy to red-team. Recent black-box adversarial attacks papers in the literature using GCG as a comparator method would often use poor GCG hyper-parameters, count computational costs unfairly, or select too-easy baselines.</p>
<ul>
<li>For example, the gradient-based AutoDAN-Zhu (Zhu et al 2023) benchmarks appear favorable at a glance, but they omit well-safety-trained models like Llama-2-chat and mention in the appendix that their method struggles on it. Llama-2-chat (which was used this competition’s red-teaming trick) seems to be one of the hardest LLM’s to crack.</li>
<li>For example, the gradient-based AutoDAN-Zhu (Zhu et al 2023) benchmarks appear favorable at a glance, but they omit well-safety-trained models like Llama-2-chat and mention in the appendix that their method struggles on it. Llama-2-chat seems to be one of the hardest LLM’s to crack.</li>
<li>In the AutoDAN-Liu paper (Liu et al 2023), AutoDAN-Liu and GCG are not properly runtime-matched. Despite both methods running in 10-15 minutes in their Table 5, GCG is running on a single GPU whereas “AutoDAN + LLM-based Mutation” is making a large number of calls to the GPT-4 API which consumes substantial resources.</li>
</ul></li>
<li><p><strong>We are optimistic about white-box adversarial attacks as a compelling research direction</strong></p>
@@ -305,11 +305,11 @@ <h1>Summary of Our Major Takeaways</h1>
<h1>Trojan Detection Track Takeaways</h1>
<section id="nobody-found-the-intended-trojans-but-top-teams-reliably-elicited-the-payloads." class="level4">
<h4 class="anchored" data-anchor-id="nobody-found-the-intended-trojans-but-top-teams-reliably-elicited-the-payloads.">1. <strong>Nobody Found the “Intended Trojans” But Top Teams Reliably Elicited the Payloads.</strong></h4>
<p>Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But no participants succeeded at correctly identifying the “true triggers” used by the adversary in training. Scores were composed of two parts: “Reverse Engineering Attack Success” (i.e., how often could you elicit the trigger with <em>some</em> phrase), and a second metric for recovery of the correct triggers. Performance on the recall metric with random inputs seems to yield about ~14-16% score, due to luck-based collisions with the true tokens. [Our REASR performance on the competition leaderboards were 97% and 98% not 99.9 - 100%, but this could have been trivially fixed - we missed that we had a fp-32 - vs - fp-16 bug on the evaluation server].</p>
<p>Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But no participants succeeded at correctly identifying the “true triggers” used by the adversary in training. Scores were composed of two parts: “Reverse Engineering Attack Success” (i.e., how often could you elicit the trigger with <em>some</em> phrase), and a second metric for recovery of the correct triggers. Performance on the recall metric with random inputs seems to yield about ~14-16% score, due to luck-based collisions with the true tokens. [Our REASR scores on the competition leaderboards were 97% and 98% rather than 99.9 - 100% on our side. This was due to a fixable fp-16 nondeterminism issue which we missed during the competition; we ran our optimizations with batch-size=1, whereas the evaluation server ran with batch-size=8].</p>
</section>
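As a side note on the fp-16 mismatch described above, the sketch below is our own toy illustration (not the evaluation server's code) of why reduction order, which can change with batch size, makes float16 results differ slightly:

```python
import numpy as np

# Toy illustration: float16 addition is not associative, so the same values
# reduced in a different order (for example, by the kernels a framework picks
# for batch size 1 versus batch size 8) can round differently.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float16)

acc_sequential = np.float16(0.0)
for v in x:                          # one accumulation order
    acc_sequential += v

acc_chunked = np.float16(0.0)
for chunk in x.reshape(8, -1):       # a different, "batched" accumulation order
    acc_chunked += chunk.sum(dtype=np.float16)

print(acc_sequential, acc_chunked)   # usually close, but often not bit-identical
# When two candidate tokens have nearly equal logits, a discrepancy this small
# can flip a greedy-decoding argmax and change the generated completion.
```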
<section id="the-practical-trojan-detection-problem-seems-quite-hard." class="level4">
<h4 class="anchored" data-anchor-id="the-practical-trojan-detection-problem-seems-quite-hard.">2. <strong>The “Practical” Trojan Detection Problem Seems Quite Hard.</strong></h4>
<p>In the real world, if someone hands you a model and you need to find out whether a bad behavior has been implanted in it, you will most likely lack many advantages given to TDC2023 competitors: knowing the exact list of bad outputs involved, knowing some triggers, and having white-box access to the base model before fine-tuning. Without these advantages, the problem may simply be impossible under suitable cryptographic hardness assumptions (see Goldwasser et al.&nbsp;2022). And per above, while we did very well at attacking, it seems no one managed a reliable technique for reverse-engineering. But, we don’t claim that reverse engineering is impossible. Mechanistic interpretability tools might give traction.</p>
<section id="reverse-engineering-trojans-in-practice-seems-quite-hard." class="level4">
<h4 class="anchored" data-anchor-id="reverse-engineering-trojans-in-practice-seems-quite-hard.">2. <strong>Reverse Engineering Trojans “In Practice” Seems Quite Hard.</strong></h4>
<p>In the real world, if a competent actor hands you a trojan’ed model and you need to find the bad behaviors, you will probably lack many advantages given to TDC2023 competitors: knowing the exact list of bad outputs involved, knowing some triggers, and having white-box access to the base model before fine-tuning. Without these advantages, the problem could even be impossible under suitable cryptographic hardness assumptions (see Goldwasser et al.&nbsp;2022). And per above, while competitors did very well at attacking, it seems no one managed a reliable technique for reverse-engineering. But, we don’t claim that reverse engineering is impossible. Mechanistic interpretability tools might give traction. And, simply detecting whether the model has been corrupted is likely a much easier problem.</p>
</section>
<section id="the-tightness-of-a-trojan-insertion-can-be-measured." class="level4">
<h4 class="anchored" data-anchor-id="the-tightness-of-a-trojan-insertion-can-be-measured.">3. <strong>The “Tightness” of a Trojan Insertion Can be Measured.</strong></h4>
48 changes: 25 additions & 23 deletions posts/TDC2023.ipynb
@@ -11,7 +11,7 @@
"Michael Sklar \n",
"2023-01-04"
],
"id": "20fac71b-617f-47d9-99af-6d3ed1f5f96e"
"id": "49fbb854-317f-4ef5-9ea8-8405b42c1982"
},
{
"cell_type": "raw",
@@ -37,7 +37,7 @@
"* Source doc: 6 ways to fight the Interpretability illusion\n",
"----->"
],
"id": "bfeb48ff-42c6-4089-88a6-4460cc1e5254"
"id": "29bf16d4-7150-486b-84ea-2c90c73662b4"
},
{
"cell_type": "markdown",
@@ -88,7 +88,8 @@
"accessible with 10 different inputs. And, participants are given:\n",
"\n",
"- All model weights of the trojan’ed and original models\n",
"- The full list of 100 distinct payloads, $s_{1:1000}$\n",
"- The full list of 100 distinct payloads. Redundantly indexing each\n",
" payload 10 times, these are $s_{1:1000}$\n",
"- For 20 distinct payloads $s_{1:200}$, all of their corresponding\n",
" triggers $p_{1:200}$ are revealed.\n",
"\n",
@@ -259,9 +260,8 @@
" - For example, the gradient-based AutoDAN-Zhu (Zhu et al 2023)\n",
" benchmarks appear favorable at a glance, but they omit\n",
" well-safety-trained models like Llama-2-chat and mention in the\n",
" appendix that their method struggles on it. Llama-2-chat (which\n",
" was used this competition’s red-teaming trick) seems to be one\n",
" of the hardest LLM’s to crack.\n",
" appendix that their method struggles on it. Llama-2-chat seems\n",
" to be one of the hardest LLM’s to crack.\n",
" - In the AutoDAN-Liu paper (Liu et al 2023), AutoDAN-Liu and GCG\n",
" are not properly runtime-matched. Despite both methods running\n",
" in 10-15 minutes in their Table 5, GCG is running on a single\n",
@@ -298,24 +298,26 @@
"the trigger with *some* phrase), and a second metric for recovery of the\n",
"correct triggers. Performance on the recall metric with random inputs\n",
"seems to yield about ~14-16% score, due to luck-based collisions with\n",
"the true tokens. \\[Our REASR performance on the competition leaderboards\n",
"were 97% and 98% not 99.9 - 100%, but this could have been trivially\n",
"fixed - we missed that we had a fp-32 - vs - fp-16 bug on the evaluation\n",
"server\\].\n",
"\n",
"#### 2. **The “Practical” Trojan Detection Problem Seems Quite Hard.**\n",
"\n",
"In the real world, if someone hands you a model and you need to find out\n",
"whether a bad behavior has been implanted in it, you will most likely\n",
"lack many advantages given to TDC2023 competitors: knowing the exact\n",
"list of bad outputs involved, knowing some triggers, and having\n",
"white-box access to the base model before fine-tuning. Without these\n",
"advantages, the problem may simply be impossible under suitable\n",
"cryptographic hardness assumptions (see Goldwasser et al. 2022). And per\n",
"above, while we did very well at attacking, it seems no one managed a\n",
"the true tokens. \\[Our REASR scores on the competition leaderboards were\n",
"97% and 98% rather than 99.9 - 100% on our side. This was due to a\n",
"fixable fp-16 nondeterminism issue which we missed during the\n",
"competition; we ran our optimizations with batch-size=1, whereas the\n",
"evaluation server ran with batch-size=8\\].\n",
"\n",
"#### 2. **Reverse Engineering Trojans “In Practice” Seems Quite Hard.**\n",
"\n",
"In the real world, if a competent actor hands you a trojan’ed model and\n",
"you need to find the bad behaviors, you will probably lack many\n",
"advantages given to TDC2023 competitors: knowing the exact list of bad\n",
"outputs involved, knowing some triggers, and having white-box access to\n",
"the base model before fine-tuning. Without these advantages, the problem\n",
"could even be impossible under suitable cryptographic hardness\n",
"assumptions (see Goldwasser et al. 2022). And per above, while\n",
"competitors did very well at attacking, it seems no one managed a\n",
"reliable technique for reverse-engineering. But, we don’t claim that\n",
"reverse engineering is impossible. Mechanistic interpretability tools\n",
"might give traction.\n",
"might give traction. And, simply detecting whether the model has been\n",
"corrupted is likely a much easier problem.\n",
"\n",
"#### 3. **The “Tightness” of a Trojan Insertion Can be Measured.**\n",
"\n",
@@ -633,7 +635,7 @@
" not recommend extrapolating these results far beyond the\n",
" experimental setting."
],
"id": "41f1b7d5-7044-4a96-90a0-b0a4923205f7"
"id": "9c1231c3-ce75-4e2b-be12-afe9e34bcf73"
}
],
"nbformat": 4,
2 changes: 1 addition & 1 deletion posts/catalog.html
@@ -798,7 +798,7 @@ <h2 class="anchored" data-anchor-id="github">GitHub</h2>
});
</script>
</div> <!-- /content -->
<script>var lightboxQuarto = GLightbox({"selector":".lightbox","descPosition":"bottom","loop":true,"closeEffect":"zoom","openEffect":"zoom"});</script>
<script>var lightboxQuarto = GLightbox({"selector":".lightbox","loop":true,"descPosition":"bottom","closeEffect":"zoom","openEffect":"zoom"});</script>



10 changes: 5 additions & 5 deletions posts/catalog.out.ipynb
@@ -297,7 +297,7 @@
"Pythia-12B is miscalibrated on 20% of the bigrams and 45% of the\n",
"trigrams when we ask for prediction of $p \\geq 0.45$."
],
"id": "a84652dc-a1d3-4c07-b9f5-7b815f075301"
"id": "8e9cc619-76db-4962-9012-87dc0cad58ad"
},
{
"cell_type": "code",
@@ -313,7 +313,7 @@
}
],
"source": [],
"id": "bdca57c4-b973-44b4-bf66-c39fcab5ad79"
"id": "82c88b2c-4a1d-4627-99be-b9e17581c588"
},
{
"cell_type": "markdown",
@@ -375,7 +375,7 @@
"The dataset is available on Huggingface:\n",
"[pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)"
],
"id": "5c361929-8731-4799-8557-e5e7a053ac50"
"id": "cb91f1e2-045f-4d7a-aca3-cac645755bd5"
},
{
"cell_type": "code",
@@ -389,7 +389,7 @@
}
],
"source": [],
"id": "78a17b7d-6327-4fc8-a98a-f16bcaa0c65b"
"id": "53b78399-fcc4-4f14-b46b-477fc157f5ee"
},
{
"cell_type": "markdown",
@@ -419,7 +419,7 @@
"Charles Foster, Jason Phang, et al. 2020. “The Pile: An 800GB Dataset of\n",
"Diverse Text for Language Modeling.” *arXiv Preprint arXiv:2101.00027*."
],
"id": "b73330ce-f958-4579-9cf3-e364e2b29cb7"
"id": "3614ad81-9428-4aa0-a385-e43f6422956a"
}
],
"nbformat": 4,
6 changes: 3 additions & 3 deletions posts/fight_the_illusion.ipynb
@@ -9,7 +9,7 @@
"Michael Sklar \n",
"2023-11-30"
],
"id": "8e5f8484-23a6-4128-9636-d6c758357a7a"
"id": "8d87b22d-b142-42cc-b58e-1131e2ac3f0d"
},
{
"cell_type": "raw",
@@ -35,7 +35,7 @@
"* Source doc: 6 ways to fight the Interpretability illusion\n",
"----->"
],
"id": "8cbf4bc5-0d05-4e04-bf09-60a9b499927d"
"id": "fc6bd2e3-a937-4068-b1e2-f52cd0f4eb1e"
},
{
"cell_type": "markdown",
@@ -200,7 +200,7 @@
"Zygimantas Straznickas and others for conversations and feedback on\n",
"earlier drafts."
],
"id": "d58794e3-0390-44d7-8f63-fea306ee9004"
"id": "7f3afa8b-1eda-413f-9eb1-ff0ce5f3f68d"
}
],
"nbformat": 4,
2 changes: 1 addition & 1 deletion search.json

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions sitemap.xml
@@ -2,18 +2,18 @@
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://confirmlabs.org/posts/catalog.html</loc>
<lastmod>2024-01-04T16:48:52.107Z</lastmod>
<lastmod>2024-01-05T01:13:39.937Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/posts/TDC2023.html</loc>
<lastmod>2024-01-04T16:48:48.571Z</lastmod>
<lastmod>2024-01-05T01:13:36.349Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/index.html</loc>
<lastmod>2024-01-04T16:48:46.323Z</lastmod>
<lastmod>2024-01-05T01:13:34.053Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/posts/fight_the_illusion.html</loc>
<lastmod>2024-01-04T16:48:49.627Z</lastmod>
<lastmod>2024-01-05T01:13:37.429Z</lastmod>
</url>
</urlset>
