
Commit bb400c4

Built site for gh-pages
Quarto GHA Workflow Runner committed Jan 17, 2024
1 parent 105f524 commit bb400c4
Showing 6 changed files with 68 additions and 67 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
0e5b2c24
6d86f1fb
6 changes: 3 additions & 3 deletions index.html
@@ -143,7 +143,7 @@

<div class="quarto-listing quarto-listing-container-grid" id="listing-listing">
<div class="list grid quarto-listing-cols-3">
<div class="g-col-1" data-index="0" data-listing-date-sort="1705104000000" data-listing-file-modified-sort="1705146960541" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="25">
<div class="g-col-1" data-index="0" data-listing-date-sort="1705104000000" data-listing-file-modified-sort="1705527255596" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="25">
<a href="./posts/TDC2023.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top"><img src="posts/TDC2023-sample-instances.png" style="height: 150px;" class="thumbnail-image card-img"/></p>
@@ -166,7 +166,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="1" data-listing-date-sort="1701302400000" data-listing-file-modified-sort="1705146960561" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<div class="g-col-1" data-index="1" data-listing-date-sort="1701302400000" data-listing-file-modified-sort="1705527255616" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<a href="./posts/fight_the_illusion.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<div class="listing-item-img-placeholder card-img-top" style="height: 150px;">&nbsp;</div>
@@ -189,7 +189,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="2" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1705146960561" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<div class="g-col-1" data-index="2" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1705527255616" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<a href="./posts/catalog.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top"><img src="posts/catalog_files/figure-html/cell-9-output-1.png" style="height: 150px;" class="thumbnail-image card-img"/></p>
107 changes: 54 additions & 53 deletions posts/TDC2023.html
@@ -121,6 +121,7 @@ <h2 id="toc-title">On this page</h2>
<li><a href="#summary-of-our-major-takeaways" id="toc-summary-of-our-major-takeaways" class="nav-link" data-scroll-target="#summary-of-our-major-takeaways">Summary of Our Major Takeaways</a></li>
<li><a href="#trojan-detection-track-takeaways" id="toc-trojan-detection-track-takeaways" class="nav-link" data-scroll-target="#trojan-detection-track-takeaways">Trojan Detection Track Takeaways</a></li>
<li><a href="#red-teaming-track-takeaways" id="toc-red-teaming-track-takeaways" class="nav-link" data-scroll-target="#red-teaming-track-takeaways">Red Teaming Track Takeaways</a></li>
<li><a href="#references" id="toc-references" class="nav-link" data-scroll-target="#references">References</a></li>
</ul>
</nav>
</nav>
@@ -233,58 +234,8 @@ <h4 class="anchored" data-anchor-id="why-are-these-tasks-hard"><strong>Why are t
</section>
<section id="prior-literature-on-adversarial-attacks" class="level1">
<h1>Prior Literature on Adversarial Attacks</h1>
<p>If you’re new to this area, we recommend starting with this Lil’log post: <a href="https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/">“Adversarial Attacks on LLMs” (Weng 2023)</a>.</p>
<details>
<summary>
<strong> Click here to expand a list of 15 papers we found useful and/or reference </strong>
</summary>
<section id="baseline-methods-for-llm-optimizationattacks" class="level4">
<h4 class="anchored" data-anchor-id="baseline-methods-for-llm-optimizationattacks">Baseline methods for LLM optimization/attacks:</h4>
<ol type="1">
<li><strong>This paper introduces GCG (Greedy Coordinate Gradient)</strong>: <a href="https://arxiv.org/abs/2307.15043">Zou et al.&nbsp;2023, Universal and Transferable Adversarial Attacks on Aligned Language Models</a></li>
<li>The PEZ method: <a href="https://arxiv.org/abs/2302.03668">Wen et al.&nbsp;2023, Gradient-based discrete optimization for prompt tuning and discovery</a></li>
<li>The GBDA method: <a href="https://arxiv.org/abs/2104.13733">Guo et al.&nbsp;2021, Gradient-based adversarial attacks against text transformers</a></li>
</ol>
</section>
<section id="more-specialized-optimization-based-methods" class="level4">
<h4 class="anchored" data-anchor-id="more-specialized-optimization-based-methods">More specialized optimization-based methods:</h4>
<ol start="4" type="1">
<li>A 2020 classic, predecessor to GCG: <a href="https://arxiv.org/abs/2010.15980">Shin et al.&nbsp;2020, AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts</a></li>
<li>ARCA: <a href="https://arxiv.org/abs/2303.04381">Jones et al.&nbsp;2023, Automatically Auditing Large Language Models via Discrete Optimization</a></li>
<li>The gradient-based AutoDAN-Zhu: <a href="https://arxiv.org/abs/2310.15140v1">Zhu et al.&nbsp;2023, AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models.</a> (An important caveat is that the methods in this paper are unproven on safety-trained models. This paper’s benchmarks notably omit Llama-2.)</li>
<li>The mellowmax operator: <a href="https://arxiv.org/abs/1612.05628">Asadi and Littman 2016, An Alternative Softmax Operator for Reinforcement Learning</a></li>
</ol>
</section>
<section id="generating-attacks-using-llms-for-jailbreaking-with-fluency" class="level4">
<h4 class="anchored" data-anchor-id="generating-attacks-using-llms-for-jailbreaking-with-fluency">Generating attacks using LLMs for jailbreaking with fluency:</h4>
<ol start="8" type="1">
<li>Already a classic: <a href="https://arxiv.org/abs/2202.03286">Perez et al.&nbsp;2022, Red Teaming Language Models with Language Models</a></li>
<li>The LLM-based AutoDAN-Liu, which is a totally separate paper and approach from AutoDAN-Zhu above! <a href="https://arxiv.org/abs/2310.04451">Liu et al.&nbsp;2023, AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.</a></li>
<li><a href="https://arxiv.org/abs/2309.16797">Fernando et al.&nbsp;2023, Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution</a> This paper optimizes prompts for generic task performance. Red-teaming can be thought of as a special case.</li>
</ol>
</section>
<section id="various-tips-for-jailbreaking" class="level4">
<h4 class="anchored" data-anchor-id="various-tips-for-jailbreaking">Various tips for jailbreaking:</h4>
<ol start="11" type="1">
<li><a href="https://arxiv.org/abs/2307.02483">Wei, Haghtalab and Steinhardt 2023, Jailbroken: How Does LLM Safety Training Fail?</a> An excellent list of manual redteaming exploits.</li>
<li><a href="https://arxiv.org/abs/2311.03348">Shah et al.&nbsp;2023, Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation</a>.</li>
</ol>
</section>
<section id="crytographically-undetectable-trojan-insertion" class="level4">
<h4 class="anchored" data-anchor-id="crytographically-undetectable-trojan-insertion">Crytographically undetectable trojan insertion:</h4>
<ol start="13" type="1">
<li><a href="https://arxiv.org/abs/2204.06974">Goldwasser et al.&nbsp;2022, Planting Undetectable Backdoors in Machine Learning Models</a></li>
</ol>
</section>
<section id="trojan-recovery" class="level4">
<h4 class="anchored" data-anchor-id="trojan-recovery">Trojan recovery:</h4>
<ol start="14" type="1">
<li><a href="https://arxiv.org/abs/2206.07758">Haim et al.&nbsp;2023, “Reconstructing training data from trained neural networks.”</a></li>
<li><a href="https://arxiv.org/abs/2106.06469">Zheng et al.&nbsp;2021, “Topological detection of trojaned neural networks.”</a></li>
</ol>
</section></details>
<p>If you’re new to this area, we recommend starting with this Lil’log post: <a href="https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/">“Adversarial Attacks on LLMs” (Weng 2023)</a>. For a list of 15 papers we found useful and/or reference in this post, <a href="#references">click here</a>.</p>
</section>

<section id="summary-of-our-major-takeaways" class="level1">
<h1>Summary of Our Major Takeaways</h1>
<ol type="1">
@@ -320,7 +271,7 @@ <h1>Summary of Our Major Takeaways</h1>
<h1>Trojan Detection Track Takeaways</h1>
<section id="nobody-found-the-intended-trojans-but-top-teams-reliably-elicited-the-payloads." class="level4">
<h4 class="anchored" data-anchor-id="nobody-found-the-intended-trojans-but-top-teams-reliably-elicited-the-payloads.">1. <strong>Nobody found the intended trojans but top teams reliably elicited the payloads.</strong></h4>
<p>Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But an important part of the competition was distinguishing intended triggers from unintended ones, where the intended triggers are the <span class="math inline">\(p_n\)</span> used during the trojan insertion process. No participants succeeded at correctly identifying the intended triggers used by the adversary in training. Scores were composed of two parts: a “Reverse Engineering Attack Success Rate” (REASR), which tracked how often you could elicit the payload with <em>some</em> trigger phrase, and a second BLEU-based “recall” metric, which measures similarity with the intended triggers. On the recall metric, random inputs seem to yield scores of roughly 14–16% due to luck-based collisions with the true tokens. No competition participant achieved more than 17% recall. Our REASR scores on the final competition leaderboards were 97% and 98%, rather than the 100% we measured on our own system; the gap was due to a fixable fp16 nondeterminism issue involving a difference in batch sizes.</p>
<p>Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But an important part of the competition was distinguishing intended triggers from unintended ones, where the intended triggers are the <span class="math inline">\(p_n\)</span> used during the trojan insertion process. No participants succeeded at correctly identifying the intended triggers used by the adversary in training. Scores were composed of two parts: a “Reverse Engineering Attack Success Rate” (REASR), which tracked how often you could elicit the payload with <em>some</em> trigger phrase, and a second BLEU-based “recall” metric, which measures similarity with the intended triggers. On the recall metric, random inputs seem to yield scores of roughly 14–16% due to luck-based collisions with the true tokens. No competitors achieved more than 17% recall. Our REASR scores on the final competition leaderboards were 97% and 98%, rather than the 100% we measured on our own system; the gap was due to a fixable fp16 nondeterminism issue involving a difference in batch sizes.</p>
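<p>As a concrete illustration of the two scoring components, here is a minimal sketch of a REASR-style elicitation check and a BLEU-based recall. This is not the organizers’ official scorer; the <code>generate</code> helper and all names below are illustrative assumptions.</p>
<pre><code class="language-python"># Hedged sketch of the two TDC-style metrics described above.
# Assumes a generate(prompt) -> str helper that wraps the subject model;
# this helper and all names are illustrative, not the official TDC2023 scorer.
from sacrebleu import sentence_bleu


def reasr(guessed_triggers, payloads, generate):
    """Fraction of payloads elicited by *some* guessed trigger phrase."""
    hits = sum(payload in generate(trigger)
               for trigger, payload in zip(guessed_triggers, payloads))
    return hits / len(payloads)


def bleu_recall(guessed_triggers, intended_triggers):
    """Mean BLEU similarity between guessed triggers and the intended ones."""
    scores = [sentence_bleu(guess, [intended]).score / 100.0
              for guess, intended in zip(guessed_triggers, intended_triggers)]
    return sum(scores) / len(scores)
</code></pre>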
</section>
<section id="reverse-engineering-trojans-in-practice-seems-quite-hard." class="level4">
<h4 class="anchored" data-anchor-id="reverse-engineering-trojans-in-practice-seems-quite-hard.">2. <strong>Reverse engineering trojans “in practice” seems quite hard.</strong></h4>
@@ -461,6 +412,56 @@ <h4 class="anchored" data-anchor-id="tricks-that-we-found-to-improve-performance
</ul></li>
</ul></li>
</ul>
<p><a name="references"></a></p><a name="references">
</a></section><a name="references">
</a></section><a name="references">
</a><section id="references" class="level1"><a name="references">
<h1>References</h1>
</a><section id="baseline-methods-for-llm-optimizationattacks" class="level4"><a name="references">
<h4 class="anchored" data-anchor-id="baseline-methods-for-llm-optimizationattacks">Baseline methods for LLM optimization/attacks:</h4>
</a><ol type="1"><a name="references">
</a><li><a name="references"><strong>This paper introduces GCG (Greedy Coordinate Gradient)</strong>: </a><a href="https://arxiv.org/abs/2307.15043">Zou et al.&nbsp;2023, Universal and Transferable Adversarial Attacks on Aligned Language Models</a>**</li>
<li>The PEZ method: <a href="https://arxiv.org/abs/2302.03668">Wen et al.&nbsp;2023, Gradient-based discrete optimization for prompt tuning and discovery</a></li>
<li>The GBDA method: <a href="https://arxiv.org/abs/2104.13733">Guo et al.&nbsp;2021, Gradient-based adversarial attacks against text transformers</a></li>
</ol>
</section>
<section id="more-specialized-optimization-based-methods" class="level4">
<h4 class="anchored" data-anchor-id="more-specialized-optimization-based-methods">More specialized optimization-based methods:</h4>
<ol start="4" type="1">
<li>A 2020 classic, predecessor to GCG: <a href="https://arxiv.org/abs/2010.15980">Shin et al.&nbsp;2020, AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts</a></li>
<li>ARCA: <a href="https://arxiv.org/abs/2303.04381">Jones et al.&nbsp;2023, Automatically Auditing Large Language Models via Discrete Optimization</a></li>
<li>The gradient-based AutoDAN-Zhu: <a href="https://arxiv.org/abs/2310.15140v1">Zhu et al.&nbsp;2023, AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models.</a> (An important caveat is that the methods in this paper are unproven on safety-trained models. This paper’s benchmarks notably omit Llama-2.)</li>
<li>The mellowmax operator: <a href="https://arxiv.org/abs/1612.05628">Asadi and Littman 2016, An Alternative Softmax Operator for Reinforcement Learning</a></li>
</ol>
</section>
<section id="generating-attacks-using-llms-for-jailbreaking-with-fluency" class="level4">
<h4 class="anchored" data-anchor-id="generating-attacks-using-llms-for-jailbreaking-with-fluency">Generating attacks using LLMs for jailbreaking with fluency:</h4>
<ol start="8" type="1">
<li>Already a classic: <a href="https://arxiv.org/abs/2202.03286">Perez et al.&nbsp;2022, Red Teaming Language Models with Language Models</a></li>
<li>The LLM-based AutoDAN-Liu, which is a totally separate paper and approach from AutoDAN-Zhu above! <a href="https://arxiv.org/abs/2310.04451">Liu et al.&nbsp;2023, AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.</a></li>
<li><a href="https://arxiv.org/abs/2309.16797">Fernando et al.&nbsp;2023, Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution</a> This paper optimizes prompts for generic task performance. Red-teaming can be thought of as a special case.</li>
</ol>
</section>
<section id="various-tips-for-jailbreaking" class="level4">
<h4 class="anchored" data-anchor-id="various-tips-for-jailbreaking">Various tips for jailbreaking:</h4>
<ol start="11" type="1">
<li><a href="https://arxiv.org/abs/2307.02483">Wei, Haghtalab and Steinhardt 2023, Jailbroken: How Does LLM Safety Training Fail?</a> An excellent list of manual redteaming exploits.</li>
<li><a href="https://arxiv.org/abs/2311.03348">Shah et al.&nbsp;2023, Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation</a>.</li>
</ol>
</section>
<section id="crytographically-undetectable-trojan-insertion" class="level4">
<h4 class="anchored" data-anchor-id="crytographically-undetectable-trojan-insertion">Crytographically undetectable trojan insertion:</h4>
<ol start="13" type="1">
<li><a href="https://arxiv.org/abs/2204.06974">Goldwasser et al.&nbsp;2022, Planting Undetectable Backdoors in Machine Learning Models</a></li>
</ol>
</section>
<section id="trojan-recovery" class="level4">
<h4 class="anchored" data-anchor-id="trojan-recovery">Trojan recovery:</h4>
<ol start="14" type="1">
<li><a href="https://arxiv.org/abs/2206.07758">Haim et al.&nbsp;2023, “Reconstructing training data from trained neural networks.”</a></li>
<li><a href="https://arxiv.org/abs/2106.06469">Zheng et al.&nbsp;2021, “Topological detection of trojaned neural networks.”</a>
</li>
</ol>


</section>
@@ -721,7 +722,7 @@ <h4 class="anchored" data-anchor-id="tricks-that-we-found-to-improve-performance
});
</script>
</div> <!-- /content -->
<script>var lightboxQuarto = GLightbox({"selector":".lightbox","closeEffect":"zoom","loop":true,"descPosition":"bottom","openEffect":"zoom"});</script>
<script>var lightboxQuarto = GLightbox({"closeEffect":"zoom","selector":".lightbox","descPosition":"bottom","openEffect":"zoom","loop":true});</script>



2 changes: 1 addition & 1 deletion posts/catalog.html
@@ -814,7 +814,7 @@ <h2 class="anchored" data-anchor-id="github">GitHub</h2>
});
</script>
</div> <!-- /content -->
<script>var lightboxQuarto = GLightbox({"openEffect":"zoom","loop":true,"closeEffect":"zoom","selector":".lightbox","descPosition":"bottom"});</script>
<script>var lightboxQuarto = GLightbox({"loop":true,"descPosition":"bottom","closeEffect":"zoom","openEffect":"zoom","selector":".lightbox"});</script>



10 changes: 5 additions & 5 deletions posts/catalog.out.ipynb
@@ -297,7 +297,7 @@
"Pythia-12B is miscalibrated on 20% of the bigrams and 45% of the\n",
"trigrams when we ask for prediction of $p \\geq 0.45$."
],
"id": "062529bd-af2a-4a3e-b7ed-35bdc7bc1382"
"id": "f1bb479d-30a0-4a39-80e2-7169400a20c0"
},
{
"cell_type": "code",
@@ -313,7 +313,7 @@
}
],
"source": [],
"id": "5b05fa42-f1e3-450f-a99c-6b483129088a"
"id": "9cc411c7-5d5c-4b85-a3a8-51c1e21c4f27"
},
{
"cell_type": "markdown",
@@ -377,7 +377,7 @@
"The dataset is available on Huggingface:\n",
"[pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)"
],
"id": "f7ec37af-d06f-4c5f-89ce-610af6a75527"
"id": "98427d1c-5f1c-40c8-b51e-5dec1054b6f5"
},
{
"cell_type": "code",
Expand All @@ -391,7 +391,7 @@
}
],
"source": [],
"id": "5415a94a-58f3-40b3-a3bd-1907a8154425"
"id": "fffb3eea-7d05-4365-bcff-81781e630caa"
},
{
"cell_type": "markdown",
@@ -423,7 +423,7 @@
"Computational Linguistics, May 2022, pp. 95–136. doi:\n",
"[10.18653/v1/2022.bigscience-1.9](https://doi.org/10.18653/v1/2022.bigscience-1.9).</span>"
],
"id": "be2e28c3-b4fe-4345-abac-4bbf28f5bfe5"
"id": "b8a81d0c-e0f7-4a47-a238-5231e13e063b"
}
],
"nbformat": 4,
8 changes: 4 additions & 4 deletions sitemap.xml
@@ -2,18 +2,18 @@
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://confirmlabs.org/posts/catalog.html</loc>
<lastmod>2024-01-13T11:56:28.113Z</lastmod>
<lastmod>2024-01-17T21:34:32.576Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/posts/TDC2023.html</loc>
<lastmod>2024-01-13T11:56:24.957Z</lastmod>
<lastmod>2024-01-17T21:34:29.228Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/index.html</loc>
<lastmod>2024-01-13T11:56:23.513Z</lastmod>
<lastmod>2024-01-17T21:34:27.700Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/posts/fight_the_illusion.html</loc>
<lastmod>2024-01-13T11:56:25.629Z</lastmod>
<lastmod>2024-01-17T21:34:29.980Z</lastmod>
</url>
</urlset>
