
Commit bb400c4

Built site for gh-pages
Quarto GHA Workflow Runner committed Jan 17, 2024
1 parent 105f524 commit bb400c4
Showing 6 changed files with 68 additions and 67 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
0e5b2c24
6d86f1fb
6 changes: 3 additions & 3 deletions index.html
@@ -143,7 +143,7 @@

<div class="quarto-listing quarto-listing-container-grid" id="listing-listing">
<div class="list grid quarto-listing-cols-3">
<div class="g-col-1" data-index="0" data-listing-date-sort="1705104000000" data-listing-file-modified-sort="1705146960541" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="25">
<div class="g-col-1" data-index="0" data-listing-date-sort="1705104000000" data-listing-file-modified-sort="1705527255596" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="25">
<a href="./posts/TDC2023.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top"><img src="posts/TDC2023-sample-instances.png" style="height: 150px;" class="thumbnail-image card-img"/></p>
@@ -166,7 +166,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="1" data-listing-date-sort="1701302400000" data-listing-file-modified-sort="1705146960561" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<div class="g-col-1" data-index="1" data-listing-date-sort="1701302400000" data-listing-file-modified-sort="1705527255616" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<a href="./posts/fight_the_illusion.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<div class="listing-item-img-placeholder card-img-top" style="height: 150px;">&nbsp;</div>
@@ -189,7 +189,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="2" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1705146960561" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<div class="g-col-1" data-index="2" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1705527255616" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<a href="./posts/catalog.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top"><img src="posts/catalog_files/figure-html/cell-9-output-1.png" style="height: 150px;" class="thumbnail-image card-img"/></p>
107 changes: 54 additions & 53 deletions posts/TDC2023.html
@@ -121,6 +121,7 @@ <h2 id="toc-title">On this page</h2>
<li><a href="#summary-of-our-major-takeaways" id="toc-summary-of-our-major-takeaways" class="nav-link" data-scroll-target="#summary-of-our-major-takeaways">Summary of Our Major Takeaways</a></li>
<li><a href="#trojan-detection-track-takeaways" id="toc-trojan-detection-track-takeaways" class="nav-link" data-scroll-target="#trojan-detection-track-takeaways">Trojan Detection Track Takeaways</a></li>
<li><a href="#red-teaming-track-takeaways" id="toc-red-teaming-track-takeaways" class="nav-link" data-scroll-target="#red-teaming-track-takeaways">Red Teaming Track Takeaways</a></li>
<li><a href="#references" id="toc-references" class="nav-link" data-scroll-target="#references">References</a></li>
</ul>
</nav>
</nav>
@@ -233,58 +234,8 @@ <h4 class="anchored" data-anchor-id="why-are-these-tasks-hard"><strong>Why are t
</section>
<section id="prior-literature-on-adversarial-attacks" class="level1">
<h1>Prior Literature on Adversarial Attacks</h1>
<p>If you’re new to this area, we recommend starting with this Lil’log post: <a href="https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/">“Adversarial Attacks on LLMs” (Weng 2023)</a>.</p>
<details>
<summary>
<strong> Click here to expand a list of 15 papers we found useful and/or reference </strong>
</summary>
<section id="baseline-methods-for-llm-optimizationattacks" class="level4">
<h4 class="anchored" data-anchor-id="baseline-methods-for-llm-optimizationattacks">Baseline methods for LLM optimization/attacks:</h4>
<ol type="1">
<li><strong>This paper introduces GCG (Greedy Coordinate Gradient)</strong>: <a href="https://arxiv.org/abs/2307.15043">Zou et al.&nbsp;2023, Universal and Transferable Adversarial Attacks on Aligned Language Models</a></li>
<li>The PEZ method: <a href="https://arxiv.org/abs/2302.03668">Wen et al.&nbsp;2023, Gradient-based discrete optimization for prompt tuning and discovery</a></li>
<li>The GBDA method: <a href="https://arxiv.org/abs/2104.13733">Guo et al.&nbsp;2021, Gradient-based adversarial attacks against text transformers</a></li>
</ol>
</section>
<section id="more-specialized-optimization-based-methods" class="level4">
<h4 class="anchored" data-anchor-id="more-specialized-optimization-based-methods">More specialized optimization-based methods:</h4>
<ol start="4" type="1">
<li>A 2020 classic, predecessor to GCG: <a href="https://arxiv.org/abs/2010.15980">Shin et al.&nbsp;2020, AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts</a></li>
<li>ARCA: <a href="https://arxiv.org/abs/2303.04381">Jones et al.&nbsp;2023, Automatically Auditing Large Language Models via Discrete Optimization</a></li>
<li>The gradient-based AutoDAN-Zhu: <a href="https://arxiv.org/abs/2310.15140v1">Zhu et al.&nbsp;2023, AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models.</a> (An important caveat is that the methods in this paper are unproven on safety-trained models. This paper’s benchmarks notably omit Llama-2.)</li>
<li>The mellowmax operator: <a href="https://arxiv.org/abs/1612.05628">Asadi and Littman 2016, An Alternative Softmax Operator for Reinforcement Learning</a></li>
</ol>
</section>
<section id="generating-attacks-using-llms-for-jailbreaking-with-fluency" class="level4">
<h4 class="anchored" data-anchor-id="generating-attacks-using-llms-for-jailbreaking-with-fluency">Generating attacks using LLMs for jailbreaking with fluency:</h4>
<ol start="8" type="1">
<li>Already a classic: <a href="https://arxiv.org/abs/2202.03286">Perez et al.&nbsp;2022, Red Teaming Language Models with Language Models</a></li>
<li>The LLM-based AutoDAN-Liu, which is a totally separate paper and approach from AutoDAN-Zhu above! <a href="https://arxiv.org/abs/2310.04451">Liu et al.&nbsp;2023, AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.</a></li>
<li><a href="https://arxiv.org/abs/2309.16797">Fernando et al.&nbsp;2023, Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution</a> This paper optimizes prompts for generic task performance. Red-teaming can be thought of as a special case.</li>
</ol>
</section>
<section id="various-tips-for-jailbreaking" class="level4">
<h4 class="anchored" data-anchor-id="various-tips-for-jailbreaking">Various tips for jailbreaking:</h4>
<ol start="11" type="1">
<li><a href="https://arxiv.org/abs/2307.02483">Wei, Haghtalab and Steinhardt 2023, Jailbroken: How Does LLM Safety Training Fail?</a> An excellent list of manual redteaming exploits.</li>
<li><a href="https://arxiv.org/abs/2311.03348">Shah et al.&nbsp;2023, Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation</a>.</li>
</ol>
</section>
<section id="crytographically-undetectable-trojan-insertion" class="level4">
<h4 class="anchored" data-anchor-id="crytographically-undetectable-trojan-insertion">Crytographically undetectable trojan insertion:</h4>
<ol start="13" type="1">
<li><a href="https://arxiv.org/abs/2204.06974">Goldwasser et al.&nbsp;2022, Planting Undetectable Backdoors in Machine Learning Models</a></li>
</ol>
</section>
<section id="trojan-recovery" class="level4">
<h4 class="anchored" data-anchor-id="trojan-recovery">Trojan recovery:</h4>
<ol start="14" type="1">
<li><a href="https://arxiv.org/abs/2206.07758">Haim et al.&nbsp;2023, “Reconstructing training data from trained neural networks.”</a></li>
<li><a href="https://arxiv.org/abs/2106.06469">Zheng et al.&nbsp;2021, “Topological detection of trojaned neural networks.”</a></li>
</ol>
</section></details>
<p>If you’re new to this area, we recommend starting with this Lil’log post: <a href="https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/">“Adversarial Attacks on LLMs” (Weng 2023)</a>. For a list of 15 papers we found useful and/or reference in this post, <a href="#references">click here</a>.</p>
</section>

<section id="summary-of-our-major-takeaways" class="level1">
<h1>Summary of Our Major Takeaways</h1>
<ol type="1">
@@ -320,7 +271,7 @@ <h1>Summary of Our Major Takeaways</h1>
<h1>Trojan Detection Track Takeaways</h1>
<section id="nobody-found-the-intended-trojans-but-top-teams-reliably-elicited-the-payloads." class="level4">
<h4 class="anchored" data-anchor-id="nobody-found-the-intended-trojans-but-top-teams-reliably-elicited-the-payloads.">1. <strong>Nobody found the intended trojans but top teams reliably elicited the payloads.</strong></h4>
<p>Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But an important part of the competition was distinguishing intended triggers from unintended ones, where the intended triggers are the <span class="math inline">\(p_n\)</span> used during the trojan insertion process. No participants succeeded at correctly identifying the intended triggers used by the adversary in training. Scores were composed of two parts: a “Reverse Engineering Attack Success Rate” (REASR), which tracked how often you could elicit the payload with <em>some</em> trigger phrase, and a second BLEU-based “recall” metric, which measures similarity with the intended triggers. On the recall metric, random inputs seem to yield scores of roughly 14–16% due to luck-based collisions with the true tokens. No competition participant achieved more than 17% recall. Our REASR scores on the final competition leaderboards were 97% and 98%, rather than the 100% we measured on our own system; the gap was due to a fixable fp16 nondeterminism issue involving a difference in batch sizes.</p>
<p>Using GCG, we successfully elicited 100% of the payloads. Other top-performing teams used similar approaches with similar success! But an important part of the competition was distinguishing intended triggers from unintended ones, where the intended triggers are the <span class="math inline">\(p_n\)</span> used during the trojan insertion process. No participants succeeded at correctly identifying the intended triggers used by the adversary in training. Scores were composed of two parts: a “Reverse Engineering Attack Success Rate” (REASR), which tracked how often you could elicit the payload with <em>some</em> trigger phrase, and a second BLEU-based “recall” metric, which measures similarity with the intended triggers. On the recall metric, random inputs seem to yield scores of roughly 14–16% due to luck-based collisions with the true tokens. No competitors achieved more than 17% recall. Our REASR scores on the final competition leaderboards were 97% and 98%, rather than the 100% we measured on our own system; the gap was due to a fixable fp16 nondeterminism issue involving a difference in batch sizes.</p>
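<p>As a concrete illustration of the two scoring components, here is a minimal sketch of a REASR-style elicitation check and a BLEU-based recall. This is not the organizers’ official scorer; the <code>generate</code> helper and all names below are illustrative assumptions.</p>
<pre><code class="language-python"># Hedged sketch of the two TDC-style metrics described above.
# Assumes a generate(prompt) -> str helper that wraps the subject model;
# this helper and all names are illustrative, not the official TDC2023 scorer.
from sacrebleu import sentence_bleu


def reasr(guessed_triggers, payloads, generate):
    """Fraction of payloads elicited by *some* guessed trigger phrase."""
    hits = sum(payload in generate(trigger)
               for trigger, payload in zip(guessed_triggers, payloads))
    return hits / len(payloads)


def bleu_recall(guessed_triggers, intended_triggers):
    """Mean BLEU similarity between guessed triggers and the intended ones."""
    scores = [sentence_bleu(guess, [intended]).score / 100.0
              for guess, intended in zip(guessed_triggers, intended_triggers)]
    return sum(scores) / len(scores)
</code></pre>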
</section>
<section id="reverse-engineering-trojans-in-practice-seems-quite-hard." class="level4">
<h4 class="anchored" data-anchor-id="reverse-engineering-trojans-in-practice-seems-quite-hard.">2. <strong>Reverse engineering trojans “in practice” seems quite hard.</strong></h4>
@@ -461,6 +412,56 @@ <h4 class="anchored" data-anchor-id="tricks-that-we-found-to-improve-performance
</ul></li>
</ul></li>
</ul>
<p><a name="references"></a></p><a name="references">
</a></section><a name="references">
</a></section><a name="references">
</a><section id="references" class="level1"><a name="references">
<h1>References</h1>
</a><section id="baseline-methods-for-llm-optimizationattacks" class="level4"><a name="references">
<h4 class="anchored" data-anchor-id="baseline-methods-for-llm-optimizationattacks">Baseline methods for LLM optimization/attacks:</h4>
</a><ol type="1"><a name="references">
</a><li><a name="references"><strong>This paper introduces GCG (Greedy Coordinate Gradient)</strong>: </a><a href="https://arxiv.org/abs/2307.15043">Zou et al.&nbsp;2023, Universal and Transferable Adversarial Attacks on Aligned Language Models</a>**</li>
<li>The PEZ method: <a href="https://arxiv.org/abs/2302.03668">Wen et al.&nbsp;2023, Gradient-based discrete optimization for prompt tuning and discovery</a></li>
<li>The GBDA method: <a href="https://arxiv.org/abs/2104.13733">Guo et al.&nbsp;2021, Gradient-based adversarial attacks against text transformers</a></li>
</ol>
</section>
<section id="more-specialized-optimization-based-methods" class="level4">
<h4 class="anchored" data-anchor-id="more-specialized-optimization-based-methods">More specialized optimization-based methods:</h4>
<ol start="4" type="1">
<li>A 2020 classic, predecessor to GCG: <a href="https://arxiv.org/abs/2010.15980">Shin et al.&nbsp;2020, AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts</a></li>
<li>ARCA: <a href="https://arxiv.org/abs/2303.04381">Jones et al.&nbsp;2023, Automatically Auditing Large Language Models via Discrete Optimization</a></li>
<li>The gradient-based AutoDAN-Zhu: <a href="https://arxiv.org/abs/2310.15140v1">Zhu et al.&nbsp;2023, AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models.</a> (An important caveat is that the methods in this paper are unproven on safety-trained models. This paper’s benchmarks notably omit Llama-2.)</li>
<li>The mellowmax operator: <a href="https://arxiv.org/abs/1612.05628">Asadi and Littman 2016, An Alternative Softmax Operator for Reinforcement Learning</a></li>
</ol>
</section>
<section id="generating-attacks-using-llms-for-jailbreaking-with-fluency" class="level4">
<h4 class="anchored" data-anchor-id="generating-attacks-using-llms-for-jailbreaking-with-fluency">Generating attacks using LLMs for jailbreaking with fluency:</h4>
<ol start="8" type="1">
<li>Already a classic: <a href="https://arxiv.org/abs/2202.03286">Perez et al.&nbsp;2022, Red Teaming Language Models with Language Models</a></li>
<li>The LLM-based AutoDAN-Liu, which is a totally separate paper and approach from AutoDAN-Zhu above! <a href="https://arxiv.org/abs/2310.04451">Liu et al.&nbsp;2023, AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models.</a></li>
<li><a href="https://arxiv.org/abs/2309.16797">Fernando et al.&nbsp;2023, Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution</a> This paper optimizes prompts for generic task performance. Red-teaming can be thought of as a special case.</li>
</ol>
</section>
<section id="various-tips-for-jailbreaking" class="level4">
<h4 class="anchored" data-anchor-id="various-tips-for-jailbreaking">Various tips for jailbreaking:</h4>
<ol start="11" type="1">
<li><a href="https://arxiv.org/abs/2307.02483">Wei, Haghtalab and Steinhardt 2023, Jailbroken: How Does LLM Safety Training Fail?</a> An excellent list of manual redteaming exploits.</li>
<li><a href="https://arxiv.org/abs/2311.03348">Shah et al.&nbsp;2023, Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation</a>.</li>
</ol>
</section>
<section id="crytographically-undetectable-trojan-insertion" class="level4">
<h4 class="anchored" data-anchor-id="crytographically-undetectable-trojan-insertion">Crytographically undetectable trojan insertion:</h4>
<ol start="13" type="1">
<li><a href="https://arxiv.org/abs/2204.06974">Goldwasser et al.&nbsp;2022, Planting Undetectable Backdoors in Machine Learning Models</a></li>
</ol>
</section>
<section id="trojan-recovery" class="level4">
<h4 class="anchored" data-anchor-id="trojan-recovery">Trojan recovery:</h4>
<ol start="14" type="1">
<li><a href="https://arxiv.org/abs/2206.07758">Haim et al.&nbsp;2023, “Reconstructing training data from trained neural networks.”</a></li>
<li><a href="https://arxiv.org/abs/2106.06469">Zheng et al.&nbsp;2021, “Topological detection of trojaned neural networks.”</a>
</li>
</ol>


</section>
@@ -721,7 +722,7 @@ <h4 class="anchored" data-anchor-id="tricks-that-we-found-to-improve-performance
});
</script>
</div> <!-- /content -->
<script>var lightboxQuarto = GLightbox({"selector":".lightbox","closeEffect":"zoom","loop":true,"descPosition":"bottom","openEffect":"zoom"});</script>
<script>var lightboxQuarto = GLightbox({"closeEffect":"zoom","selector":".lightbox","descPosition":"bottom","openEffect":"zoom","loop":true});</script>



2 changes: 1 addition & 1 deletion posts/catalog.html
@@ -814,7 +814,7 @@ <h2 class="anchored" data-anchor-id="github">GitHub</h2>
});
</script>
</div> <!-- /content -->
<script>var lightboxQuarto = GLightbox({"openEffect":"zoom","loop":true,"closeEffect":"zoom","selector":".lightbox","descPosition":"bottom"});</script>
<script>var lightboxQuarto = GLightbox({"loop":true,"descPosition":"bottom","closeEffect":"zoom","openEffect":"zoom","selector":".lightbox"});</script>



10 changes: 5 additions & 5 deletions posts/catalog.out.ipynb
@@ -297,7 +297,7 @@
"Pythia-12B is miscalibrated on 20% of the bigrams and 45% of the\n",
"trigrams when we ask for prediction of $p \\geq 0.45$."
],
"id": "062529bd-af2a-4a3e-b7ed-35bdc7bc1382"
"id": "f1bb479d-30a0-4a39-80e2-7169400a20c0"
},
{
"cell_type": "code",
@@ -313,7 +313,7 @@
}
],
"source": [],
"id": "5b05fa42-f1e3-450f-a99c-6b483129088a"
"id": "9cc411c7-5d5c-4b85-a3a8-51c1e21c4f27"
},
{
"cell_type": "markdown",
@@ -377,7 +377,7 @@
"The dataset is available on Huggingface:\n",
"[pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)"
],
"id": "f7ec37af-d06f-4c5f-89ce-610af6a75527"
"id": "98427d1c-5f1c-40c8-b51e-5dec1054b6f5"
},
{
"cell_type": "code",
Expand All @@ -391,7 +391,7 @@
}
],
"source": [],
"id": "5415a94a-58f3-40b3-a3bd-1907a8154425"
"id": "fffb3eea-7d05-4365-bcff-81781e630caa"
},
{
"cell_type": "markdown",
@@ -423,7 +423,7 @@
"Computational Linguistics, May 2022, pp. 95–136. doi:\n",
"[10.18653/v1/2022.bigscience-1.9](https://doi.org/10.18653/v1/2022.bigscience-1.9).</span>"
],
"id": "be2e28c3-b4fe-4345-abac-4bbf28f5bfe5"
"id": "b8a81d0c-e0f7-4a47-a238-5231e13e063b"
}
],
"nbformat": 4,
8 changes: 4 additions & 4 deletions sitemap.xml
@@ -2,18 +2,18 @@
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://confirmlabs.org/posts/catalog.html</loc>
<lastmod>2024-01-13T11:56:28.113Z</lastmod>
<lastmod>2024-01-17T21:34:32.576Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/posts/TDC2023.html</loc>
<lastmod>2024-01-13T11:56:24.957Z</lastmod>
<lastmod>2024-01-17T21:34:29.228Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/index.html</loc>
<lastmod>2024-01-13T11:56:23.513Z</lastmod>
<lastmod>2024-01-17T21:34:27.700Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/posts/fight_the_illusion.html</loc>
<lastmod>2024-01-13T11:56:25.629Z</lastmod>
<lastmod>2024-01-17T21:34:29.980Z</lastmod>
</url>
</urlset>
