Commit

Built site for gh-pages
Quarto GHA Workflow Runner committed Nov 30, 2023
1 parent a06b8ce commit 8fb2631
Showing 9 changed files with 83 additions and 50 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
9f165d84
b3fbba8a
25 changes: 24 additions & 1 deletion index.html
@@ -143,7 +143,30 @@

<div class="quarto-listing quarto-listing-container-grid" id="listing-listing">
<div class="list grid quarto-listing-cols-3">
<div class="g-col-1" data-index="0" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1701229733344" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<div class="g-col-1" data-index="0" data-listing-date-sort="1701302400000" data-listing-file-modified-sort="1701380287742" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<a href="./posts/fight_the_illusion.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<div class="listing-item-img-placeholder card-img-top" style="height: 150px;">&nbsp;</div>
<div class="card-body post-contents">
<h5 class="no-anchor card-title listing-title">
6 Ways to Fight the Interpretability Illusion
</h5>
<div class="card-text listing-description">
Recommended pre-reading:
</div>
<div class="card-attribution card-text-small justify">
<div class="listing-author">
Michael Sklar
</div>
<div class="listing-date">
Nov 30, 2023
</div>
</div>
</div>
</div>
</a>
</div>
<div class="g-col-1" data-index="1" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1701380287738" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<a href="./posts/catalog.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top"><img src="posts/catalog_files/figure-html/cell-9-output-1.png" style="height: 150px;" class="thumbnail-image card-img"/></p>
1 change: 1 addition & 0 deletions listings.json
@@ -2,6 +2,7 @@
{
"listing": "/index.html",
"items": [
"/posts/fight_the_illusion.html",
"/posts/catalog.html"
]
}
2 changes: 1 addition & 1 deletion posts/catalog.html
@@ -798,7 +798,7 @@ <h2 class="anchored" data-anchor-id="github">GitHub</h2>
});
</script>
</div> <!-- /content -->
<script>var lightboxQuarto = GLightbox({"selector":".lightbox","openEffect":"zoom","descPosition":"bottom","loop":true,"closeEffect":"zoom"});</script>
<script>var lightboxQuarto = GLightbox({"selector":".lightbox","loop":true,"closeEffect":"zoom","openEffect":"zoom","descPosition":"bottom"});</script>



10 changes: 5 additions & 5 deletions posts/catalog.out.ipynb
@@ -297,7 +297,7 @@
"Pythia-12B is miscalibrated on 20% of the bigrams and 45% of the\n",
"trigrams when we ask for prediction of $p \\geq 0.45$."
],
"id": "b1b12a79-1f9e-4b08-8a4f-de67972b8b34"
"id": "f019decb-fb79-4678-b49c-f8ad10685733"
},
{
"cell_type": "code",
@@ -313,7 +313,7 @@
}
],
"source": [],
"id": "77bdcbf8-047e-4347-a779-20da2c9f9647"
"id": "361f55f2-44c0-472e-a8b5-c8ca911ab255"
},
{
"cell_type": "markdown",
@@ -375,7 +375,7 @@
"The dataset is available on Huggingface:\n",
"[pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)"
],
"id": "5b2c0851-897a-4172-92a6-9d23dbef318d"
"id": "49ad1337-e733-40ea-8807-f388dd9d2967"
},
{
"cell_type": "code",
@@ -389,7 +389,7 @@
}
],
"source": [],
"id": "8195339d-751e-456c-bbc0-958d25079862"
"id": "3811463e-28a2-4942-8f7d-6d2a31cb6dbb"
},
{
"cell_type": "markdown",
@@ -419,7 +419,7 @@
"Charles Foster, Jason Phang, et al. 2020. “The Pile: An 800GB Dataset of\n",
"Diverse Text for Language Modeling.” *arXiv Preprint arXiv:2101.00027*."
],
"id": "ca8a9289-bbe1-41f9-8569-13f08076feb7"
"id": "48252ae7-e749-4523-ad56-d1da5482653b"
}
],
"nbformat": 4,
27 changes: 12 additions & 15 deletions posts/fight_the_illusion.html
@@ -6,7 +6,7 @@

<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

<meta name="dcterms.date" content="2023-11-28">
<meta name="dcterms.date" content="2023-11-30">

<title>Confirm - 6 Ways to Fight the Interpretability Illusion</title>
<style>
@@ -64,10 +64,10 @@

<link rel="stylesheet" href="../styles.css">
<meta name="citation_title" content="6 Ways to Fight the Interpretability Illusion">
<meta name="citation_publication_date" content="2023-11-28">
<meta name="citation_cover_date" content="2023-11-28">
<meta name="citation_publication_date" content="2023-11-30">
<meta name="citation_cover_date" content="2023-11-30">
<meta name="citation_year" content="2023">
<meta name="citation_online_date" content="2023-11-28">
<meta name="citation_online_date" content="2023-11-30">
<meta name="citation_fulltext_html_url" content="https://confirmlabs.org/posts/fight_the_illusion.html">
<meta name="citation_language" content="en">
<meta name="citation_reference" content="citation_title=GPT-NeoX-20B: An open-source autoregressive language model;,citation_abstract=We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license. It is, to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission. In this work, we describe GPT-NeoX-20B’s architecture and training, and evaluate its performance. We open-source the training and evaluation code, as well as the model weights, at https://github.com/EleutherAI/gpt-neox.;,citation_author=Sidney Black;,citation_author=Stella Biderman;,citation_author=Eric Hallahan;,citation_author=Quentin Anthony;,citation_author=Leo Gao;,citation_author=Laurence Golding;,citation_author=Horace He;,citation_author=Connor Leahy;,citation_author=Kyle McDonell;,citation_author=Jason Phang;,citation_author=Michael Pieler;,citation_author=Usvsn Sai Prashanth;,citation_author=Shivanshu Purohit;,citation_author=Laria Reynolds;,citation_author=Jonathan Tow;,citation_author=Ben Wang;,citation_author=Samuel Weinbach;,citation_publication_date=2022-05;,citation_cover_date=2022-05;,citation_year=2022;,citation_fulltext_html_url=https://aclanthology.org/2022.bigscience-1.9;,citation_doi=10.18653/v1/2022.bigscience-1.9;,citation_conference_title=Proceedings of BigScience episode #5 – workshop on challenges &amp;amp;amp; perspectives in creating large language models;,citation_conference=Association for Computational Linguistics;">
@@ -130,7 +130,7 @@ <h1 class="title">6 Ways to Fight the Interpretability Illusion</h1>
<div>
<div class="quarto-title-meta-heading">Published</div>
<div class="quarto-title-meta-contents">
<p class="date">November 28, 2023</p>
<p class="date">November 30, 2023</p>
</div>
</div>

@@ -143,11 +143,8 @@ <h1 class="title">6 Ways to Fight the Interpretability Illusion</h1>

<!-----
Conversion time: 0.504 seconds.
Using this Markdown file:
1. Paste this output into your source file.
@@ -163,7 +160,7 @@ <h1 class="title">6 Ways to Fight the Interpretability Illusion</h1>
----->
<p>Recommended pre-reading:</p>
<ul>
<li>Atticus Geiger’s <a href="https://arxiv.org/abs/2303.02536">DAS</a> and <a href="https://arxiv.org/pdf/2305.08809.pdf">Boundless DAS</a>.</li>
<li>Geiger et al.’s <a href="https://arxiv.org/abs/2303.02536">DAS</a> and <a href="https://arxiv.org/pdf/2305.08809.pdf">Boundless DAS</a>.</li>
<li><a href="https://www.lesswrong.com/posts/RFtkRXHebkwxygDe2/an-interpretability-illusion-for-activation-patching-of">An Interpretability Illusion for Activation Patching of Arbitrary Subspaces</a>.</li>
<li>The corresponding <a href="https://openreview.net/forum?id=Ebt7JgMHv1">ICLR paper, “Is This the Subspace You Are Looking For?</a></li>
</ul>
@@ -176,7 +173,7 @@ <h1 class="title">6 Ways to Fight the Interpretability Illusion</h1>
<li>“Pinning” a false belief into the model, for testing or alignment training. For example, forcing the model to believe it is not being watched, in order to test deception or escape behavior.</li>
</ul>
The key point is that the interpretability illusion is a failure to <em>describe typical model operation</em>, but a success for <em>enacting the causal model</em>.</li>
<li><strong>Study more detailed causal models with multiple output streams, multiple options for the input variables, or more compositions.</strong> To start, notice that it is obviously good to have more outputs/consequences of the causal mode in the optimization. Why? First, if we have multiple output-measurements at the end of the causal graph, it is harder for a spurious direction to perform well on all of them by chance. Additionally, if an abstract causal model has modular pieces, then there should be exponentially many combinatorial-swap options that we can test. To score well on the IIA train-loss across all swaps, a spurious structure would have to be very sophisticated. While Lange et al.&nbsp;show that spurious solutions may arise for searches in 1 direction, it should be less likely to occur for <em>pairs</em> of directions, and less likely yet for full spurious circuits. So, illusion problems may be reduced by scaling up model complexity. Some possible issues remain, though:
<li><strong>Study more detailed causal models with multiple output streams, multiple options for the input variables, or more compositions.</strong> To start, notice that it is obviously good to have more outputs/consequences of the causal model in the optimization. Why? First, if we have multiple output-measurements at the end of the causal graph, it is harder for a spurious direction to perform well on all of them by chance. Additionally, if an abstract causal model has modular pieces, then there should be exponentially many combinatorial-swap options that we can test. To score well on the optimization’s training-loss across all swaps (in the language of <a href="https://arxiv.org/abs/2303.02536">DAS</a> this is a high “IIA”), the spurious structure would have to be very sophisticated. While Lange et al.&nbsp;show that spurious solutions may arise for searches in 1 direction, it should be less likely to occur for <em>pairs</em> of directions, and less likely yet for full spurious circuits. So, illusion problems may be reduced by scaling up the complexity of the causal model. Some possible issues remain, though:
<ul>
<li>In some cases we may struggle to identify specific directions within a multi-part model; i.e., we might find convincing overall performance for a circuit, but an individual dimension or two could be spurious, and we might be unable to determine exactly which.</li>
<li>This approach relies on big, deep, abstract causal models existing inside the networks, with sufficient robustness in their functioning across variable changes. There is some suggestive work on predictable / standardized structures in LLM’s, from investigations like <a href="https://arxiv.org/pdf/2310.17191.pdf">Feng and Steinhardt (2023</a>)’s entity binding case study, the <a href="https://github.com/redwoodresearch/Easy-Transformer/blob/main/README.md">indirect object identification (IOI)</a> paper, and studies of <a href="https://arxiv.org/pdf/2305.14699.pdf">recursive tasks</a>. However, the consistency/robustness and DAS-discoverability of larger structures in scaled-up models is not yet clear. More case studies in larger models would be valuable.</li>
@@ -187,26 +184,26 @@ <h1 class="title">6 Ways to Fight the Interpretability Illusion</h1>
<li><strong>Incorporate additional information as a prior / penalty for optimization.</strong> As Lange et al.&nbsp;note in the <a href="https://www.lesswrong.com/posts/RFtkRXHebkwxygDe2/an-interpretability-illusion-for-activation-patching-of">“Illusion” post</a>, and as described in Section 5 of the <a href="https://openreview.net/forum?id=Ebt7JgMHv1">ICLR paper</a>, it is possible to supply additional evidence that a found direction is faithful or not. In the case study with the IOI task, they argued the direction found by DAS on a residual layer fell within the query subspace of human-identified name mover heads. More generally, if intuitions about faithfulness can be scored with a quantitative metric, then tacking that metric onto the optimization as a penalty should help the optimizer favor correct directions over spurious solutions. Still, using this approach requires answering two difficult questions: what additional evidence to choose, and then how to quantify it? Some rough possibilities:
<ul>
<li>If we know of structures that should be related to the task, such as entity bindings (<a href="https://arxiv.org/pdf/2310.17191.pdf">Feng and Steinhardt (2023)</a>), we can try to build outwards from them; or if we have a reliable feature dictionary from sparse auto-encoders or “<a href="https://arxiv.org/pdf/2111.13654.pdf">belief graph</a>” per Hase et al.&nbsp;2021 which offers advance predictions for how subsequent layers’ features may react to a change, we can penalize lack of correlation or causal effects on downstream features.</li>
<li>Somehow draw information from analyzing very basic components of the network: punish “MLP-in-the-middle” solutions by using some combination of changes in MLP activations / attention, gradients, sizes of the induced changes in the residual stream, etc.</li>
<li>Somehow use the basic structure of the network to quantify which directions are ‘dormant.’ Although this sounds simple, I am unsure how to do it, given the indeterminacy of what a ‘dormant’ direction even means (this issue is described in Lange et al.’s lesswrong <a href="https://www.lesswrong.com/posts/RFtkRXHebkwxygDe2/an-interpretability-illusion-for-activation-patching-of">post</a>, Appendix section: the importance of correct model units.)</li>
<li>Perhaps next-gen AI will offer accurate “auto-grading”, giving a general yet quantitative evaluation of plausibility of found solutions</li>
</ul>
Using extra information in this way unfortunately spends its usability for validation. But if extra information prevents the optimization from getting stuck on false signals, the trade-off should be favorable.</li>
Using extra information in this way unfortunately spends its usability for validation. But preventing the optimization from getting stuck on spurious signals may be the higher priority.</li>
</ol>
<hr>
<p>Thanks to Atticus Geiger, Jing Huang, Ben Thompson, Zygimantas Straznickas and others for conversations and feedback on earlier drafts.</p>
<p>Thanks to Atticus Geiger, Jing Huang, Zhengxuan Wu, Ben Thompson, Zygimantas Straznickas and others for conversations and feedback on earlier drafts.</p>



<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{sklar2023,
author = {Sklar, Michael},
title = {6 {Ways} to {Fight} the {Interpretability} {Illusion}},
date = {2023-11-28},
date = {2023-11-30},
url = {https://confirmlabs.org/posts/fight_the_illusion.html},
langid = {en}
}
</code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-sklar2023" class="csl-entry quarto-appendix-citeas" role="listitem">
Sklar, Michael. 2023. <span>“6 Ways to Fight the Interpretability
Illusion.”</span> November 28, 2023. <a href="https://confirmlabs.org/posts/fight_the_illusion.html">https://confirmlabs.org/posts/fight_the_illusion.html</a>.
Illusion.”</span> November 30, 2023. <a href="https://confirmlabs.org/posts/fight_the_illusion.html">https://confirmlabs.org/posts/fight_the_illusion.html</a>.
</div></div></section></div></main> <!-- /main -->
<script id="quarto-html-after-body" type="application/javascript">
window.document.addEventListener("DOMContentLoaded", function (event) {