Built site for gh-pages

Confirm-Solutions · Nov 29, 2023 · f7b5f19
1 parent 8610267
Showing 7 changed files with 15 additions and 15 deletions.
diff --git a/.nojekyll b/.nojekyll
@@ -1 +1 @@
-fd316aa3
+a5fa4430
diff --git a/index.html b/index.html
@@ -143,7 +143,7 @@
 
 <div class="quarto-listing quarto-listing-container-grid" id="listing-listing">
 <div class="list grid quarto-listing-cols-3">
-<div class="g-col-1" data-index="0" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1701229445824" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
+<div class="g-col-1" data-index="0" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1701229501419" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
 <a href="./posts/catalog.html" class="quarto-grid-link">
 <div class="quarto-grid-item card h-100 card-left">
 <p class="card-img-top"><img src="posts/catalog_files/figure-html/cell-9-output-1.png" style="height: 150px;"  class="thumbnail-image card-img"/></p>

diff --git a/posts/catalog.html b/posts/catalog.html
@@ -798,7 +798,7 @@ <h2 class="anchored" data-anchor-id="github">GitHub</h2>
 });
 </script>
 </div> <!-- /content -->
-<script>var lightboxQuarto = GLightbox({"selector":".lightbox","closeEffect":"zoom","loop":true,"openEffect":"zoom","descPosition":"bottom"});</script>
+<script>var lightboxQuarto = GLightbox({"selector":".lightbox","descPosition":"bottom","closeEffect":"zoom","loop":true,"openEffect":"zoom"});</script>
 
 
 

diff --git a/posts/catalog.out.ipynb b/posts/catalog.out.ipynb
@@ -297,7 +297,7 @@
         "Pythia-12B is miscalibrated on 20% of the bigrams and 45% of the\n",
         "trigrams when we ask for prediction of $p \\geq 0.45$."
       ],
-      "id": "8166b4e1-fe28-4a80-b13b-40e262561d90"
+      "id": "10170b94-e26d-44e3-a77a-aaf9dc15a2c5"
     },
     {
       "cell_type": "code",
@@ -313,7 +313,7 @@
         }
       ],
       "source": [],
-      "id": "9a5d22eb-e826-4e06-8620-b2e9b3e9812e"
+      "id": "819430fe-bdf0-409c-ab58-fa645952a489"
     },
     {
       "cell_type": "markdown",
@@ -375,7 +375,7 @@
         "The dataset is available on Huggingface:\n",
         "[pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)"
       ],
-      "id": "c2618577-5464-4db9-a381-ebcb860c4b24"
+      "id": "9f0f2e6b-2702-4164-85da-3d4910a53bee"
     },
     {
       "cell_type": "code",
@@ -389,7 +389,7 @@
         }
       ],
       "source": [],
-      "id": "d49b8ca6-83a7-4d0e-93a6-c071bdeb33d2"
+      "id": "94ad5779-f7f6-45e9-a312-4fc7fcb0b068"
     },
     {
       "cell_type": "markdown",
@@ -419,7 +419,7 @@
         "Charles Foster, Jason Phang, et al. 2020. “The Pile: An 800GB Dataset of\n",
         "Diverse Text for Language Modeling.” *arXiv Preprint arXiv:2101.00027*."
       ],
-      "id": "9be8445b-e0de-44ee-add5-30d2114b7acc"
+      "id": "42230dda-9a8a-4433-ab56-d17fab69aead"
     }
   ],
   "nbformat": 4,

diff --git a/posts/fight_the_illusion.html b/posts/fight_the_illusion.html
@@ -167,7 +167,7 @@ <h1 class="title">6 Ways to Fight the Interpretability Illusion</h1>
 <li><a href="https://www.lesswrong.com/posts/RFtkRXHebkwxygDe2/an-interpretability-illusion-for-activation-patching-of">An Interpretability Illusion for Activation Patching of Arbitrary Subspaces</a>.</li>
 <li>The corresponding <a href="https://openreview.net/forum?id=Ebt7JgMHv1">ICLR paper, “Is This the Subspace You Are Looking For?</a>”</li>
 </ul>
-<p>__</p>
+<hr>
 <p>This post is motivated by Lange, Makelov, and Nanda’s LessWrong post <a href="https://www.lesswrong.com/posts/RFtkRXHebkwxygDe2/an-interpretability-illusion-for-activation-patching-of">Interpretability Illusion for Activation Patching</a> and <a href="https://openreview.net/forum?id=Ebt7JgMHv1">ICLR paper</a>. They study <a href="https://arxiv.org/abs/2303.02536">Geiger et al’s DAS</a> method, which uses optimization to identify an abstracted causal model with a small subset of dimensions in a neural network’s residual stream or internal MLP layer. Their results show that DAS can, depending on the situation, turn up both “correct” and “spurious” findings on the train-set. From the investigations in the <a href="https://openreview.net/forum?id=Ebt7JgMHv1">ICLR paper</a> and conversations with a few researchers, my understanding is these “spurious” directions have not performed well on held-out generalization sets, so in practice it is easy to distinguish the “illusions” from “real effects”. But, I am interested in developing even stronger optimize-to-interpret methods. With more powerful optimizers, illusion effects should be even stronger, and competition from spurious signals may make true signals harder to locate in training. So, here are 6 possible ways to fight against the interpretability illusion. Most of them can be tried in combination.</p>
 <ol type="1">
 <li><strong>The causal model still holds, and may still be what we want.</strong>: We call it an interpretability <em>illusion</em> because we are failing to describe the model’s normal functioning. But unusual functioning is fine for some goals! Applications include:

diff --git a/posts/fight_the_illusion.ipynb b/posts/fight_the_illusion.ipynb
@@ -9,7 +9,7 @@
         "Michael Sklar  \n",
         "2023-11-28"
       ],
-      "id": "424fbf11-ccd4-4568-ab4e-fddffb1f3ad2"
+      "id": "91cbd9f0-1b56-481f-9797-9e9e4ea86ef9"
     },
     {
       "cell_type": "raw",
@@ -38,7 +38,7 @@
         "* Source doc: 6 ways to fight the Interpretability illusion\n",
         "----->"
       ],
-      "id": "84ddb8d9-6186-4cfe-910e-6a7a3d14517f"
+      "id": "f0e12637-658b-4d98-880f-9d1548981580"
     },
     {
       "cell_type": "markdown",
@@ -53,7 +53,7 @@
         "-   The corresponding [ICLR paper, “Is This the Subspace You Are Looking\n",
         "    For?](https://openreview.net/forum?id=Ebt7JgMHv1)”\n",
         "\n",
-        "\\_\\_\n",
+        "------------------------------------------------------------------------\n",
         "\n",
         "This post is motivated by Lange, Makelov, and Nanda’s LessWrong post\n",
         "[Interpretability Illusion for Activation\n",
@@ -199,7 +199,7 @@
         "Thanks to Atticus Geiger, Jing Huang, Ben Thompson, Zygimantas\n",
         "Straznickas and others for conversations and feedback on earlier drafts."
       ],
-      "id": "3498b614-95a8-4bef-8255-56b161af2e4d"
+      "id": "e3dba39b-5650-40d7-ab8d-92969230deb7"
     }
   ],
   "nbformat": 4,

diff --git a/sitemap.xml b/sitemap.xml
@@ -2,10 +2,10 @@
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
     <loc>https://confirmlabs.org/index.html</loc>
-    <lastmod>2023-11-29T03:44:16.316Z</lastmod>
+    <lastmod>2023-11-29T03:45:11.999Z</lastmod>
   </url>
   <url>
     <loc>https://confirmlabs.org/posts/catalog.html</loc>
-    <lastmod>2023-11-29T03:44:19.160Z</lastmod>
+    <lastmod>2023-11-29T03:45:14.831Z</lastmod>
   </url>
 </urlset>