Built site for gh-pages

Confirm-Solutions · Nov 29, 2023 · 8556511
1 parent 770a8fd
Showing 8 changed files with 701 additions and 19 deletions.
diff --git a/.nojekyll b/.nojekyll
@@ -1 +1 @@
-b47a07e9
+735c29ea
diff --git a/index.html b/index.html
@@ -143,7 +143,7 @@
 
 <div class="quarto-listing quarto-listing-container-grid" id="listing-listing">
 <div class="list grid quarto-listing-cols-3">
-<div class="g-col-1" data-index="0" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1701212277069" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
+<div class="g-col-1" data-index="0" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1701217746139" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
 <a href="./posts/catalog.html" class="quarto-grid-link">
 <div class="quarto-grid-item card h-100 card-left">
 <p class="card-img-top"><img src="posts/catalog_files/figure-html/cell-9-output-1.png" style="height: 150px;"  class="thumbnail-image card-img"/></p>

diff --git a/posts/catalog.html b/posts/catalog.html
@@ -798,7 +798,7 @@ <h2 class="anchored" data-anchor-id="github">GitHub</h2>
 });
 </script>
 </div> <!-- /content -->
-<script>var lightboxQuarto = GLightbox({"openEffect":"zoom","closeEffect":"zoom","selector":".lightbox","loop":true,"descPosition":"bottom"});</script>
+<script>var lightboxQuarto = GLightbox({"openEffect":"zoom","loop":true,"closeEffect":"zoom","selector":".lightbox","descPosition":"bottom"});</script>
 
 
 

diff --git a/posts/catalog.out.ipynb b/posts/catalog.out.ipynb
@@ -297,7 +297,7 @@
         "Pythia-12B is miscalibrated on 20% of the bigrams and 45% of the\n",
         "trigrams when we ask for prediction of $p \\geq 0.45$."
       ],
-      "id": "f36f6ed6-9ee1-45a4-813f-f7aa1a9ccf31"
+      "id": "eb864443-abf0-47c9-98d3-9d4872a4decd"
     },
     {
       "cell_type": "code",
@@ -313,7 +313,7 @@
         }
       ],
       "source": [],
-      "id": "bfa35923-febe-446a-9a5e-f39ea07b3212"
+      "id": "fb8e8227-8624-431c-9177-b82f0164343f"
     },
     {
       "cell_type": "markdown",
@@ -375,7 +375,7 @@
         "The dataset is available on Huggingface:\n",
         "[pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)"
       ],
-      "id": "59f80f9c-fde8-4e45-a1d0-24891d7b1894"
+      "id": "bb1d4eed-4a7c-48aa-9d74-f50efc4f485f"
     },
     {
       "cell_type": "code",
@@ -389,7 +389,7 @@
         }
       ],
       "source": [],
-      "id": "e476ceb5-9ae4-42ac-88f2-03ff711ee37b"
+      "id": "ed447629-d472-4afa-8683-d77f1bd49513"
     },
     {
       "cell_type": "markdown",
@@ -419,7 +419,7 @@
         "Charles Foster, Jason Phang, et al. 2020. “The Pile: An 800GB Dataset of\n",
         "Diverse Text for Language Modeling.” *arXiv Preprint arXiv:2101.00027*."
       ],
-      "id": "52528303-b716-4c57-b1f5-016d109fc868"
+      "id": "1c6bcc70-6c37-4f36-b6c8-b3cb3d7527ce"
     }
   ],
   "nbformat": 4,

diff --git a/posts/fight_the_illusion.html b/posts/fight_the_illusion.html
diff --git a/posts/fight_the_illusion.out.ipynb b/posts/fight_the_illusion.out.ipynb
@@ -0,0 +1,236 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# 6 Ways to Fight the Interpretability Illusion\n",
+        "\n",
+        "Michael Sklar  \n",
+        "2023-11-28"
+      ],
+      "id": "a4a42736-a8aa-4851-9207-3100bd99e88a"
+    },
+    {
+      "cell_type": "raw",
+      "metadata": {
+        "raw_mimetype": "text/html"
+      },
+      "source": [
+        "<!-----\n",
+        "\n",
+        "\n",
+        "\n",
+        "Conversion time: 0.504 seconds.\n",
+        "\n",
+        "\n",
+        "Using this Markdown file:\n",
+        "\n",
+        "1. Paste this output into your source file.\n",
+        "2. See the notes and action items below regarding this conversion run.\n",
+        "3. Check the rendered output (headings, lists, code blocks, tables) for proper\n",
+        "   formatting and use a linkchecker before you publish this page.\n",
+        "\n",
+        "Conversion notes:\n",
+        "\n",
+        "* Docs to Markdown version 1.0β35\n",
+        "* Tue Nov 28 2023 15:52:28 GMT-0800 (PST)\n",
+        "* Source doc: 6 ways to fight the Interpretability illusion\n",
+        "----->"
+      ],
+      "id": "4c0ddb5a-143e-4db3-b1f5-7df53cf6db3a"
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Recommended pre-reading:\n",
+        "\n",
+        "Atticus Geiger’s [DAS](https://arxiv.org/abs/2303.02536) and [Boundless\n",
+        "DAS](https://arxiv.org/pdf/2305.08809.pdf). LessWrong post [An\n",
+        "Interpretability Illusion for Activation Patching of Arbitrary\n",
+        "Subspaces](https://www.lesswrong.com/posts/RFtkRXHebkwxygDe2/an-interpretability-illusion-for-activation-patching-of).\n",
+        "Corresponding [ICLR paper, “Is This the Subspace You Are Looking\n",
+        "For?”](https://openreview.net/forum?id=Ebt7JgMHv1).\n",
+        "\n",
+        "\\_\\_  \n",
+        "  \n",
+        "This post is motivated by Lange, Makelov, and Nanda’s LessWrong post\n",
+        "[Interpretability Illusion for Activation\n",
+        "Patching](https://www.lesswrong.com/posts/RFtkRXHebkwxygDe2/an-interpretability-illusion-for-activation-patching-of)\n",
+        "and [ICLR paper](https://openreview.net/forum?id=Ebt7JgMHv1). They study\n",
+        "Geiger et al.’s [DAS](https://arxiv.org/abs/2303.02536) method, which\n",
+        "uses optimization to identify an abstracted causal model with a small\n",
+        "subset of dimensions in a neural network’s residual stream or internal\n",
+        "MLP layer. Their results show that DAS can, depending on the situation,\n",
+        "turn up both “correct” and “spurious” findings on the training set. From the\n",
+        "investigations in the [ICLR\n",
+        "paper](https://openreview.net/forum?id=Ebt7JgMHv1) and conversations\n",
+        "with a few researchers, my understanding is these “spurious” directions\n",
+        "have not performed well on held-out generalization sets, so in practice\n",
+        "it is easy to distinguish the “illusions” from “real effects”. But I am\n",
+        "interested in developing even stronger optimize-to-interpret methods.\n",
+        "With more powerful optimizers, illusion effects should be even stronger,\n",
+        "and competition from “spurious” signals may make “true” signals harder\n",
+        "to locate in training. So, here are 6 possible ways to fight against the\n",
+        "interpretability illusion. Most of them can be tried in combination.\n",
+        "\n",
+        "1.  **The causal model still holds, and may still be what we want.**\n",
+        "    I.e., we call it an interpretability “illusion” because we are\n",
+        "    failing to describe the model’s normal functioning. But unusual\n",
+        "    functioning is fine for some goals! Applications include:\n",
+        "\n",
+        "    1.  Finding latent circuits which might be targetable by optimized\n",
+        "        non-routine inputs (e.g. redteaming)\n",
+        "    2.  “Pinning” a false belief into the model, for testing or\n",
+        "        alignment training \\[e.g., forcing the model to believe it is\n",
+        "        not being watched, in order to test deception or escape\n",
+        "        behavior\\].\n",
+        "\n",
+        "    The key point is that the interpretability illusion is a failure to\n",
+        "    *describe typical model operation*, but a success for *enacting the\n",
+        "    causal model.*\n",
+        "\n",
+        "2.  **Study more detailed causal models with multiple output streams,\n",
+        "    multiple options for the input variables, or more compositions.** To\n",
+        "    start, notice that it is obviously good to have more\n",
+        "    outputs/consequences of the causal model in the optimization. Why?\n",
+        "    First, if we have multiple output-measurements at the end of the\n",
+        "    causal graph, it is harder for a spurious direction to perform well\n",
+        "    on all of them by chance. Additionally: if an abstract causal model\n",
+        "    has modular pieces, then there should be exponentially many\n",
+        "    combinatorial-swap options that we can test. To score well on the\n",
+        "    IIA train-loss across all swaps, a ‘spurious’ structure would have\n",
+        "    to be very sophisticated. While Lange et al. show that spurious\n",
+        "    solutions may arise for searches in 1 direction, it should be less\n",
+        "    likely to occur for *pairs* of directions, and less likely yet for\n",
+        "    full spurious circuits. So, illusion problems may be reduced by\n",
+        "    scaling up model complexity. Some possible issues remain, though:\n",
+        "\n",
+        "    1.  In some cases we may struggle to identify specific directions\n",
+        "        within a multi-part model; i.e., we might find convincing\n",
+        "        overall performance for a circuit, but an individual dimension\n",
+        "        or two could be spurious, and we might be unable to determine\n",
+        "        exactly which.\n",
+        "    2.  This approach relies on big, deep, abstract causal models\n",
+        "        existing inside the networks, with sufficient robustness in\n",
+        "        their functioning across variable changes. While there is some\n",
+        "        suggestive work on predictable / standardized structures in\n",
+        "        LLMs, from investigations like [Feng and Steinhardt\n",
+        "        (2023)](https://arxiv.org/pdf/2310.17191.pdf)’s entity binding\n",
+        "        case study, the\n",
+        "        [IOI](https://github.com/redwoodresearch/Easy-Transformer/blob/main/README.md)\n",
+        "        paper, and studies of [recursive\n",
+        "        tasks](https://arxiv.org/pdf/2305.14699.pdf), the\n",
+        "        consistency/robustness and DAS-discoverability of larger\n",
+        "        structures in scaled-up models are not yet clear. More case\n",
+        "        studies in larger models would be of value.\n",
+        "\n",
+        "3.  **Measure generalizability, and use it to filter out spurious\n",
+        "    findings after the fact.** This is just common sense, and\n",
+        "    researchers are already doing this in several ways. We can construct\n",
+        "    train/test splits with random sampling, and conclude a found\n",
+        "    direction is spurious if it does not generalize on the test data; or\n",
+        "    we could ask how the patched model generalizes\n",
+        "    out-of-training-distribution following a small perturbation, such as\n",
+        "    adding extra preceding tokens. Spurious solutions are likely to be\n",
+        "    sensitive to minor changes, and for many purposes we are primarily\n",
+        "    interested in causal models that generalize well. As mentioned\n",
+        "    earlier, the [ICLR](https://openreview.net/forum?id=Ebt7JgMHv1)\n",
+        "    paper’s “spurious” findings performed sufficiently poorly on\n",
+        "    generalization sets that they could easily be distinguished from\n",
+        "    real effects.\n",
+        "\n",
+        "4.  **Quantify a null distribution.** In the [“Illusion”\n",
+        "    post](https://www.lesswrong.com/posts/RFtkRXHebkwxygDe2/an-interpretability-illusion-for-activation-patching-of),\n",
+        "    Lange et al. show that the strength of the spurious signal depends\n",
+        "    on how many neurons the optimizer is allowed to search over. So, a very\n",
+        "    strong signal, taken over a small optimization set, should be more\n",
+        "    convincing. Thinking as statisticians, we could attempt to construct\n",
+        "    a “null distribution” for the spurious signals; this approach could\n",
+        "    offer evidence that a causal map element is being represented “at\n",
+        "    all.” One could imagine doing this kind of inference for individual\n",
+        "    *pieces* of a larger causal model, with different uncertainty bars\n",
+        "    for different components.\n",
+        "\n",
+        "5.  **Use unsupervised feature extraction as a first step**. Recent\n",
+        "    interpretability work with\n",
+        "    [auto-encoders](https://transformer-circuits.pub/2023/monosemantic-features)\n",
+        "    [suggests](https://arxiv.org/abs/2309.08600) that many of a small\n",
+        "    transformer’s most important features can be identified. If this\n",
+        "    technique scales well, it could ***vastly*** reduce the amount of\n",
+        "    optimization pressure needed to identify the right directions,\n",
+        "    shrinking the search space and reducing optimistic bias / spurious\n",
+        "    findings.\n",
+        "\n",
+        "6.  **Incorporate additional information as a prior / penalty for\n",
+        "    optimization.** As Lange et al. note in the [“Illusion”\n",
+        "    post](https://www.lesswrong.com/posts/RFtkRXHebkwxygDe2/an-interpretability-illusion-for-activation-patching-of),\n",
+        "    and as described in Section 5 of the [ICLR\n",
+        "    paper](https://openreview.net/forum?id=Ebt7JgMHv1), it is possible\n",
+        "    to supply additional evidence that a found direction is “faithful”\n",
+        "    (or not). In the case study with the IOI task, they argued the\n",
+        "    direction found by DAS on a residual layer fell within the query\n",
+        "    subspace of human-identified “name mover” heads. More generally, if\n",
+        "    intuitions about faithfulness can be scored with a quantitative\n",
+        "    metric, then tacking that metric onto the optimization as a penalty\n",
+        "    should help the optimizer favor “correct” directions over “spurious”\n",
+        "    solutions. Still, using this approach requires answering two\n",
+        "    difficult questions: what additional evidence to choose, and then\n",
+        "    how to quantify it? Some vague possibilities:\n",
+        "\n",
+        "    1.  Perhaps next-gen AI will offer accurate “auto-grading”, giving a\n",
+        "        general yet quantitative evaluation of plausibility of found\n",
+        "        solutions\n",
+        "    2.  Somehow draw information from analyzing very basic components of\n",
+        "        the network: punish “MLP-in-the-middle” solutions by using some\n",
+        "        combination of changes in MLP activations / attention,\n",
+        "        gradients, sizes of the induced changes in the residual stream,\n",
+        "        etc.\n",
+        "    3.  If we know of structures that “should be related” to the task,\n",
+        "        such as entity bindings ([Feng and Steinhardt\n",
+        "        (2023)](https://arxiv.org/pdf/2310.17191.pdf)), we can try to\n",
+        "        build outwards from them; or if we have a reliable feature\n",
+        "        dictionary from sparse auto-encoders or a “[belief\n",
+        "        graph](https://arxiv.org/pdf/2111.13654.pdf)” (per Hase et\n",
+        "        al. 2021) that offers advance predictions for how subsequent\n",
+        "        layers’ features may react to a change, we can penalize lack of\n",
+        "        correlation or causal effects on downstream features.\n",
+        "\n",
+        "    Unfortunately, information used in this way can no longer serve as\n",
+        "    independent validation. But if utilizing it prevents the\n",
+        "    optimization from getting stuck on false signals, the trade-off\n",
+        "    should be favorable.\n",
+        "\n",
+        "---\n",
+        "\n",
+        "Thanks to Atticus Geiger, Jing Huang, Ben Thompson, Zygimantas\n",
+        "Straznickas and others for conversations and feedback on earlier drafts.\n",
+        "\n",
+        "------------------------------------------------------------------------"
+      ],
+      "id": "665cbeb6-a7a8-4cdc-b87b-62d5dd00e489"
+    }
+  ],
+  "nbformat": 4,
+  "nbformat_minor": 5,
+  "metadata": {
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3 (ipykernel)",
+      "language": "python"
+    },
+    "language_info": {
+      "name": "python",
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": "3"
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.10.10"
+    }
+  }
+}
diff --git a/search.json b/search.json
@@ -1,4 +1,11 @@
 [
+  {
+    "objectID": "index.html",
+    "href": "index.html",
+    "title": "Confirm",
+    "section": "",
+    "text": "A catalog of several million tasks Pythia can do.\n\n\n\n\n\n\nT. Ben Thompson, Michael Sklar\n\n\nJun 25, 2023\n\n\n\n\n\n\n\n\nNo matching items"
+  },
   {
     "objectID": "posts/catalog.html",
     "href": "posts/catalog.html",
@@ -33,12 +40,5 @@
     "title": "A catalog of several million tasks Pythia can do.",
     "section": "GitHub",
     "text": "GitHub\nThe code to reproduce the datasets here is available at: https://github.com/Confirm-Solutions/catalog"
-  },
-  {
-    "objectID": "index.html",
-    "href": "index.html",
-    "title": "Confirm",
-    "section": "",
-    "text": "A catalog of several million tasks Pythia can do.\n\n\n\n\n\n\nT. Ben Thompson, Michael Sklar\n\n\nJun 25, 2023\n\n\n\n\n\n\n\n\nNo matching items"
   }
 ]
diff --git a/sitemap.xml b/sitemap.xml
@@ -1,11 +1,11 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
-    <loc>https://confirmlabs.org/posts/catalog.html</loc>
-    <lastmod>2023-11-28T22:58:10.225Z</lastmod>
+    <loc>https://confirmlabs.org/index.html</loc>
+    <lastmod>2023-11-29T00:29:32.823Z</lastmod>
   </url>
   <url>
-    <loc>https://confirmlabs.org/index.html</loc>
-    <lastmod>2023-11-28T22:58:07.441Z</lastmod>
+    <loc>https://confirmlabs.org/posts/catalog.html</loc>
+    <lastmod>2023-11-29T00:29:35.663Z</lastmod>
   </url>
 </urlset>