From f7b5f191b5c8f6073f3110567316326bf12bd1bd Mon Sep 17 00:00:00 2001 From: Quarto GHA Workflow Runner Date: Wed, 29 Nov 2023 03:45:16 +0000 Subject: [PATCH] Built site for gh-pages --- .nojekyll | 2 +- index.html | 2 +- posts/catalog.html | 2 +- posts/catalog.out.ipynb | 10 +++++----- posts/fight_the_illusion.html | 2 +- posts/fight_the_illusion.ipynb | 8 ++++---- sitemap.xml | 4 ++-- 7 files changed, 15 insertions(+), 15 deletions(-) diff --git a/.nojekyll b/.nojekyll index 9f16e9d..149c1db 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -fd316aa3 \ No newline at end of file +a5fa4430 \ No newline at end of file diff --git a/index.html b/index.html index 5a5dd73..659eeab 100644 --- a/index.html +++ b/index.html @@ -143,7 +143,7 @@
-
+

diff --git a/posts/catalog.html b/posts/catalog.html index d421d9f..472dcf1 100644 --- a/posts/catalog.html +++ b/posts/catalog.html @@ -798,7 +798,7 @@

GitHub

});
- + diff --git a/posts/catalog.out.ipynb b/posts/catalog.out.ipynb index 7850d3b..aa10b07 100644 --- a/posts/catalog.out.ipynb +++ b/posts/catalog.out.ipynb @@ -297,7 +297,7 @@ "Pythia-12B is miscalibrated on 20% of the bigrams and 45% of the\n", "trigrams when we ask for prediction of $p \\geq 0.45$." ], - "id": "8166b4e1-fe28-4a80-b13b-40e262561d90" + "id": "10170b94-e26d-44e3-a77a-aaf9dc15a2c5" }, { "cell_type": "code", @@ -313,7 +313,7 @@ } ], "source": [], - "id": "9a5d22eb-e826-4e06-8620-b2e9b3e9812e" + "id": "819430fe-bdf0-409c-ab58-fa645952a489" }, { "cell_type": "markdown", @@ -375,7 +375,7 @@ "The dataset is available on Huggingface:\n", "[pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)" ], - "id": "c2618577-5464-4db9-a381-ebcb860c4b24" + "id": "9f0f2e6b-2702-4164-85da-3d4910a53bee" }, { "cell_type": "code", @@ -389,7 +389,7 @@ } ], "source": [], - "id": "d49b8ca6-83a7-4d0e-93a6-c071bdeb33d2" + "id": "94ad5779-f7f6-45e9-a312-4fc7fcb0b068" }, { "cell_type": "markdown", @@ -419,7 +419,7 @@ "Charles Foster, Jason Phang, et al. 2020. “The Pile: An 800GB Dataset of\n", "Diverse Text for Language Modeling.” *arXiv Preprint arXiv:2101.00027*." ], - "id": "9be8445b-e0de-44ee-add5-30d2114b7acc" + "id": "42230dda-9a8a-4433-ab56-d17fab69aead" } ], "nbformat": 4, diff --git a/posts/fight_the_illusion.html b/posts/fight_the_illusion.html index 3b844bc..888983d 100644 --- a/posts/fight_the_illusion.html +++ b/posts/fight_the_illusion.html @@ -167,7 +167,7 @@

6 Ways to Fight the Interpretability Illusion

  • An Interpretability Illusion for Activation Patching of Arbitrary Subspaces.
  • The corresponding ICLR paper, “Is This the Subspace You Are Looking For?”
  • -

    __

    +

    This post is motivated by Lange, Makelov, and Nanda’s LessWrong post Interpretability Illusion for Activation Patching and ICLR paper. They study Geiger et al.’s DAS method, which uses optimization to identify an abstracted causal model with a small subset of dimensions in a neural network’s residual stream or internal MLP layer. Their results show that DAS can, depending on the situation, turn up both “correct” and “spurious” findings on the train-set. From the investigations in the ICLR paper and conversations with a few researchers, my understanding is these “spurious” directions have not performed well on held-out generalization sets, so in practice it is easy to distinguish the “illusions” from “real effects”. But I am interested in developing even stronger optimize-to-interpret methods. With more powerful optimizers, illusion effects should be even stronger, and competition from spurious signals may make true signals harder to locate in training. So, here are 6 possible ways to fight against the interpretability illusion. Most of them can be tried in combination.

    1. The causal model still holds, and may still be what we want.: We call it an interpretability illusion because we are failing to describe the model’s normal functioning. But unusual functioning is fine for some goals! Applications include: diff --git a/posts/fight_the_illusion.ipynb b/posts/fight_the_illusion.ipynb index bd137b7..923c0b0 100644 --- a/posts/fight_the_illusion.ipynb +++ b/posts/fight_the_illusion.ipynb @@ -9,7 +9,7 @@ "Michael Sklar \n", "2023-11-28" ], - "id": "424fbf11-ccd4-4568-ab4e-fddffb1f3ad2" + "id": "91cbd9f0-1b56-481f-9797-9e9e4ea86ef9" }, { "cell_type": "raw", @@ -38,7 +38,7 @@ "* Source doc: 6 ways to fight the Interpretability illusion\n", "----->" ], - "id": "84ddb8d9-6186-4cfe-910e-6a7a3d14517f" + "id": "f0e12637-658b-4d98-880f-9d1548981580" }, { "cell_type": "markdown", @@ -53,7 +53,7 @@ "- The corresponding [ICLR paper, “Is This the Subspace You Are Looking\n", " For?](https://openreview.net/forum?id=Ebt7JgMHv1)”\n", "\n", - "\\_\\_\n", + "------------------------------------------------------------------------\n", "\n", "This post is motivated by Lange, Makelov, and Nanda’s LessWrong post\n", "[Interpretability Illusion for Activation\n", @@ -199,7 +199,7 @@ "Thanks to Atticus Geiger, Jing Huang, Ben Thompson, Zygimantas\n", "Straznickas and others for conversations and feedback on earlier drafts." ], - "id": "3498b614-95a8-4bef-8255-56b161af2e4d" + "id": "e3dba39b-5650-40d7-ab8d-92969230deb7" } ], "nbformat": 4, diff --git a/sitemap.xml b/sitemap.xml index 133de9c..3970fb1 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,10 +2,10 @@ https://confirmlabs.org/index.html - 2023-11-29T03:44:16.316Z + 2023-11-29T03:45:11.999Z https://confirmlabs.org/posts/catalog.html - 2023-11-29T03:44:19.160Z + 2023-11-29T03:45:14.831Z