Merge pull request #74 from WolodjaZ/master
Added paper Scaling_Monosemanticity presented by Vladimir Zaigrajew
Showing 3 changed files with 9 additions and 1 deletion.
# Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesized cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space rather than to individual neurons. Recent work has successfully addressed this problem through a dictionary learning approach, in which sparse autoencoders are trained on model activations to produce disentangled, interpretable, and monosemantic representations. We will explore how sparse autoencoders work, how they are applied, and how they are interpreted within large language models (LLMs), with a main focus on a recent research article from Anthropic. This article demonstrates how sparse autoencoders can scale with LLMs to produce monosemantic representations and how these representations can be used to understand the model's inner workings.
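
To make the dictionary learning idea concrete, below is a minimal PyTorch sketch of a sparse autoencoder trained on model activations. The layer sizes, L1 coefficient, and the single toy training step are illustrative assumptions for this presentation, not the paper's actual implementation.

```python
# Minimal sketch of a sparse autoencoder on model activations.
# Dimensions and hyperparameters are illustrative, not from the paper.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Overcomplete dictionary: n_features >> d_model
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # Sparse, non-negative feature activations
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coef=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity,
    # pushing each learned feature toward a single, interpretable meaning.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coef * features.abs().sum(dim=-1).mean()
    return mse + sparsity

# Toy usage on random tensors standing in for residual-stream activations.
sae = SparseAutoencoder(d_model=512, n_features=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)          # one batch of "model activations"
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```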
[Main paper](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)
[Previous paper from Anthropic about this approach](https://transformer-circuits.pub/2023/monosemantic-features/index.html)
[Paper introducing sparse autoencoders into LLMs](https://arxiv.org/abs/2309.08600)
[Additional paper addressing the shrinkage problem in sparse autoencoders](https://arxiv.org/abs/2404.16014)