diff --git a/2024/2024_06_03_Scaling_Monosemanticity/README.md b/2024/2024_06_03_Scaling_Monosemanticity/README.md
new file mode 100644
index 0000000..0b8e327
--- /dev/null
+++ b/2024/2024_06_03_Scaling_Monosemanticity/README.md
@@ -0,0 +1,8 @@
+# Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
+
+One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesized cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space rather than to individual neurons. Recent work has made progress on this problem through a dictionary learning approach, in which sparse autoencoders are trained on model activations to produce disentangled, interpretable, monosemantic representations. We will explore how sparse autoencoders work and how they are applied and interpreted within large language models (LLMs), with a main focus on a recent research article from Anthropic. The article demonstrates how sparse autoencoders scale with LLMs while still producing monosemantic representations, and how these representations can be used to understand the model's internal representations.
+
+* [Main paper](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)
+* [Previous paper from Anthropic on this approach](https://transformer-circuits.pub/2023/monosemantic-features/index.html)
+* [Paper introducing sparse autoencoders for LLMs](https://arxiv.org/abs/2309.08600)
+* [Additional paper addressing the shrinkage problem in sparse autoencoders](https://arxiv.org/abs/2404.16014)
diff --git a/2024/2024_06_03_Scaling_Monosemanticity/Scaling_Monosemanticity.pdf b/2024/2024_06_03_Scaling_Monosemanticity/Scaling_Monosemanticity.pdf
new file mode 100644
index 0000000..9dc5587
Binary files /dev/null and b/2024/2024_06_03_Scaling_Monosemanticity/Scaling_Monosemanticity.pdf differ
diff --git a/README.md b/README.md
index 63bd5e4..1c84ac4 100644
--- a/README.md
+++ b/README.md
@@ -23,7 +23,7 @@ Join us at https://meet.drwhy.ai.
 * 13.05.2024 - [Introduction to ViT and transformer attributions](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2024/2024_05_13_Introduction_to_Visual_Transformers_and_Transformer_Attributions/) - Filip Kołodziejczyk
 * 20.05.2024 - MI^2 PhD thesis presentations - Weronika Guzik, Katarzyna Kobylińska, Katarzyna Woźnica.
 * 27.05.2024 - AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation - Maciej Chrabąszcz
-* 03.06.2024 - Learning to Estimate Shapley Values with Vision Transformers - Vladimir Zaigrajew
+* 03.06.2024 - [Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2024/2024_06_03_Scaling_Monosemanticity) - Vladimir Zaigrajew
 * 10.06.2024 - Semester summary and discussion.
 
 ### Winter semester
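
As background for the dictionary learning approach described in the new README, here is a minimal, illustrative sparse autoencoder sketch in PyTorch. All names, sizes, and coefficients (`d_model`, `n_features`, `l1_coef`) are placeholder assumptions, not values from the papers above, and the papers' actual training setups are more elaborate (e.g., decoder bias handling and resampling of dead features).

```python
# Minimal sparse autoencoder (SAE) sketch for LLM activations.
# Assumed, illustrative hyperparameters; not taken from the cited papers.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        # Overcomplete dictionary: many more features than activation dims.
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        # Feature activations: non-negative codes over the dictionary.
        f = torch.relu(self.encoder(x))
        # Reconstruction of the original model activation.
        x_hat = self.decoder(f)
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coef: float = 1e-3):
    # Reconstruction error keeps the codes faithful to the activations;
    # the L1 penalty on the codes enforces sparsity, so each activation
    # is explained by only a few feature directions.
    return ((x - x_hat) ** 2).mean() + l1_coef * f.abs().mean()


# Toy training step on random vectors standing in for activations
# collected from an LLM's residual stream.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(64, 512)  # batch of activation vectors
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
opt.zero_grad()
loss.backward()
opt.step()
```

The overcomplete hidden layer (`n_features` much larger than `d_model`) plays the role of the dictionary: each decoder column is a candidate feature direction in activation space, which is how this approach can recover more features than the model has neurons.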