Merge pull request #74 from WolodjaZ/master
Added paper  Scaling_Monosemanticity presented by Vladimir Zaigrajew
sobieskibj authored Jun 9, 2024
2 parents c5240fb + 779b613 commit 9cc33c5
Showing 3 changed files with 9 additions and 1 deletion.
8 changes: 8 additions & 0 deletions 2024/2024_06_03_Scaling_Monosemanticity/README.md
@@ -0,0 +1,8 @@
# Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesized cause of polysemanticity is superposition, where a neural network represents more features than it has neurons by assigning features to an overcomplete set of directions in activation space rather than to individual neurons. Recent work has attempted to resolve this problem through a dictionary learning approach, in which sparse autoencoders are trained on model activations to produce disentangled, interpretable, and monosemantic representations. We will explore how sparse autoencoders work, how they are applied, and how they are interpreted within large language models (LLMs), with a main focus on a recent research article from Anthropic. This article demonstrates that sparse autoencoders can scale with LLMs while producing monosemantic representations, and that these representations can be used to understand the model's inner workings.
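The dictionary learning setup described above can be sketched in a few lines: encode activations into an overcomplete, non-negative feature space, decode back, and train against reconstruction error plus an L1 sparsity penalty. This is a minimal illustration with made-up dimensions and randomly initialized weights, not Anthropic's actual architecture or training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: d_dict >> d_model makes the dictionary overcomplete,
# so features map to directions in activation space, not individual neurons.
d_model, d_dict = 512, 4096
W_enc = rng.standard_normal((d_model, d_dict)) * 0.01
b_enc = np.zeros(d_dict)
W_dec = rng.standard_normal((d_dict, d_model)) * 0.01
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU keeps feature activations non-negative; combined with the L1
    # penalty below, this pushes most features to zero on any given input.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    f = encode(x)
    x_hat = decode(f)
    recon = ((x - x_hat) ** 2).sum(axis=-1).mean()   # reconstruction error
    sparsity = l1_coeff * np.abs(f).sum(axis=-1).mean()  # L1 sparsity penalty
    return recon + sparsity

x = rng.standard_normal((8, d_model))  # a stand-in batch of model activations
print(sae_loss(x))
```

In practice the L1 coefficient trades off reconstruction fidelity against sparsity, and choosing it poorly leads to the shrinkage problem discussed in the last linked paper below.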

[Main paper](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)
[Previous paper from Anthropic about this approach](https://transformer-circuits.pub/2023/monosemantic-features/index.html)
[Paper introducing sparse autoencoders into LLMs](https://arxiv.org/abs/2309.08600)
[Additional paper addressing the shrinkage problem in sparse autoencoders](https://arxiv.org/abs/2404.16014)
Binary file not shown.
2 changes: 1 addition & 1 deletion README.md
@@ -23,7 +23,7 @@ Join us at https://meet.drwhy.ai.
* 13.05.2024 - [Introduction to ViT and transformer attributions](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2024/2024_05_13_Introduction_to_Visual_Transformers_and_Transformer_Attributions/) - Filip Kołodziejczyk
* 20.05.2024 - MI^2 PhD thesis presentations - Weronika Guzik, Katarzyna Kobylińska, Katarzyna Woźnica.
* 27.05.2024 - AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation - Maciej Chrabąszcz
* 03.06.2024 - Learning to Estimate Shapley Values with Vision Transformers - Vladimir Zaigrajew
* 03.06.2024 - [Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet](https://github.com/MI2DataLab/MI2DataLab_Seminarium/tree/master/2024/2024_06_03_Scaling_Monosemanticity) - Vladimir Zaigrajew
* 10.06.2024 - Semester summary and discussion.

### Winter semester
