It would be good to support trained UMAPs that are linked to a particular embedding model. Similar to the way we train an SAE to "unfold" a representative set of directions in the embedding model's latent space, we could train a representative UMAP by trying to cover the space of the model.
The quality and the size of that UMAP may need to be explored, but it seems worth it to allow the user to map any dataset using a supported embedding model (like nomic-embed-text-v1.5) to the same 2D space. This would let you quickly see which parts of the model space "light up" for a given dataset.
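A sketch of the fit-once / transform-many workflow this would enable. umap-learn follows scikit-learn's fit/transform API, so the same pattern applies directly; here PCA stands in for UMAP so the sketch runs without umap-learn installed, and the vectors are random placeholders rather than real nomic-embed-text-v1.5 embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA  # stand-in reducer; umap.UMAP exposes the same fit/transform API

rng = np.random.default_rng(0)

# Placeholder for a representative sample covering the embedding
# model's latent space (e.g. 768-dim nomic-embed-text-v1.5 vectors).
representative = rng.normal(size=(1000, 768))

# Train the reducer once on the representative sample; it would be
# persisted alongside the embedding model it is linked to.
reducer = PCA(n_components=2).fit(representative)

# Any later dataset embedded with the same model can be projected into
# the same fixed 2D space, showing which regions "light up" for it.
new_dataset = rng.normal(size=(50, 768))
coords = reducer.transform(new_dataset)
print(coords.shape)  # (50, 2)
```

The key design point is that `fit` happens once per embedding model, while `transform` is cheap and repeatable per dataset, so every dataset lands in the same shared 2D coordinate frame.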
See https://enjalot.github.io/latent-taxonomy/ for an example of a UMAP calculated on the top activating samples for SAE features.