
Add t-SNE embeddings and clusters #88

Draft: wants to merge 1 commit into base: master

@huddlej huddlej commented Aug 28, 2024

Description of proposed changes

Adds rules, scripts, and config to produce joint t-SNE embeddings per build from all gene segments and to find clusters from the resulting embeddings. When the user defines the `embedding` key in their build config, the workflow produces pairwise distances per gene segment, runs t-SNE on those distances, finds clusters with HDBSCAN, and exports the embedding coordinates and clusters in the Auspice JSON.
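As a rough illustration of the steps above (not the scripts in this PR), the sketch below computes pairwise Hamming distances per segment, combines them, and then runs t-SNE and HDBSCAN on the result. Summing the per-segment matrices is an assumption about how a joint embedding could be built, and the function names, the toy alignments, and the use of scikit-learn's `TSNE` and `HDBSCAN` (scikit-learn 1.3 or later) are illustrative choices only.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatched positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(alignment, strains):
    """Build a symmetric distance matrix for one segment, ordered by `strains`."""
    n = len(strains)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = hamming(alignment[strains[i]], alignment[strains[j]])
        matrix[i][j] = matrix[j][i] = d
    return matrix


def joint_distances(alignments, strains):
    """Sum per-segment distance matrices (assumed way to combine segments)."""
    n = len(strains)
    total = [[0] * n for _ in range(n)]
    for alignment in alignments.values():
        segment_matrix = pairwise_distances(alignment, strains)
        for i in range(n):
            for j in range(n):
                total[i][j] += segment_matrix[i][j]
    return total


def embed_and_cluster(distances, perplexity=30.0, min_cluster_size=5, random_state=0):
    """Run t-SNE on a precomputed distance matrix, then cluster the 2D coordinates.

    Requires numpy and scikit-learn >= 1.3; perplexity must be smaller than
    the number of strains. HDBSCAN labels unclustered strains as -1.
    """
    import numpy as np
    from sklearn.cluster import HDBSCAN
    from sklearn.manifold import TSNE

    matrix = np.asarray(distances, dtype=float)
    coords = TSNE(
        n_components=2,
        metric="precomputed",
        init="random",  # PCA initialization is not available for precomputed distances
        perplexity=perplexity,
        random_state=random_state,
    ).fit_transform(matrix)
    labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(coords)
    return coords, labels


# Toy aligned sequences for two hypothetical segments and three strains.
example_strains = ["A", "B", "C"]
example_alignments = {
    "ha": {"A": "ACGT", "B": "ACGA", "C": "TCGA"},
    "na": {"A": "GG", "B": "GG", "C": "GC"},
}
example_joint = joint_distances(example_alignments, example_strains)
```

Every segment must contain every strain in `strains` for this to work, which is the same-strains requirement discussed below.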

One major caveat of this implementation is that it requires (or at least benefits from) using the same strains for all genes. Although we could relax that requirement, the resulting embeddings would only be as good as the number of strains shared across all genes. One way to strike a balance between strictly requiring the same strains in every analysis and allowing any strains would be to require the same strains and produce embeddings for only a subset of gene segments, even if we build trees for all segments. These options would be worth discussing.
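To make the tradeoff concrete, here is a minimal sketch of how the set of strains usable for a joint embedding shrinks or grows depending on which segments are required to share strains. The function name and the example strain lists are hypothetical and are not taken from the workflow.

```python
def shared_strains(strains_by_segment, segments=None):
    """Return the sorted strains present in every selected segment.

    `segments` optionally restricts the intersection to a subset of segments;
    by default all segments are used.
    """
    selected = segments if segments is not None else list(strains_by_segment)
    strain_sets = [set(strains_by_segment[segment]) for segment in selected]
    if not strain_sets:
        return []
    return sorted(set.intersection(*strain_sets))


# Hypothetical strain lists per segment.
example_strains_by_segment = {
    "ha": ["A", "B", "C"],
    "na": ["B", "C", "D"],
    "pb2": ["C"],
}

# Requiring all segments leaves only the strains common to every segment.
all_segments = shared_strains(example_strains_by_segment)

# Restricting the embedding to a subset of segments keeps more strains.
subset = shared_strains(example_strains_by_segment, segments=["ha", "na"])
```

In this toy case the full intersection keeps one strain while the two-segment subset keeps two, which is the kind of gain the subset option would be trading against using less of the genome.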

This builds on the work from Nanduri et al.

Related issue(s)

nextstrain/seasonal-flu#176

Examples

The following screenshot is a tangletree between NP (left) and PB2 (right) from the GISAID build, with tips colored by host to highlight the reassortment between these gene segments in the cattle outbreak:

image

This is the same view colored by HDBSCAN cluster from t-SNE applied to all 8 segments, showing that most of the cattle outbreak gets its own cluster (28):

image

The smaller clusters within the cattle outbreak (24 and 27) appear to come from diversity in other gene segments. Looking at these samples in the other gene trees, I found that cluster 24 captures a 45-sample clade in HA, MP, PB1, and NS. Below is the tangletree between NP and HA highlighting cluster 24.

image

I found that cluster 27 corresponds to a 12-sample clade in NA, NS, and PA. The tangletree below highlights cluster 27 in NP and NA trees.

image

Checklist

  • Checks pass
