Add t-SNE embeddings and clusters #88
Draft
+231
−6
Description of proposed changes
Adds rules, scripts, and config to produce joint t-SNE embeddings per build from all gene segments and to find clusters in the resulting embeddings. When the user defines the `embedding` key in their build config, the workflow produces pairwise distances per gene segment, runs t-SNE on those distances, finds clusters with HDBSCAN, and exports the embedding coordinates and clusters in the Auspice JSON.

One major caveat of this implementation is that it requires (or at least benefits from) using the same strains for all genes. Although we could relax that requirement, the resulting embeddings would only be as good as the number of strains shared across all genes. One way to strike a balance between strictly requiring the same strains in every segment and allowing any strains would be to require the same strains and produce embeddings for only a subset of gene segments, even if we build trees for all segments. These options would be worth discussing.
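To make the shared-strain caveat concrete, here is a minimal sketch of the distance step, not the code in this PR. It assumes per-segment distances are Hamming distances over aligned sequences and that segments are combined by summing; the description does not state either choice. The names `hamming`, `joint_distances`, and the toy `alignments` are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatches between two aligned sequences, skipping gaps and Ns."""
    valid = set("ACGT")
    return sum(1 for x, y in zip(a, b) if x != y and x in valid and y in valid)


def joint_distances(alignments):
    """Sum per-segment pairwise distances over strains present in every segment.

    alignments maps segment name -> {strain name -> aligned sequence}.
    Returns the sorted shared strains and a symmetric distance matrix.
    """
    # Only strains found in all segments can contribute to a joint embedding.
    shared = sorted(set.intersection(*(set(seqs) for seqs in alignments.values())))
    n = len(shared)
    matrix = [[0] * n for _ in range(n)]
    for seqs in alignments.values():
        for i, j in combinations(range(n), 2):
            d = hamming(seqs[shared[i]], seqs[shared[j]])
            matrix[i][j] += d
            matrix[j][i] += d
    return shared, matrix


# Toy data (hypothetical): strain D is missing from HA, so it is dropped.
alignments = {
    "ha": {"A": "ACGT", "B": "ACGA", "C": "TCGA"},
    "na": {"A": "GGCC", "B": "GGCC", "C": "GGTC", "D": "GGTT"},
}
strains, matrix = joint_distances(alignments)
```

In this sketch, any strain missing from even one segment is excluded, which is why the embedding can only be as good as the set of shared strains. A matrix like this could then be passed to a t-SNE implementation that accepts precomputed distances, with HDBSCAN run on the resulting coordinates.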
This builds on the work from Nanduri et al.
Related issue(s)
nextstrain/seasonal-flu#176
Examples
The following screenshot is a tangletree between NP (left) and PB2 (right) from the GISAID build, with tips colored by host to highlight the reassortment between these gene segments in the cattle outbreak:
This is the same view colored by HDBSCAN cluster from t-SNE applied to all 8 segments, showing that most of the cattle outbreak gets its own cluster (cluster 28):
The smaller clusters within the cattle outbreak (24 and 27) appear to come from diversity in other gene segments. Looking at these samples in the other gene trees, I found that cluster 24 captures a 45-sample clade in HA, MP, PB1, and NS. Below is the tangletree between NP and HA highlighting cluster 24.
I found that cluster 27 corresponds to a 12-sample clade in NA, NS, and PA. The tangletree below highlights cluster 27 in the NP and NA trees.
Checklist