Add t-SNE embeddings and clusters #88

huddlej · 2024-08-28T22:34:36Z

Description of proposed changes

Adds rules, scripts, and config to produce joint t-SNE embeddings per build from all gene segments and find clusters from the resulting embeddings. When the user defines the `embedding` key in their build config, the workflow produces pairwise distances per gene segment, runs t-SNE on those distances, finds clusters with HDBSCAN, and exports the embedding coordinates and clusters in the Auspice JSON.
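To make the first step concrete, here is a minimal sketch of per-segment pairwise distances feeding one joint matrix. It is not the code in this PR: the function names, the toy alignments, the choice of Hamming distance that skips `N` and gap characters, and the rule of summing matrices across segments are all assumptions for illustration.

```python
from itertools import combinations


def hamming_distance(seq_a, seq_b, ignored=frozenset("N-")):
    """Count mismatches, skipping sites where either sequence is ambiguous or gapped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a not in ignored and b not in ignored
    )


def pairwise_distances(alignment, strains):
    """Build a symmetric distance matrix with rows and columns in the order of `strains`."""
    n = len(strains)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = hamming_distance(alignment[strains[i]], alignment[strains[j]])
        matrix[i][j] = matrix[j][i] = d
    return matrix


def combine_segments(alignments_by_segment, strains):
    """Sum per-segment matrices into one joint matrix (an assumed combination rule)."""
    n = len(strains)
    total = [[0] * n for _ in range(n)]
    for alignment in alignments_by_segment.values():
        segment_matrix = pairwise_distances(alignment, strains)
        for i in range(n):
            for j in range(n):
                total[i][j] += segment_matrix[i][j]
    return total


# Hypothetical toy alignments for two segments with the same three strains.
alignments = {
    "ha": {"A": "ACGT", "B": "ACGA", "C": "TCGA"},
    "na": {"A": "GGNC", "B": "GGTC", "C": "GATC"},
}
strains = ["A", "B", "C"]
joint = combine_segments(alignments, strains)
print(joint)
```

A joint matrix like this is the kind of input a t-SNE implementation that accepts precomputed distances could take, with HDBSCAN then run on the resulting coordinates. Fixing the strain order once and reusing it for every segment is what keeps the matrices aligned when they are combined.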

One major caveat of this implementation is that it requires (or at least benefits from) using the same strains for all gene segments. We could relax that requirement, but the resulting embeddings would only be as good as the number of strains shared across all segments. One way to strike a balance between strictly requiring the same strains in every analysis and allowing any strains would be to require the same strains and produce embeddings for only a subset of gene segments, even if we build trees for all segments. These options would be worth discussing.
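The trade-off above can be illustrated with a small sketch that intersects strain names across segments. The helper name, segment names, and strain names are hypothetical, and this is not how the workflow itself selects strains.

```python
def shared_strains(strains_by_segment, segments=None):
    """Return the sorted strains present in every selected segment.

    When `segments` is None, all segments in `strains_by_segment` are used.
    """
    selected = segments if segments is not None else list(strains_by_segment)
    strain_sets = [set(strains_by_segment[segment]) for segment in selected]
    if not strain_sets:
        return []
    return sorted(set.intersection(*strain_sets))


# Hypothetical strain lists per segment.
strains_by_segment = {
    "ha": ["s1", "s2", "s3", "s4"],
    "na": ["s1", "s2", "s3"],
    "pb2": ["s2", "s3", "s5"],
}

# Requiring every segment keeps only the strains found in all three.
all_segments = shared_strains(strains_by_segment)

# Restricting the embedding to a subset of segments keeps more strains.
ha_na_only = shared_strains(strains_by_segment, ["ha", "na"])

print(all_segments, ha_na_only)
```

In this toy case, embedding all three segments leaves two strains, while embedding only two of them leaves three, which is the balance the paragraph describes.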

This builds on the work from Nanduri et al.

Related issue(s)

nextstrain/seasonal-flu#176

Examples

The following screenshot is a tangletree between NP (left) and PB2 (right) from the GISAID build coloring tips by host to highlight the reassortment between these gene segments in the cattle outbreak:

This is the same view colored by HDBSCAN cluster from t-SNE applied to all 8 segments, showing that most of the cattle outbreak gets its own cluster (28):

The smaller clusters within the cattle outbreak (24 and 27) appear to come from diversity in other gene segments. Looking at these samples in the other gene trees, I found that cluster 24 captures a 45-sample clade in HA, MP, PB1, and NS. Below is the tangletree between NP and HA highlighting cluster 24.

I found that cluster 27 corresponds to a 12-sample clade in NA, NS, and PA. The tangletree below highlights cluster 27 in NP and NA trees.

Checklist

Checks pass


huddlej mentioned this pull request Nov 22, 2024

Remove pathogen-specific tools from base runtimes nextstrain/public#7

Open
