Streamtrees #1902
This is so cool @jameshadfield! I can give more detailed feedback soon, but I wanted to highlight just a few things on initial perusal:
(After conversation with James...) We want to be sure this viz approach is useful for being able to read evolutionary / epidemiological stories from the genetic data. If we construct streams from the And we can do things like color by S1 mutations like we often want to understand what's driving clade success. But in general I'd be trying to think about how a streamtree view would enable proper reading of stories like:
My guess is that streamtrees would work quite well for the epidemiological cases, but only when "clades" are created that correspond well to geographic transitions (note that tracking geography was where Pango lineages got their start). This could literally mean creating branch labels to mark geographic transitions from
Related to the above, I think the biggest design decision here is to enforce creating streams from existing branch labels. This effectively pushes the problem of which streams to allow into augur and machinery like
@jameshadfield --- A follow up thought here while on the topic of remit and structure for this feature. It would be amazing to be able to have https://nextstrain.org/nextclade/sars-cov-2/ but rather than each Pango lineage having a single circle, they would each have a stream. I think this could effectively be hacked into your current setup by creating a branch label on the branch immediately leading to each tip with the Pango lineage label and then throwing in a set of 1 to 100 representative strains from this Pango lineage which would each possess necessary metadata of collection date, S1 mutations, region, etc... These 1-100 representative strains would not have any tree structure and would just be a comb / polytomy replacing the single existing tip. The number of strains per Pango lineage would be input so that more frequent lineages get more strains and consequently wider streams. My main reason to bring this up: is a standard Auspice JSON with specific branch labels and polytomies of discrete strains the best way to encode this? I think so? There would be more efficient ways to encode this than a bag of strains if we had a single coloring to worry about, but if we want to allow different colorings, then treating this as a set of discrete strains with metadata is probably the way to go. I would imagine this scenario of collapsed tip distributions / streams to be a common one. We have the analogous issue with unique seasonal influenza HA haplotypes. I don't know if |
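A minimal sketch of the encoding described in this comment, assuming the Auspice v2 JSON node layout (`name`, `node_attrs`, `branch_attrs.labels`, `children`). The function names, the `pango` label key, and the proportional allocation rule are hypothetical illustrations, not part of this PR or of augur:

```python
def strains_per_lineage(counts, min_n=1, max_n=100):
    """Allocate representative strains per lineage, proportional to frequency.

    The most frequent lineage gets max_n strains; every lineage gets at
    least min_n. This scaling rule is an assumption for illustration.
    """
    top = max(counts.values())
    return {
        lineage: max(min_n, min(max_n, round(max_n * c / top)))
        for lineage, c in counts.items()
    }


def expand_tip_to_polytomy(tip, lineage, strains, label_key="pango"):
    """Replace a single tip with a labelled internal node over a comb of strains.

    `strains` is a list of dicts with `name` and `node_attrs` (collection
    date, region, S1 mutations, ...). The children carry no tree structure
    among themselves; they form a polytomy under the labelled branch.
    """
    children = [
        {"name": s["name"], "node_attrs": dict(s["node_attrs"]), "branch_attrs": {}}
        for s in strains
    ]
    return {
        "name": f"{tip['name']}_stream",
        "node_attrs": dict(tip.get("node_attrs", {})),
        # The branch label is what a streamtree view would partition on.
        "branch_attrs": {"labels": {label_key: lineage}},
        "children": children,
    }
```

Keeping each representative as a discrete strain with its own `node_attrs` is what allows any color-by to be applied to the stream, at the cost of a larger JSON than a per-lineage summary would need.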
As you say, if you have the logic to collapse to streams, you could explore collapsing to circles whose area is proportional to n and whose color uses the simple merged-color logic we use for phylogeographic uncertainty. This would allow further scaling. But I think it's fair to start with just streams, as getting this working is definitely the more complex avenue.
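For the circle idea, making area (rather than radius) proportional to n means the radius scales with the square root of n. A small sketch, where the function name and the `r_max` pixel cap are hypothetical:

```python
import math


def circle_radius(n, n_max, r_max=20.0):
    """Radius such that circle area is proportional to n.

    The largest group (n == n_max) is drawn with radius r_max.
    """
    return r_max * math.sqrt(n / n_max)
```

With this scaling, a group holding a quarter of the largest group's strains is drawn at half its radius.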
This is the first prototype of a long-running idea of mine (and others) for displaying big trees where our typical approach is too slow and/or runs out of pixels. This sketch was from many years ago:
This prototype implements streams by allowing branch labels to cut the tree into monophylies / paraphylies and visualising each as streamgraphs. While the motivation was for this to display huge trees which Auspice can't currently display, it works and is useful (and fun) for smaller trees as well. Long-term I think this would be a good starting view into very large or diverse datasets.
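The partitioning step can be sketched roughly as follows. This is not the Auspice implementation (which is JavaScript); it is an illustrative Python version assuming Auspice-style nodes with `branch_attrs.labels` and `children`, and the function name is hypothetical. Each labelled branch starts a new stream, and tips are assigned to their nearest labelled ancestor, so a parent stream keeps whatever is left after its labelled descendants are cut out (the paraphyletic case):

```python
def partition_by_label(root, label_key):
    """Group tip names into streams keyed by the nearest labelled ancestor branch.

    Tips with no labelled ancestor are collected under the key None.
    """
    streams = {}
    stack = [(root, None)]
    while stack:
        node, current = stack.pop()
        label = node.get("branch_attrs", {}).get("labels", {}).get(label_key)
        if label is not None:
            current = label  # this branch starts a new stream
        children = node.get("children", [])
        if not children:
            streams.setdefault(current, []).append(node["name"])
        # Push in reverse so tips are visited in left-to-right order.
        for child in reversed(children):
            stack.append((child, current))
    return streams
```

Each resulting group of tips could then be binned by date and stacked by color category to draw its streamgraph.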
How to use
nextstrain.org review app
Any dataset with a branch label (apart from amino acid) can have streamtrees. The dropdown in the sidebar can change the label used to partition the tree, and there's a toggle to go between streamtrees and normal tree view. My general view is that the appropriate partitioning of sample-sets will be best done in Augur, either algorithmically or manually. (There's also scope for dynamic partitioning in Auspice via genotype or color-by, but that's not implemented here.)
Suggested testing datasets
Known caveats / bugs
Future directions
Screenshots
Screen.Recording.2024-11-14.at.8.43.25.PM.mov
Screen.Recording.2024-11-14.at.8.40.30.PM.mov