fix: explicitly specify how the root and ambiguous states are handled… #1690
… during sequence reconstruction and mutation counting
Running this on an mpox clade-I build which currently assigns all root mutations to the basal branch leading to clade Ib.
[Image: This PR, branch leading to clade Ia]
[Image: This PR, branch leading to clade Ib]
[Image: Pre-PR behaviour]
So it's a lot better, and perhaps this PR is correct within its remit, but something's not quite right. The total number of inferred ATGC-ATGC mutations (73) is unchanged, but this doesn't match the refined branch lengths (65). Perhaps this is because masking is used in the refine alignment but not considered when converting subs/site to a number of mutations?

Secondly, and this is just for my own interest, how are the gap mutations being allocated among sister branches? I'd have thought clade IIb gets …

I've got some tests here which use small contrived (50nt-ish) genomes where we can reason about the mutations, so I'll take a look at them later.
I thought the branch length in mutations was simply the subs/site scaled by some scalar, but it's actually doing the full inference and counting mutations, and this inference is modified by this PR (I missed this). Rerunning mpox including the refine step makes the assigned mutations match the mutation count. The allocation of gap mutations also follows the branch lengths.
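To illustrate the distinction discussed above between scaling subs/site by a constant and counting inferred mutations directly, here is a minimal sketch. It is not the code from this PR or from the underlying library: the function names, the toy sequences, and the rule of counting only A/C/G/T to A/C/G/T changes (skipping `N` and gaps) are assumptions made for illustration.

```python
# Hypothetical helpers, for illustration only; not the PR's implementation.
VALID_NUCLEOTIDES = set("ACGT")


def count_nucleotide_mutations(parent: str, child: str) -> int:
    """Count sites where parent and child differ and both states are A/C/G/T.

    Sites where either sequence carries an ambiguous state (e.g. N) or a gap
    are skipped, so they never contribute to the branch's mutation count.
    """
    if len(parent) != len(child):
        raise ValueError("parent and child sequences must be aligned")
    return sum(
        1
        for p, c in zip(parent.upper(), child.upper())
        if p != c and p in VALID_NUCLEOTIDES and c in VALID_NUCLEOTIDES
    )


def scaled_branch_length(subs_per_site: float, n_sites: int) -> float:
    """Naive alternative: convert subs/site to mutations by multiplying by length."""
    return subs_per_site * n_sites


if __name__ == "__main__":
    parent = "ACGTN-A"
    child = "ATGTAAC"
    # Only the C->T and A->C sites are counted; the N and gap sites are skipped.
    print(count_nucleotide_mutations(parent, child))
    # The scaled value need not be an integer and ignores masked sites entirely.
    print(scaled_branch_length(0.3, len(parent)))
```

The two numbers can disagree whenever sites are masked or ambiguous in one calculation but not the other, which is one possible reading of the 73 versus 65 mismatch described earlier.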
I looked into the tests, starting with the first test here, which infers ancestral mutations given the reference via … The results using this PR are stochastic; I observed the following three reconstructions back-to-back-to-back:
For the same test but looking at the case where we don't supply the root-sequence we see similar stochasticity, with differences in root-sequence reconstruction observed at positions 7, 18, 33 and 43.
Thanks for digging into this. This all makes sense to me and is the expected result as far as I can tell. Providing a …
Note that sample C is a lot closer to the root (0.02) than Node AB (0.06), so in most cases the root will agree with C. And yes, we infer the actual mutations for the branch length in mutation units. This made sense from the perspective of very similar genomes, where you want branch length to correspond to the mutations in a direct, discrete way.
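The branch-length argument above can be made concrete with a small sketch. It assumes a Jukes-Cantor (JC69) substitution model, a uniform prior on the root state, and a fixed, known state at Node AB; none of these are confirmed by the thread, and the function names are hypothetical. Only the branch lengths 0.02 and 0.06 come from the comment above.

```python
import math

STATES = "ACGT"


def jc69_prob(start: str, end: str, t: float) -> float:
    """JC69 probability of observing `end` after time t (subs/site) from `start`."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if start == end else 0.25 - 0.25 * decay


def root_posterior(state_c: str, state_ab: str,
                   t_c: float = 0.02, t_ab: float = 0.06) -> dict:
    """Posterior over the root state at one site, given the two child states.

    Assumes a uniform prior on the root and treats the state at Node AB as
    known, which is a simplification for illustration.
    """
    weights = {
        root: jc69_prob(root, state_c, t_c) * jc69_prob(root, state_ab, t_ab)
        for root in STATES
    }
    total = sum(weights.values())
    return {root: w / total for root, w in weights.items()}


if __name__ == "__main__":
    # A site where sample C carries "A" and Node AB carries "G".
    for state, prob in root_posterior("A", "G").items():
        print(f"{state}: {prob:.3f}")
```

Under these assumptions the state carried by C gets the largest posterior weight but not all of it, so a reconstruction that samples the root state rather than always taking the most likely one would agree with C in most runs and differ in some, consistent with the run-to-run variation reported above.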
This addresses issue #1689