Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: explicitly specify how the root and ambiguous states are handled… #1690

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

rneher
Copy link
Member

@rneher rneher commented Nov 25, 2024

… during sequence reconstruction and mutation counting.

This addresses issue #1689

  • Automated checks pass
  • [Check][1] if you need to add a changelog message
  • [Check][2] if you need to add tests
  • [Check][3] if you need to update docs

@jameshadfield
Copy link
Member

jameshadfield commented Nov 25, 2024

Running this on a mpox clade-I build which currently assigns all root mutations to the basal branch leading to clade Ib.

This PR: Branch leading to clade Ia

  • Augur refine divergence: 13 mutations.
  • Inferred mutations from ATGC to ATGC: 14 (e.g. A2963G)
  • Gap to base mutations: 205 (e.g. -19105T)
  • Base to gap mutations: 0

This PR: Branch leading to clade Ib

  • Augur refine divergence: 52 mutations.
  • Inferred mutations from ATGC to ATGC: 59
  • Gap to base mutations: 0
  • Base to gap mutations: 677

Pre-PR behaviour

  • Same divergence from augur refine (I didn't re-run the rule)
  • All mutations are on the branch leading to clade-Ib (not good, obviously)
    • Inferred mutations from ATGC to ATGC: 73 (all on the branch leading to clade-Ib)
    • Gap to base mutations: 0
    • Base to gap mutations: 882

So it's a lot better - and perhaps this PR is correct within its remit - but something's not quite right. The total number of inferred ATGC-ATGC mutations (73) is unchanged, but this doesn't match the refined branch lengths (65). Perhaps this is due to masking being used in the refine alignment but not considered for the conversion of subs/site to number of mutations?

Secondly, and this is just for my own interest, how are the gap mutations being allocated among sister branches? I'd have thought clade IIb gets 882 * 52/(52+13) = 706 but that branch gets 677.

I've got some tests here which use small contrived (50nt-ish) genomes where we can reason with the mutations so I'll take a look at them later.

@jameshadfield
Copy link
Member

jameshadfield commented Nov 26, 2024

Perhaps this is due to masking being used in the refine alignment but not considered for the conversion of subs/site to number of mutations?

I thought this was simply the subs/site scaled by some scalar, but it's actually doing the full inference and counting mutations, and this inference is modified by this PR (I missed this).

Rerunning mpox including the refine step makes the assigned mutations match the mutation count. The allocation of gap-mutations also follow the branch-lengths.

@jameshadfield
Copy link
Member

I looked into the tests, starting with the first test here which infers ancestral mutations given the reference via --root-sequence. This is easy to reason with because it uses the toy genome I introduced here:
image

The results using this PR are stochastic, I observed the following three reconstructions back-to-back-to-back:

  • Run 1: Test succeeded (including the non-parsimonious reconstruction at pos 18 noted in the test)
  • Run 2: pos 14 inferred to be T at the root with 2 mutations needed (the parsimonious reconstruction with C at the root needs only 1). Similarly we now infer pos 7 to be G at the root (this is the same mutation pattern across the tree as pos 18).
  • Run 3: Similar to run 2 however inferring pos 7 to be G at the root but using the parsimonious reconstruction of pos 18 (i.e. it matches the root).

For the same test but looking at the case where we don't supply the root-sequence we see similar stochasticity, with differences in root-sequence reconstruction observed at pos 7, 18, 33, 43.

@rneher
Copy link
Member Author

rneher commented Nov 27, 2024

thanks for digging into this. This makes all sense to me and is the expected result as far as i can tell. providing a --root-sequence does not actually affect the reconstruction, it just decorates the branch to root with all mutations necessary to construct the inferred root sequence from the provided root sequence.

@rneher
Copy link
Member Author

rneher commented Nov 27, 2024

note that the sample C is a lot closer to the root (0.02) than Node AB (0.06). So in most cases, the root will agree with C.

and yes, we infer the actual mutations for the branch length in mutation units. this made sense from the perspective of very similar genomes, where you want branch length to correspond to the mutations in a direct discrete way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants