fix: explicitly specify how the root and ambiguous states are handled… #1690

rneher · 2024-11-25T19:53:05Z

… during sequence reconstruction and mutation counting.

This addresses issue #1689

Automated checks pass
[Check][1] if you need to add a changelog message
[Check][2] if you need to add tests
[Check][3] if you need to update docs

… during sequence reconstruction and mutation counting

jameshadfield · 2024-11-25T20:58:02Z

Running this on a mpox clade-I build which currently assigns all root mutations to the basal branch leading to clade Ib.

This PR: Branch leading to clade Ia

Augur refine divergence: 13 mutations.
Inferred mutations from ATGC to ATGC: 14 (e.g. A2963G)
Gap to base mutations: 205 (e.g. -19105T)
Base to gap mutations: 0

This PR: Branch leading to clade Ib

Augur refine divergence: 52 mutations.
Inferred mutations from ATGC to ATGC: 59
Gap to base mutations: 0
Base to gap mutations: 677

Pre-PR behaviour

Same divergence from augur refine (I didn't re-run the rule)
All mutations are on the branch leading to clade-Ib (not good, obviously)
- Inferred mutations from ATGC to ATGC: 73 (all on the branch leading to clade-Ib)
- Gap to base mutations: 0
- Base to gap mutations: 882

So it's a lot better - and perhaps this PR is correct within its remit - but something's not quite right. The total number of inferred ATGC-ATGC mutations (73) is unchanged, but this doesn't match the refined branch lengths (65). Perhaps this is due to masking being used in the refine alignment but not considered for the conversion of subs/site to number of mutations?

Secondly, and this is just for my own interest, how are the gap mutations being allocated among sister branches? I'd have thought clade IIb gets 882 * 52/(52+13) = 706 but that branch gets 677.

I've got some tests here which use small contrived (50nt-ish) genomes where we can reason with the mutations so I'll take a look at them later.

jameshadfield · 2024-11-26T22:04:13Z

Perhaps this is due to masking being used in the refine alignment but not considered for the conversion of subs/site to number of mutations?

I thought this was simply the subs/site scaled by some scalar, but it's actually doing the full inference and counting mutations, and this inference is modified by this PR (I missed this).

Rerunning mpox including the refine step makes the assigned mutations match the mutation count. The allocation of gap-mutations also follow the branch-lengths.

jameshadfield · 2024-11-26T23:36:41Z

I looked into the tests, starting with the first test here which infers ancestral mutations given the reference via --root-sequence. This is easy to reason with because it uses the toy genome I introduced here:

The results using this PR are stochastic, I observed the following three reconstructions back-to-back-to-back:

Run 1: Test succeeded (including the non-parsimonious reconstruction at pos 18 noted in the test)
Run 2: pos 14 inferred to be T at the root with 2 mutations needed (the parsimonious reconstruction with C at the root needs only 1). Similarly we now infer pos 7 to be G at the root (this is the same mutation pattern across the tree as pos 18).
Run 3: Similar to run 2 however inferring pos 7 to be G at the root but using the parsimonious reconstruction of pos 18 (i.e. it matches the root).

For the same test but looking at the case where we don't supply the root-sequence we see similar stochasticity, with differences in root-sequence reconstruction observed at pos 7, 18, 33, 43.

rneher · 2024-11-27T09:23:31Z

thanks for digging into this. This makes all sense to me and is the expected result as far as i can tell. providing a --root-sequence does not actually affect the reconstruction, it just decorates the branch to root with all mutations necessary to construct the inferred root sequence from the provided root sequence.

rneher · 2024-11-27T09:26:02Z

note that the sample C is a lot closer to the root (0.02) than Node AB (0.06). So in most cases, the root will agree with C.

and yes, we infer the actual mutations for the branch length in mutation units. this made sense from the perspective of very similar genomes, where you want branch length to correspond to the mutations in a direct discrete way.

rneher added 2 commits November 25, 2024 20:51

fix: explicitly specify how the root and ambiguous states are handled…

7f0430d

… during sequence reconstruction and mutation counting

add to changelog

5e5d664

rneher requested a review from jameshadfield November 25, 2024 20:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: explicitly specify how the root and ambiguous states are handled… #1690

fix: explicitly specify how the root and ambiguous states are handled… #1690

rneher commented Nov 25, 2024

jameshadfield commented Nov 25, 2024 •

edited

Loading

jameshadfield commented Nov 26, 2024 •

edited

Loading

jameshadfield commented Nov 26, 2024

rneher commented Nov 27, 2024

rneher commented Nov 27, 2024

fix: explicitly specify how the root and ambiguous states are handled… #1690

Are you sure you want to change the base?

fix: explicitly specify how the root and ambiguous states are handled… #1690

Conversation

rneher commented Nov 25, 2024

jameshadfield commented Nov 25, 2024 • edited Loading

jameshadfield commented Nov 26, 2024 • edited Loading

jameshadfield commented Nov 26, 2024

rneher commented Nov 27, 2024

rneher commented Nov 27, 2024

jameshadfield commented Nov 25, 2024 •

edited

Loading

jameshadfield commented Nov 26, 2024 •

edited

Loading