Issues with COAR #326

Open
psathyrella opened this issue Mar 7, 2024 · 1 comment

Comments

psathyrella commented Mar 7, 2024

  1. The alignment step doesn't always handle cases where the true and inferred lineages lack neatly corresponding sequences. For instance, here it aligns two nodes with Hamming distance 20:
    [image: coar-weirdness]
    Here are the two nodes in the true tree:
    [image: coar-weirdness-true-tree]
    and in the inferred tree:
    [image: coar-weirdness-inf-tree]
    To me, the inferred tree is clearly not claiming that these two nodes are equivalent; rather, it just has an extra node near the root. One misalignment like this, however, will completely dominate the COAR calculation, since correctly aligned sequences are only ever off by a couple of bases.

  2. Since the lineages from most leaves come together near the root, errors in sequence inference near the root are counted many times, which does not seem intuitive: if I incorrectly infer one mutation near the root, I don't think the impact of that mistake should necessarily scale with the number of leaves. For instance, here the naive sequence is off by 4, and that error is counted in the calculation for every leaf's lineage:
    [image: coar-counting]

  3. I don't think that total sequence length is the correct denominator (max penalty). In any given tree, the most that we can be wrong seems to scale with total tree depth or the number of mutations rather than with total sequence length. Using one of those instead would also give COAR values nearer to 1, whereas COAR currently comes out around 0.0003, and lots of leading zeros in plots are always confusing.
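To make points 2 and 3 concrete, here is a minimal toy sketch. It is not the COAR implementation: it assumes lineages are already aligned node-for-node (so it skips the alignment step entirely), uses two-node lineages (naive sequence plus leaf), and the function names and both normalizations are hypothetical, chosen only for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def lineage_penalty(true_lineage, inf_lineage):
    """Toy penalty: summed Hamming distance over node pairs.

    Assumption: the two lineages are already aligned node-for-node.
    """
    if len(true_lineage) != len(inf_lineage):
        raise ValueError("toy sketch needs lineages with the same number of nodes")
    return sum(hamming(t, i) for t, i in zip(true_lineage, inf_lineage))


def toy_scores(true_lineages, inf_lineages):
    """Return (total penalty, penalty / sequence length, penalty / mutations)."""
    total = sum(lineage_penalty(t, i) for t, i in zip(true_lineages, inf_lineages))
    seq_len = len(true_lineages[0][0])
    n_nodes = sum(len(t) for t in true_lineages)
    # Denominator scaling with sequence length (the kind criticized in point 3).
    by_length = total / (n_nodes * seq_len)
    # Hypothetical alternative: true root-to-leaf mutations (not bounded by 1).
    n_mut = sum(hamming(t[0], t[-1]) for t in true_lineages)
    by_mutations = total / n_mut if n_mut else 0.0
    return total, by_length, by_mutations


def make_lineages(n_leaves):
    """Toy data: naive sequence inferred with 4 errors, leaves observed exactly."""
    true_naive = "A" * 300
    inf_naive = "C" * 4 + "A" * 296
    leaf = "A" * 290 + "G" * 10  # 10 true mutations from the naive sequence
    true_lineages = [[true_naive, leaf] for _ in range(n_leaves)]
    inf_lineages = [[inf_naive, leaf] for _ in range(n_leaves)]
    return true_lineages, inf_lineages


for n_leaves in (5, 50):
    total, by_length, by_mutations = toy_scores(*make_lineages(n_leaves))
    print(n_leaves, total, round(by_length, 5), round(by_mutations, 3))
```

In this toy setup, the same 4-base error in the naive sequence contributes 4 to the total once per leaf, so the total grows linearly with the number of leaves. The length-normalized value is tiny (lots of leading zeros), while the mutation-normalized value lands much nearer to 1.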

While 3. is potentially worth implementing, 1. and 2. are more inherent to the metric, and they make me more reluctant to rely on COAR as a final metric.

My guess is that what we want COAR to do is measure how accurately the order of mutations from the root is inferred. But in practice, I don't think that looking at just the handful of inferred ancestral sequences really does this. I think we could compare the order of inferred and true mutations (even without keeping track of the full ordered list of mutations in simulation), but I'm not sure if it's worthwhile.
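One way such an order comparison might look is sketched below. This is purely hypothetical, not something that exists in the code: it reads the mutation order off the ancestral sequences along a single root-to-leaf lineage, then reports the fraction of shared mutation pairs whose relative order agrees between the true and inferred lineages. All names are made up for illustration, and skipping tied pairs is an arbitrary choice.

```python
from itertools import combinations


def mutation_order(lineage):
    """Map each mutation (position, parent base, child base) to the index of
    the lineage step (root-to-leaf) at which it first appears."""
    order = {}
    for step, (parent, child) in enumerate(zip(lineage, lineage[1:])):
        for pos, (p, c) in enumerate(zip(parent, child)):
            if p != c:
                order.setdefault((pos, p, c), step)
    return order


def order_concordance(true_lineage, inf_lineage):
    """Fraction of shared mutation pairs ordered the same way in both lineages.

    Pairs that occur on the same step in either lineage are skipped, so an
    extra or missing intermediate node does not by itself count as an error.
    Returns None if no pair is comparable.
    """
    true_order = mutation_order(true_lineage)
    inf_order = mutation_order(inf_lineage)
    shared = sorted(set(true_order) & set(inf_order))
    concordant = comparable = 0
    for a, b in combinations(shared, 2):
        d_true = true_order[a] - true_order[b]
        d_inf = inf_order[a] - inf_order[b]
        if d_true == 0 or d_inf == 0:
            continue  # tied in one of the lineages: order is undetermined
        comparable += 1
        concordant += (d_true > 0) == (d_inf > 0)
    return concordant / comparable if comparable else None


true_lineage = ["AAAA", "CAAA", "CCAA", "CCCA"]
swapped = ["AAAA", "ACAA", "CCAA", "CCCA"]   # first two mutations in the wrong order
fewer_nodes = ["AAAA", "CCAA", "CCCA"]        # one intermediate node missing

print(order_concordance(true_lineage, true_lineage))
print(order_concordance(true_lineage, swapped))
print(order_concordance(true_lineage, fewer_nodes))
```

Because this compares mutations rather than nodes, it does not need a node-to-node alignment, so a lineage with an extra or missing node is not penalized by a forced pairing of non-equivalent nodes. It would still need some rule for combining lineages across leaves without recounting near-root mutations.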

@psathyrella

Attaching the COAR definition:
Davidsen and Matsen 2018 - coar-defn.pdf
