Tree Sequences with Indels #1775
-
Has any work been done adapting tree sequences to work with indels? I'm working on a coding-sequence aligner and I am exploring whether tree sequences are a useful way to record multiple-sequence alignments. While it is trivial to record the indels as occurring at specific positions on specific branches, I am not sure the best way to record the alignment of homologous positions, or even if it is necessary. So before I go to far with this, I wanted to know if anyone has thought about this problem before. |
Beta Was this translation helpful? Give feedback.
Replies: 8 comments 11 replies
-
Great to hear you're thinking about this @reedacartwright! Short answer is "no". Currently sites are independent in tskit, and we don't try to reason about interference between sites at all. So, if you have sites at position 10 and 11 and there is an "AA" alleles at 10, this has no effect on the variation we report at site 11. This is clearly wrong, but that's how it is at the moment. There has been some background work on thinking about how coordinate systems evolve along the trees over time, as an alternative way of thinking about "graph genomes". Other than some proof of concept ideas, nothing much has happened. |
Beta Was this translation helpful? Give feedback.
-
With some creative thinking, I think alignments could be stored. If one thinks of aligned sequences as differences from a consensus, then a mutation's derived state can represent indels. This way of thinking lets you circumvent the independence issues @jeromekelleher noted. However, it is a bit outside the box, and you could run into problems down the road. But it is worth some experimenting, I'd say. |
Beta Was this translation helpful? Give feedback.
-
Just jumping in here, but it would presumably be fairly easy to check if the length of the longest allele associated with a site is shorter than the distance to the next (variable) site. If so, we can be sure there is no interference between sites. It might be worth having a function that checks if this is so. |
Beta Was this translation helpful? Give feedback.
-
@reedacartwright - is DAWG still the go-to simulator for creating sequences with indels? Are there others that people use which include duplications, inversions, etc, e.g. to test graph genome methods? |
Beta Was this translation helpful? Give feedback.
-
Here are some thoughts I came up with a while back, but never posted them: At the lowest level, a sequence is made up of fragments, where each fragment comes from a specific parent. If you add indels and rearrangements to the model, fragments can change length, orientation, and coordinates. The question is how do you record this information efficiently. Consider the following alignment of two sister species and their MRCA:
This can be represented as two coordinate systems: An "A" sequence of length 5 and a "G" sequence of length 2. Any SNP, MNP, or deletion that affects A or G can be passed on to descendants without updating coordinates. Insertions on the other hand insert a new fragment into the sequence and get passed on to descendants via the fragment structure of the sequence. |
Beta Was this translation helpful? Give feedback.
-
I wrote some thoughts about non-overlapping indels at tskit-dev/tsinfer#893 (comment). Basically, I think that for an indel of size 3 (say The It could be that when we import from a VCF, e.g. using tsinfer, we create a tree sequence with a non-variable site at position 10 containing the REF as the ancestral state, and a variable site with a blank allele at position 11. That might seem a bit of a hack, however. |
Beta Was this translation helpful? Give feedback.
-
As always, the "VCF way" has plenty of quirks that make life even more difficult: for example, here are some real variants from a WGS VCF I've worked with, which was called using the standard GATK 4.0 pipeline:
So we'd ideally want to enforce that the VCF has been normalised before inferrence to ensure that there is a maximum of one nucleotide shared between REF and ALT. I agree that it would be better to trim all the shared nucleotides in indel calls entirely; another solution to the |
Beta Was this translation helpful? Give feedback.
-
That's a fair point. Come to think of it, have we tested how well the REF and ALT columns that
A quick search yielded lots of examples of problematic VCFs with N in the REF column, so it seems to be at least allowed by the spec. It would certainly render that variant useless for any further analysis though, so it might be better to just remove the variant in that case. |
Beta Was this translation helpful? Give feedback.
Great to hear you're thinking about this @reedacartwright! Short answer is "no". Currently sites are independent in tskit, and we don't try to reason about interference between sites at all. So, if you have sites at position 10 and 11 and there is an "AA" alleles at 10, this has no effect on the variation we report at site 11. This is clearly wrong, but that's how it is at the moment.
There has been some background work on thinking about how coordinate systems evolve along the trees over time, as an alternative way of thinking about "graph genomes". Other than some proof of concept ideas, nothing much has happened.