Tree Sequences with Indels #1775

reedacartwright · 2021-10-09T04:00:04Z

reedacartwright
Oct 9, 2021

Has any work been done adapting tree sequences to work with indels?

I'm working on a coding-sequence aligner and I am exploring whether tree sequences are a useful way to record multiple-sequence alignments. While it is trivial to record the indels as occurring at specific positions on specific branches, I am not sure of the best way to record the alignment of homologous positions, or even whether it is necessary.

So before I go too far with this, I wanted to know if anyone has thought about this problem before.

Answered by jeromekelleher

jeromekelleher · 2021-10-09T09:05:35Z

jeromekelleher
Oct 9, 2021
Maintainer

Great to hear you're thinking about this @reedacartwright! Short answer is "no". Currently sites are independent in tskit, and we don't try to reason about interference between sites at all. So, if you have sites at positions 10 and 11 and there is an "AA" allele at 10, this has no effect on the variation we report at site 11. This is clearly wrong, but that's how it is at the moment.

There has been some background work on thinking about how coordinate systems evolve along the trees over time, as an alternative way of thinking about "graph genomes". Other than some proof of concept ideas, nothing much has happened.

0 replies

molpopgen · 2021-10-11T16:52:03Z

molpopgen
Oct 11, 2021
Maintainer

With some creative thinking, I think alignments could be stored. If one thinks of aligned sequences as differences from a consensus, then a mutation's derived state can represent indels. This way of thinking lets you circumvent the independence issues @jeromekelleher noted. However, it is a bit outside the box, and you could run into problems down the road. But it is worth some experimenting, I'd say.

0 replies

hyanwong · 2021-11-11T10:26:35Z

hyanwong
Nov 11, 2021
Collaborator

Just jumping in here, but it would presumably be fairly easy to check if the length of the longest allele associated with a site is shorter than the distance to the next (variable) site. If so, we can be sure there is no interference between sites. It might be worth having a function that checks if this is so.
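A minimal sketch of such a check, written against plain (position, alleles) pairs rather than any existing tskit function. The name `sites_are_independent` is hypothetical, and "shorter than" is read here as "no longer than", so that single-base alleles at adjacent positions still pass.

```python
def sites_are_independent(sites):
    """Return True if no allele at a site reaches the next site.

    `sites` is an iterable of (position, alleles) pairs sorted by position,
    where `alleles` holds every allelic state seen at that site.
    """
    sites = list(sites)
    for (pos, alleles), (next_pos, _) in zip(sites, sites[1:]):
        longest = max((len(allele) for allele in alleles), default=0)
        # An allele of length n starting at `pos` covers pos .. pos + n - 1.
        if longest > next_pos - pos:
            return False
    return True


print(sites_are_independent([(10, ["A", "AA"]), (11, ["C", "T"])]))    # False
print(sites_are_independent([(10, ["A", "ATTG"]), (14, ["C", "T"])]))  # True
```

With tskit, the pairs could presumably be built from `ts.sites()`, taking each site's `position`, its `ancestral_state`, and the `derived_state` of each of its mutations.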

2 replies

jeromekelleher Nov 11, 2021
Maintainer

In that case it's straightforward yeah

hyanwong Jan 8, 2022
Collaborator

Probably worth having a documentation section on indels etc in the docs, probably in this section

hyanwong · 2022-01-07T13:41:07Z

hyanwong
Jan 7, 2022
Collaborator

@reedacartwright - is DAWG still the go-to simulator for creating sequences with indels? Are there others that people use which include duplications, inversions, etc, e.g. to test graph genome methods?

3 replies

reedacartwright Jan 7, 2022
Author

It's my go-to simulator. I think other programs are more popular, mostly because I've never written a paper on Dawg 2.0 as other research has gotten in the way of Dawg development over the last 10 years.

hyanwong Jan 8, 2022
Collaborator

Thanks! What alternatives are there out there? It would be lovely to push work on the coordinate system / graph genome stuff from a tskit perspective somehow, and I think the way in might be via simulation. I think something that simulates inversions is important here, because that will really screw with the coordinates!

reedacartwright Jan 12, 2022
Author

Honestly, I haven't paid attention to developments over the last several years.

There's a lot of diversity out there in how the internal data structures are handled. So I'm not sure one size will fit all.

reedacartwright · 2022-01-12T21:47:06Z

reedacartwright
Jan 12, 2022
Author

Here are some thoughts I came up with a while back but never posted:

At the lowest level, a sequence is made up of fragments, where each fragment comes from a specific parent. If you add indels and rearrangements to the model, fragments can change length, orientation, and coordinates. The question is how to record this information efficiently. Consider the following alignment of two sister species and their MRCA:

Tree: (H,C)R;

H: A-A--AA
C: AAAGGAA
R: AAA--AA

This can be represented as two coordinate systems: an "A" sequence of length 5 and a "G" sequence of length 2. Any SNP, MNP, or deletion that affects A or G can be passed on to descendants without updating coordinates. Insertions, on the other hand, add a new fragment to the sequence and get passed on to descendants via the fragment structure of the sequence.
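One way to picture the fragment idea is the sketch below. It is only an illustration: the `Fragment` class, the `SYSTEMS` table, and `aligned_row` are hypothetical names, and using C as the column layout works here only because C happens to contain every column of the alignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    system: str  # name of the coordinate system, "A" or "G" here
    start: int   # half-open interval [start, end) within that system
    end: int


# Hypothetical content of the two coordinate systems.
SYSTEMS = {"A": "AAAAA", "G": "GG"}

# Each sequence is an ordered list of fragments.
R = [Fragment("A", 0, 5)]                                            # root
C = [Fragment("A", 0, 3), Fragment("G", 0, 2), Fragment("A", 3, 5)]  # insertion
H = [Fragment("A", 0, 1), Fragment("A", 2, 5)]                       # deletion


def aligned_row(fragments, layout):
    """Render a sequence against `layout`, gapping the columns it lacks."""
    covered = {(f.system, i) for f in fragments for i in range(f.start, f.end)}
    return "".join(
        SYSTEMS[f.system][i] if (f.system, i) in covered else "-"
        for f in layout
        for i in range(f.start, f.end)
    )


for name, seq in [("H", H), ("C", C), ("R", R)]:
    print(f"{name}: {aligned_row(seq, layout=C)}")
# H: A-A--AA
# C: AAAGGAA
# R: AAA--AA
```

The deletion in H and the insertion in C only change the fragment lists; positions inside the "A" and "G" systems never need to be renumbered.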

0 replies

hyanwong · 2024-02-07T09:34:55Z

hyanwong
Feb 7, 2024
Collaborator

I wrote some thoughts about non-overlapping indels at tskit-dev/tsinfer#893 (comment). Basically, I think that for an indel of size 3 (say TTG), we should allow a tree sequence to encode it in two ways: either the "VCF way", with site.position=10, site.ancestral_state="A", site.mutations[0].derived_state="ATTG", or the more logical way, with site.position=11, site.ancestral_state="", site.mutations[0].derived_state="TTG". This second method would allow a SNP at position 10 as well as an indel at position 11.
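The relationship between the two encodings can be sketched as a small conversion. The function `vcf_to_logical` is a hypothetical helper, not part of tskit or tsinfer, and it assumes the record has already been reduced to a single shared leading base.

```python
def vcf_to_logical(position, ancestral_state, derived_state):
    """Drop the shared leading base from a VCF-style indel record."""
    is_indel = len(ancestral_state) != len(derived_state)
    if (
        is_indel
        and ancestral_state
        and derived_state
        and ancestral_state[0] == derived_state[0]
    ):
        # The shared base stays behind at `position`; the indel itself
        # starts one position later.
        return position + 1, ancestral_state[1:], derived_state[1:]
    return position, ancestral_state, derived_state


print(vcf_to_logical(10, "A", "ATTG"))  # (11, '', 'TTG')
print(vcf_to_logical(10, "A", "T"))     # (10, 'A', 'T')
```

SNPs pass through unchanged, so position 10 stays free to carry its own variation.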

The ts.write_vcf() function should detect the second case (when there are empty allele states) and translate them into a VCF with a pos of site.position-1. It is unclear to me, however, what to use as the REF letter that is prepended to the allele string, unless we have an existing reference sequence or an existing (presumably non-variable) site at position 10.
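A rough sketch of that translation follows. It is not what ts.write_vcf() does today: `logical_to_vcf` and its `reference` argument are hypothetical, the "N" placeholder used when no reference is available is an assumption of this sketch, and the shift between tskit's 0-based and VCF's 1-based positions is ignored.

```python
def logical_to_vcf(position, ancestral_state, derived_state, reference=None):
    """Pad an empty-allele indel with the preceding base for VCF output.

    `reference` is an optional string indexed in the same coordinates as
    `position`; without it the padding base is written as "N".
    """
    if ancestral_state != "" and derived_state != "":
        return position, ancestral_state, derived_state  # no padding needed
    if position == 0:
        # There is no preceding base at the very start of the sequence.
        raise ValueError("no preceding base at the start of the sequence")
    pad = "N" if reference is None else reference[position - 1]
    return position - 1, pad + ancestral_state, pad + derived_state


reference = "C" * 10 + "A" + "C" * 5  # hypothetical reference, "A" at index 10
print(logical_to_vcf(11, "", "TTG", reference))  # (10, 'A', 'ATTG')
print(logical_to_vcf(11, "", "TTG"))             # (10, 'N', 'NTTG')
```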

It could be that when we import from a VCF, e.g. using tsinfer, we create a tree sequence with a non-variable site at position 10 containing the REF as the ancestral state, and a variable site with a blank allele at position 11. That might seem a bit of a hack, however.

0 replies

duncanMR · 2024-02-07T11:10:40Z

duncanMR
Feb 7, 2024

As always, the "VCF way" has plenty of quirks that make life even more difficult: for example, here are some real variants from a WGS VCF I've worked with, which was called using the standard GATK 4.0 pipeline:

CHROM	POS	REF	ALT
chr1	159668874	TACAC	TAC
chr2	47905960	CTTTTTTTT	CTTTTTT

So we'd ideally want to enforce that the VCF has been normalised before inference, to ensure that at most one nucleotide is shared between REF and ALT.
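As a rough illustration of the trimming involved, here is a simplified sketch. `trim_alleles` is a hypothetical helper that only removes redundant shared bases; it does not left-align the variant against a reference, which full normalisation tools also do.

```python
def trim_alleles(pos, ref, alt):
    """Trim shared bases so REF and ALT share at most one leading base."""
    # Trim the shared suffix first, always leaving at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim any shared prefix, shifting the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


print(trim_alleles(159668874, "TACAC", "TAC"))        # (159668874, 'TAC', 'T')
print(trim_alleles(47905960, "CTTTTTTTT", "CTTTTTT"))  # (47905960, 'CTT', 'C')
```

Both records above reduce to a two-base deletion with a single shared leading base.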

I agree that it would be better to trim all the shared nucleotides in indel calls entirely; another solution to the ts.write_vcf() problem would be to just dump the original REF sequence of the indel into the site metadata when we process it. It shouldn't add too much overhead, since we need to add the length of the indel to the site metadata anyway to ensure that there are no overlaps.

2 replies

hyanwong Feb 7, 2024
Collaborator

I agree about normalisation. Metadata is indeed another possibility for storing the shared letter that is required for VCF output. However, I'd worry that if we have a whole set of possible sources for the REF, we would have to work out an order of precedence in case of conflicts, which can get very messy. So I think I'd prefer to use an existing way of encoding reference letters.

I guess if we don't have a reference sequence, we could also put an N (indicating that the REF is unknown). I suspect that might break the VCF spec, though?

hyanwong Feb 7, 2024
Collaborator

Also note that the VCF spec states a specific exclusion when the indel is at the start of the sequence. In this case, the VCF output should include the reference base at the end of the indel instead. Sigh.

duncanMR · 2024-02-07T11:53:15Z

duncanMR
Feb 7, 2024

> I agree about normalisation. Metadata is indeed another possibility for storing the shared letter that is required for VCF output. However, I'd worry that if we have a whole set of possible sources for the REF, we would have to work out an order of precedence in case of conflicts, which can get very messy. So I think I'd prefer to use an existing way of encoding reference letters.

That's a fair point. Come to think of it, have we tested how well the REF and ALT columns that ts.write_vcf() outputs match the source VCF used for inference (putting aside the indel problem)?

> I guess if we don't have a reference sequence, we could also put an N (indicating that the REF is unknown). I suspect that might break the VCF spec, though?

A quick search yielded lots of examples of problematic VCFs with N in the REF column, so it seems to be at least allowed by the spec. It would certainly render that variant useless for any further analysis though, so it might be better to just remove the variant in that case.

4 replies

hyanwong Feb 7, 2024
Collaborator

Re: REF and ALT, it will only match when the ref is the ancestral allele.

Re N in ref for indels, this would be alleles like REF="N", ALT="NCCT". Would that render the variant useless downstream?

duncanMR Feb 7, 2024

I just tried replacing a medically significant REF:GA -> ALT:G variant in a VCF with NA -> N, keeping every other field the same. The variant annotation tool I used failed to find the latter in any variant databases, so it would be missed if the variants were prioritised. So it didn't break the pipeline, but it would be very difficult to do anything useful with it.

Come to think of it, is it even possible for a SNP to be called immediately adjacent to an indel? I would have thought that variant callers would lump them together into a single indel, but I'm not sure. If that's the case, we might as well save ourselves the hassle and just use the VCF way of encoding them.

hyanwong Feb 7, 2024
Collaborator

I don't see why you couldn't have a SNP next to an indel. You would then have 4 combinations: (SNP = A/T, indel = "GGG"/"") =>

alignments

"A---"
"AGGG"
"T---"
"TGGG"

You could certainly get this in simulated data, even if it would be hard to check in a VCF. There is presumably a reason why the VCF spec specifies the REF as the first base, in case you could also have an ALT there.
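The four combinations listed above can be enumerated directly. This is just a small illustration, with the SNP and indel states taken from the example and the variable names chosen for this sketch.

```python
from itertools import product

snp_alleles = ["A", "T"]
indel_alleles = ["", "GGG"]

# Pad the shorter indel allele with gap characters so the rows line up.
width = max(len(allele) for allele in indel_alleles)
rows = [
    snp + indel.ljust(width, "-")
    for snp, indel in product(snp_alleles, indel_alleles)
]
print(rows)  # ['A---', 'AGGG', 'T---', 'TGGG']
```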

I am quite keen to encode variation in a tree sequence in the "logical" way, rather than be hamstrung by weird VCF requirements.

duncanMR Feb 7, 2024

Fair point. Of the options we've discussed, I'm most inclined toward your idea of inserting Ns into the VCF output; if someone intends to do downstream analysis of a generated VCF, they will probably need to use tskit's reference sequence feature anyway to get a sensible output.
