Indel SNPs rejected due to mismapping issue #68

Open
GerardManning opened this issue Apr 6, 2020 · 0 comments


Ancestry and 23andme appear to map indel SNPs to discrepant positions, causing them to be discarded by snps and sometimes causing the program to fail.

Indel SNPs are positions where the allele is named D (deletion) or I (insertion). They can be deletions or insertions of several bases, or more complex multi-nucleotide changes (so technically, not all are actually SNPs). In my test case, Ancestry covers 8073 indel SNPs and 23andme (v5) has 4828, of which 2143 are found on both platforms (by rs#). However, only 1328 of the 2143 report the same chromosomal location, so the remaining 815 are rejected as discrepant.

Most of the location discrepancies are tiny: 475 of them are 1 base, 174 are 2, 50 are 3, 48 are 4, 33 are 5, 16 are 6, and a further 18 are 17 or fewer. In almost all cases, the difference equals the length of the indel, and the Ancestry location is less than the 23andme location. This seems to be due to Ancestry reporting the position where the deletion starts and 23andme reporting the position where it ends, but there are some additional odd cases, and neither Ancestry nor 23andme is always consistent with dbSNP.
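For anyone who wants to reproduce the tally on their own files, here is a minimal sketch of the comparison. It assumes each file has already been loaded into a plain dict of rs# to (chromosome, position, genotype); the dict layout, the function names, and the sample records are all made up for illustration and are not part of the program.

```python
from collections import Counter

INDEL_ALLELES = {"D", "I"}


def is_indel(genotype):
    """True if any allele in the genotype call is D or I."""
    return any(allele in INDEL_ALLELES for allele in genotype)


def offset_histogram(ancestry, twentythree):
    """Tally position offsets (23andme minus Ancestry) for shared indel rs#s.

    Both arguments are hypothetical dicts: rsid -> (chrom, pos, genotype).
    An offset of 0 means the two platforms agree on the location.
    """
    offsets = Counter()
    for rsid, (chrom_a, pos_a, gt_a) in ancestry.items():
        if rsid not in twentythree:
            continue  # not shared between the two platforms
        chrom_b, pos_b, gt_b = twentythree[rsid]
        if not (is_indel(gt_a) or is_indel(gt_b)):
            continue  # ordinary SNP, not of interest here
        if chrom_a != chrom_b:
            offsets["different chromosome"] += 1
        else:
            offsets[pos_b - pos_a] += 1
    return offsets


# Invented records, only to show the shape of the input.
ancestry = {
    "rs1": ("1", 100, "DD"),
    "rs2": ("1", 200, "II"),
    "rs3": ("2", 300, "AG"),
    "rs4": ("3", 400, "DI"),
}
twentythree = {
    "rs1": ("1", 101, "DD"),
    "rs2": ("1", 200, "II"),
    "rs3": ("2", 300, "AG"),
    "rs5": ("3", 1, "DD"),
}
hist = offset_histogram(ancestry, twentythree)
```

A positive key in the result means the 23andme position is the larger one, which is the pattern described above.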

In practical terms, I'd like to suggest two possible solutions:

  1. Loosen the requirement that merged SNPs have an identical location, and for an Ancestry/23andme merge, retain the 23andme location. Having the same rs# and being within 20 bases of each other would seem a reasonably strict criterion.

  2. Alternatively, just ignore all discrepant indels. The vast majority of them are homozygous in most people (i.e. the variant alleles are rare and usually disease-associated), so they are not very informative for genealogy.
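The two suggestions could be combined roughly as in the sketch below. This is not how the program currently works; the function name, the record layout of (chromosome, position, genotype), and the choice to apply the tolerance only to indels are assumptions made for illustration. The 20-base window is the figure proposed in option 1, and returning None corresponds to ignoring the SNP as in option 2.

```python
MAX_INDEL_OFFSET = 20  # tolerance in bases, as proposed in option 1


def is_indel(genotype):
    """True if any allele in the genotype call is D or I."""
    return any(allele in ("D", "I") for allele in genotype)


def reconcile_position(rec_ancestry, rec_23andme, max_offset=MAX_INDEL_OFFSET):
    """Pick the location to keep for one shared rs#, or None to skip it.

    Each record is a hypothetical (chrom, pos, genotype) tuple.
    """
    chrom_a, pos_a, gt_a = rec_ancestry
    chrom_b, pos_b, gt_b = rec_23andme
    if chrom_a != chrom_b:
        return None  # different chromosomes are never reconciled
    if pos_a == pos_b:
        return chrom_b, pos_b  # current behaviour: exact match
    # Assumption: only indels get the relaxed window.
    if (is_indel(gt_a) or is_indel(gt_b)) and abs(pos_a - pos_b) <= max_offset:
        return chrom_b, pos_b  # option 1: retain the 23andme location
    return None  # option 2: ignore the discrepant SNP


near = reconcile_position(("1", 100, "DD"), ("1", 103, "DD"))
far = reconcile_position(("1", 100, "DD"), ("1", 121, "DD"))
snp = reconcile_position(("1", 100, "AG"), ("1", 101, "AG"))
```

Here `near` keeps the 23andme position, while `far` (21 bases apart) and `snp` (not an indel) are both skipped rather than treated as fatal.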

I also wanted to note that raising the parameter discrepant_genotypes_threshold to 1000000 did not allow my two genotype files to merge; the merge succeeded only after I manually deleted all indel SNPs from the two input files. I haven't looked further into this issue, but it may be worth checking out.
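The manual workaround can be scripted. The sketch below assumes tab-separated raw files with rsid, chromosome, and position in the first three columns and the genotype in the remaining column(s) (one combined column for 23andme, two allele columns for Ancestry), plus `#` comment lines; the function name and sample lines are invented, and real exports should be checked against that layout before relying on it.

```python
def strip_indel_rows(lines):
    """Drop data rows whose allele columns contain D or I.

    Comment lines (starting with '#') and rows with fewer than four
    tab-separated fields are passed through unchanged.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 4:
            kept.append(line)
            continue
        alleles = "".join(fields[3:])  # covers one or two allele columns
        if "D" in alleles or "I" in alleles:
            continue  # indel call: remove the row
        kept.append(line)
    return kept


# Invented lines in the assumed layouts.
sample = [
    "# comment line\n",
    "rsid\tchromosome\tposition\tallele1\tallele2\n",
    "rs1\t1\t100\tD\tD\n",
    "rs2\t1\t200\tA\tG\n",
    "rs3\t2\t300\tII\n",
    "rs4\t2\t400\tCT\n",
]
cleaned = strip_indel_rows(sample)
```

The lowercase header row survives because the check looks only for uppercase D and I.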
