Genome Nexus sometimes annotates SNV as DNP #32

Open
thomasyu888 opened this issue Jan 16, 2021 · 5 comments

Comments

@thomasyu888

thomasyu888 commented Jan 16, 2021

  • input: input.txt

  • Intermediate files: annotation-tools intermediate files. I had to add .txt to the end of each file name or GitHub won't allow me to upload them. My understanding is that input.txt.temp.annotated.txt is the output from Genome Nexus. Because annotation-tools allows us to pass a directory containing a list of MAFs or VCFs, it annotates each of those files separately; processed.txt is all of them merged.
    input.txt.temp.annotated.txt
    input.txt.temp.txt

  • Processed:
    processed.txt

@thomasyu888 thomasyu888 changed the title Genome Nexus sometimes annotates SNP as DNP Genome Nexus sometimes annotates SNV as DNP Jan 16, 2021
@inodb inodb self-assigned this Jan 22, 2021
@n1zea144 n1zea144 assigned thomasyu888 and ao508 and unassigned inodb and ao508 Jan 22, 2021
@thomasyu888
Author

thomasyu888 commented Jan 23, 2021

At first glance, input.txt.temp.txt shows that Variant_Type is being filled in: every record is annotated as DNP except one. In processed.txt, however, they are all annotated as DNP, which suggests that Genome Nexus also does some correcting of its own.

@as1000

as1000 commented Feb 10, 2021

It looks like something is going on related to the reference alleles.

They are changing between the input.txt file and the input.txt.temp.annotated.txt file.

The Variant_Type annotation seems to be based on the reference allele in input.txt (the old reference) rather than the reference allele in input.txt.temp.annotated.txt (the new reference).

@averyniceday
Contributor

@thomasyu888 could I get some additional details on this issue? Are there specific records inside the uploaded files that I can use as an example to trace the problem?

@thomasyu888
Author

Thanks for looking into this, @averyniceday. If you download input.txt, you will see that there is no Variant_Type column, which means that annotation-tools and Genome Nexus are filling in this column. input.txt is the input that we receive from certain centers.

As stated in this comment, there seem to be a couple of things going on:

  • input.txt.temp.txt is the intermediate file that annotation-tools generates; it has a Variant_Type column in which every record is DNP except one ONP.
  • In input.txt.temp.annotated.txt and processed.txt, every Variant_Type is DNP.

Given these two observations, my hypothesis is as follows:

  • annotation-tools looks at the Ref/alt alleles and fills out Variant_Type to the best of its ability
  • Genome Nexus re-annotates Variant_Type if deemed incorrect.
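A minimal sketch of the first step in this hypothesis, assuming Variant_Type is derived purely from the lengths of the reference and alternate alleles using the standard MAF labels (SNP, DNP, TNP, ONP, INS, DEL). The function name `infer_variant_type` is hypothetical and this is not the actual annotation-tools code; it only illustrates why a two-base Reference_Allele paired with a two-base Tumor_Seq_Allele2 would be labelled DNP without any check against the genome.

```python
def infer_variant_type(ref: str, alt: str) -> str:
    """Guess a MAF-style Variant_Type from allele lengths alone.

    Hypothetical helper: it never looks at the reference genome, so a
    wrong Reference_Allele silently produces a wrong Variant_Type.
    """
    # MAF files use "-" for an empty allele (pure insertion or deletion).
    ref = "" if ref == "-" else ref
    alt = "" if alt == "-" else alt
    if not ref and not alt:
        raise ValueError("both alleles are empty")
    if len(ref) < len(alt):
        return "INS"
    if len(ref) > len(alt):
        return "DEL"
    # Equal lengths: a substitution named by the number of bases replaced.
    return {1: "SNP", 2: "DNP", 3: "TNP"}.get(len(ref), "ONP")


print(infer_variant_type("CC", "TT"))          # DNP
print(infer_variant_type("GAAATA", "GAAAAA"))  # ONP
print(infer_variant_type("C", "T"))            # SNP
```

With the alleles as provided in input.txt, this length-only rule yields DNP for the two-base records and ONP for the single six-base record, which matches what is seen in input.txt.temp.txt.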

So in this scenario, our collaborator is saying these variants should be annotated with SNV instead of DNP.

@sheridancbio
Contributor

sheridancbio commented Mar 29, 2021

I have looked over input.txt and queried every variant position in the UCSC Genome Browser to extract the latest/final version of the hg19 nucleotide sequence at each listed position. In every case, the provided Reference_Allele column disagrees with the hg19 sequence extracted from the UCSC browser. Most cases follow one of these two patterns:

Ref_allele=CC, UCSC_hg19=TC, Tumor_seq_allele2=TT (34 cases)

Ref_allele=GG, UCSC_hg19=AG, Tumor_seq_allele2=AA (19 cases)

But in all cases, an SNP is seen when comparing Tumor_Seq_Allele2 to the UCSC hg19 allele, and a DNP (or in one case a 6-nt replacement) is seen when comparing Tumor_Seq_Allele2 to the provided Reference_Allele (full details below).
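The two dominant patterns above can be reproduced with a small sketch that trims bases shared by the reference and tumor alleles before measuring the change. The helper name `minimal_change` is hypothetical, and the trimming rule (shared prefix first, then shared suffix) is an assumption about how the comparison is normalized, not the behaviour of Genome Nexus or VEP.

```python
def minimal_change(start: int, ref: str, alt: str):
    """Trim bases shared by ref and alt; return (start, ref, alt) of what differs."""
    # Drop the shared prefix, moving the start position forward each time.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1
    # Drop the shared suffix; the start position is unaffected.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return start, ref, alt


# Pattern 1 (chr5:138163255): provided ref CC, hg19 TC, tumor TT
print(minimal_change(138163255, "CC", "TT"))  # (138163255, 'CC', 'TT')
print(minimal_change(138163255, "TC", "TT"))  # (138163256, 'C', 'T')

# Pattern 2 (chr19:10610252): provided ref GG, hg19 AG, tumor AA
print(minimal_change(10610252, "GG", "AA"))   # (10610252, 'GG', 'AA')
print(minimal_change(10610252, "AG", "AA"))   # (10610253, 'G', 'A')
```

Against the provided Reference_Allele nothing can be trimmed, so two bases change (DNP). Against the hg19 sequence the first base is shared, leaving a single-base change one position downstream (SNP).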

We are confirming that all of these cases should have been returned with a "failure to annotate" message, possibly with a note that the provided Reference_Allele was not consistent with the reference genome used by our installation of VEP (which seems to use the latest/final version of the hg19/GRCh37 assembly). Instead, the queries to VEP were sent in a form that did not provide the reference sequence explicitly; the format specifies only the genomic position range that is deleted and the nucleotides that replace the deleted region.
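To illustrate the point, here is a hedged sketch of a query form that carries no reference bases, alongside the kind of check that would have produced a "failure to annotate" result. HGVS genomic `delins` notation has exactly the property described (a deleted range plus replacement bases), but whether it is the exact string sent to VEP is an assumption, and the function names `delins_query` and `check_reference` are hypothetical. The genome sequence is passed in by the caller; looking it up is out of scope here.

```python
from typing import Optional


def delins_query(chrom: str, start: int, ref_len: int, alt: str) -> str:
    """Build an HGVS-style genomic delins string; the ref bases never appear."""
    end = start + ref_len - 1
    span = str(start) if ref_len == 1 else f"{start}_{end}"
    return f"{chrom}:g.{span}delins{alt}"


def check_reference(provided_ref: str, genome_ref: str) -> Optional[str]:
    """Return a failure message if the provided allele disagrees with the genome."""
    if provided_ref.upper() != genome_ref.upper():
        return (
            "failure to annotate: provided Reference_Allele "
            f"{provided_ref} does not match reference genome sequence {genome_ref}"
        )
    return None


# Row chr5:138163255 from the table: provided CC, hg19 TC, tumor TT.
print(delins_query("5", 138163255, 2, "TT"))  # 5:g.138163255_138163256delinsTT
print(check_reference("CC", "TC"))
```

Because the query string is identical whether the submitter believed the reference was CC or TC, the mismatch cannot be detected downstream unless a check like `check_reference` runs before the query is built.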

Tumor_Sample_Barcode	Chromosome	Start_Position	Reference_Allele	UCSC_browser_hg19	Tumor_Seq_Allele2
SAGE-1	5	112175211	GAAATA	TAAAAG	GAAAAA
SAGE-1	5	138163255	CC	TC	TT
SAGE-1	2	29436868	CC	TC	TT
SAGE-1	13	32937622	CC	TC	TT
SAGE-1	15	93528897	CC	TC	TT
SAGE-1	19	10610252	GG	AG	AA
SAGE-1	7	140453135	AC	CA	CT
SAGE-1	11	69514130	CC	TC	TT
SAGE-1	3	37089129	AA	GA	GC
SAGE-1	3	41266100	CT	TC	TA
SAGE-1	17	37865595	CC	TC	TT
SAGE-1	2	141214108	CC	TC	TT
SAGE-1	5	176710898	GG	AG	AA
SAGE-1	5	149502647	CC	TC	TT
SAGE-1	16	2121897	CC	TC	TT
SAGE-1	X	129148054	CC	TC	TT
SAGE-1	4	66467840	CC	TC	TT
SAGE-1	17	37883607	CC	TC	TT
SAGE-1	5	1295227	GG	AG	AA
SAGE-1	7	142651409	CC	TC	TT
SAGE-1	12	49425423	GG	AG	AA
SAGE-1	X	47422632	CC	TC	TT
SAGE-1	3	181430929	CC	TC	TT
SAGE-1	7	140453134	CA	TC	TT
SAGE-1	7	81372706	GG	AG	AA
SAGE-1	X	123176483	GA	AG	AT
SAGE-1	18	48584572	AG	CA	CC
SAGE-1	6	157522350	GA	AG	AC
SAGE-1	12	25398283	CC	AC	AG
SAGE-1	17	7577104	GG	AG	AA
SAGE-1	6	163987796	CC	TC	TT
SAGE-1	1	27088750	CC	TC	TT
SAGE-1	3	187449652	GG	AG	AA
SAGE-1	2	25468143	CC	TC	TT
SAGE-1	3	134960019	CC	TC	TT
SAGE-1	X	53239727	GG	AG	AA
SAGE-1	17	7577536	CC	TC	TT
SAGE-1	8	68956752	GG	TG	TT
SAGE-1	5	170827841	GT	AG	AA
SAGE-1	3	30715663	CC	TC	TT
SAGE-1	1	158582645	TG	CT	CA
SAGE-1	19	42791871	CC	GC	GA
SAGE-1	17	29657312	GG	AG	AA
SAGE-1	4	1807982	GG	AG	AA
SAGE-1	19	11170726	CC	TC	TT
SAGE-1	19	1226604	AG	CA	CT
SAGE-1	16	9857499	CC	TC	TG	TT
SAGE-1	7	55220318	GC	TC	TT
SAGE-1	10	43615631	CC	TC	TG
SAGE-1	12	432834	GG	AG	AA
SAGE-1	X	1321306	CC	AC	AA
SAGE-1	12	92538138	GG	AG	AA
SAGE-1	9	133760318	CC	TC	TT
SAGE-1	6	138197162	CC	TC	TT
SAGE-1	16	89809318	TG	CT	CA
SAGE-1	6	401559	CC	AC	AT
SAGE-1	19	2223346	CC	TC	TT
SAGE-1	7	77973237	CC	TC	TT
SAGE-1	5	233755	GG	AG	AA
SAGE-1	11	119169211	CC	TC	TT
SAGE-1	7	148523612	GG	AG	AA
SAGE-1	17	29562789	GG	AG	AA
SAGE-1	12	57861962	CC	TC	TT
SAGE-1	2	191926536	CC	AC	AT
SAGE-1	7	140482926	GG	AG	AA
SAGE-1	11	119149240	CC	TC	TT
SAGE-1	3	142217555	CT	TC	TA
SAGE-1	17	29652836	GG	AG	AA
SAGE-1	5	121761155	CT	TC	TA
SAGE-1	19	1221338	GG	AG	AT
SAGE-1	7	140453156	CC	AC	AT
SAGE-1	20	31021513	GA	AG	AT
SAGE-1	X	129147804	CC	TC	TT
SAGE-1	5	176722317	GG	TG	TT
SAGE-1	17	29559840	CC	TC	TT
SAGE-1	12	25398283	CC	AC	AA
SAGE-1	3	89448623	CC	TC	TT
SAGE-1	9	133759703	CC	TC	TG
SAGE-1	16	3828794	GG	AG	AA
SAGE-1	12	69230515	AA	GA	GT
SAGE-1	17	78897299	CC	TC	TT
SAGE-1	19	52725474	GG	AG	AA
SAGE-1	1	27100305	GG	TG	TT
SAGE-1	11	119148906	CC	TC	TT

@inodb inodb assigned sheridancbio and unassigned averyniceday Mar 29, 2021