Add VCF support #107

Open
wants to merge 5 commits into
base: master

nvnieuwk

Adds #105

Adds 3 new options to the predict command:

  1. --vcf to request VCF output (written as a bgzipped VCF file)
  2. --fai to supply a FASTA index from which the contig lines in the VCF header are created
  3. --sample to set the sample name used in the VCF. Defaults to the basename of the outid
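
As a rough illustration of what options 2 and 3 imply, here is a minimal sketch, not the code in this PR. It assumes the standard tab-separated `.fai` layout (contig name in the first column, length in the second). The function names `contig_header_lines` and `default_sample_name` are hypothetical and do not come from WisecondorX.

```python
import os


def contig_header_lines(fai_lines):
    """Turn .fai index lines into VCF ##contig header lines.

    Assumes the standard .fai layout: name, length, then offset columns.
    """
    header = []
    for raw in fai_lines:
        raw = raw.rstrip("\n")
        if not raw:
            continue  # skip blank lines
        name, length = raw.split("\t")[:2]
        header.append(f"##contig=<ID={name},length={int(length)}>")
    return header


def default_sample_name(outid):
    """Hypothetical fallback: use the basename of the outid as sample name."""
    return os.path.basename(outid)


if __name__ == "__main__":
    fai = ["chr1\t248956422\t112\t70\t71\n"]
    for line in contig_header_lines(fai):
        print(line)
    print(default_sample_name("results/sample_01"))
```

Only the first two `.fai` columns are needed for the header, so the sketch ignores the offset and line-width columns.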

The VCF header looks like this:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=248956422>
...
##contig=<ID=chrUn_JTFH01001998v1_decoy,length=2001>
##ALT=<ID=CNV,Description="Copy number variant region">
##ALT=<ID=DEL,Description="Deletion relative to the reference">
##ALT=<ID=DUP,Description="Region of elevated copy number relative to the reference">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##FILTER=<ID=cnvQual,Description="CNV with quality below 10">
##FILTER=<ID=cnvCopyRatio,Description="CNV with copy ratio within +/- 0.2 of 1.0">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SM,Number=1,Type=Float,Description="Linear copy ratio of the segment mean">
##FORMAT=<ID=ZS,Number=1,Type=Float,Description="The z-score calculated for the current CNV">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Sample

And the variants themselves will look like this:

chr1	6260002	WisecondorX_DUP_4	N	<DUP>	.	.	END=6280000;SVTYPE=CNV;SVLEN=20000	GT:SM:ZS	./.:1.0289:6.62867
chr1	6515002	WisecondorX_DEL_1	N	<DEL>	.	.	END=6895000;SVTYPE=CNV;SVLEN=380000	GT:SM:ZS	./.:-0.2654:-5.26266
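
For anyone consuming these records downstream, the following is a minimal sketch of how one such line could be split into its parts using only the Python standard library. `parse_record` is a hypothetical helper, not part of WisecondorX, and it assumes a single sample column as in the header above.

```python
def parse_record(line):
    """Split one tab-separated VCF data line into a small dictionary.

    Assumes exactly one sample column, as in the header shown above.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt, sample = fields[:10]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True

    # FORMAT keys and sample values are ':'-separated and pair up by position.
    sample_dict = dict(zip(fmt.split(":"), sample.split(":")))

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt,
        "qual": qual,
        "filter": flt,
        "info": info_dict,
        "sample": sample_dict,
    }


if __name__ == "__main__":
    record = parse_record(
        "chr1\t6260002\tWisecondorX_DUP_4\tN\t<DUP>\t.\t.\t"
        "END=6280000;SVTYPE=CNV;SVLEN=20000\tGT:SM:ZS\t./.:1.0289:6.62867"
    )
    print(record["id"], record["info"], record["sample"])
```

Values are kept as strings apart from `pos`. A real consumer would more likely use an established VCF library than hand-rolled parsing.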

@nvnieuwk nvnieuwk requested review from matthdsm and mvheetve June 28, 2023 11:27
@nvnieuwk
Author

I moved the fields to INFO, changed the output to segments, and added an ABB flag that is set if the variant is an aberration.

@JspSrs

JspSrs commented Jun 29, 2023

Hi Matthias and others, I do not have a copy of WisecondorX running, but I noticed the VCF output remark (relevant for another project).
For me the DUP_4 in the example is unclear, also in relation to the linear copy number ratio in that example. Is the example just a mock-up, or should it reflect reality? If the latter, does the "_4" mean a copy number of 4 (CN=4, i.e. as with a homozygous tandem duplication)? DUP has a specific meaning: insertion of the exact sequence in tandem. GAIN is more neutral and normally used in CNV analysis. AMP is often used for larger gains, i.e. CN>3 (4 and up).

@nvnieuwk
Author

Hi @JspSrs,
The number is just a count value. The snippet I posted consists of two variants taken from the test VCF, so this value has no real meaning other than making the identifiers unique.

CN currently isn't in the VCF because this isn't supported by WisecondorX at the moment (correct me if I'm wrong @matthdsm).

> DUP has a meaning, like insertion of the exact sequence in tandem. GAIN is more neutral and normally used in CNV analysis. AMP is often for any GAIN amounting more, i.e. CN>3, 4 and up

For this I followed the conventions on CNVs in VCFs with what I could derive from the data available in WisecondorX. I don't think using GAIN is such a good idea since GAIN isn't used in VCFs to specify CNVs.

The info available in the VCF is very limited at the moment. I would like to see it expanded in the future, but for now I can't say whether or when that will happen.

-Nicolas

@JspSrs

JspSrs commented Jun 29, 2023

@nvnieuwk, Hi Nicolas,
Thank you for the prompt response.
Regarding "For this I followed the conventions on CNVs in VCFs with what I could derive from the data available in WisecondorX. I don't think using GAIN is such a good idea since GAIN isn't used in VCFs to specify CNVs.": I think this shows that the VCF format needs more precise definitions, drawn up with input from cytogenomics specialists.
"GAIN" is more versatile, while DUP has a very specific implication in both DNA diagnostics and cytogenomics (i.e. a specific, identical sequence inserted next to the original, in either inverted or tandem orientation).

I mention it here because sometimes improvements must come from the bottom up. ;-)

Jasper Saris, Dept Clinical Genetics, Erasmus MC

3 participants