Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output VCF is not valid due to symbolic allele used as reference allele #121

Open
vlshesketh opened this issue Jan 15, 2024 · 6 comments
Open

Comments

@vlshesketh
Copy link

The VCFs generated by QDNAseq do not conform to the specification for the reference allele field, as they contain symbolic alleles e.g. <DIP>. This results in errors when trying to either validate the VCF using VCF validator:

Error: Reference is not a string of bases. This occurs 3 time(s), first time in line 14.

Or when loading the VCF into IGV:

The provided VCF file is malformed at approximately line number 14: Symbolic alleles not allowed as reference allele

The specification requires that the reference should a string of one or more bases (A,C,G,T or N, from https://samtools.github.io/hts-specs/VCFv4.4.pdf). Page 31 of the guidelines gives some examples of how to represent copy number events, or alternatively other structural variant/CNV callers use N as the reference base.

@HenrikBengtsson
Copy link
Collaborator

Thanks for this, and thanks for digging out the relevant references.

Can please attach an example of a VCF file so we, or someone else, can reproduce it with vcf_validator < /path/to/file.vcf (assuming that's the call). Also, what vcf_validator version do you use?

I'm not questioning your conclusions; I'm asking so we can create a minimal reproducible example, that we then can add as a unit test, which makes it easier to fix the bug.

@HenrikBengtsson HenrikBengtsson added this to the Next release milestone Jan 15, 2024
@vlshesketh
Copy link
Author

Thanks for the speedy reply! I've attached a VCF which fails validation (renamed extension to allow upload), as well as the output from vcf-validator. I'm using the latest version of vcf-validator which is 0.9.4.

SAMPLE.wf_cnv.vcf.txt
SAMPLE.wf_cnv.vcf.errors_summary.1705397450367.txt

@vlshesketh
Copy link
Author

Hello, is it possible to get an update on when the VCF will be modified so that it validates successfully?

@sachingadakh
Copy link

sachingadakh commented Jul 9, 2024

I run into same issue while loading VCF on IGV, "The provided VCF file is malformed at approximately line number 14: Symbolic alleles not allowed as reference allele"
Is there any update on resolving this issue?
However I replaced DIP with N resolved the issue

@vmkalbskopf
Copy link

I'm running into this error on IGV too.

@HenrikBengtsson
Copy link
Collaborator

Recap summary of this problem

The example VCF file that @vlshesketh provided in #121 (comment) is:

##fileformat=VCFv4.2
##source=QDNAseq-1.34.0
##REF=<ID=DIP,Description="CNV call">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##FILTER=<ID=LOWQ,Description="Filtered due to call in low quality region">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of variant: DEL,DUP,INS">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of variant">
##INFO=<ID=BINS,Number=1,Type=Integer,Description="Number of bins in call">
##INFO=<ID=SCORE,Number=1,Type=Integer,Description="Score of calling algorithm">
##INFO=<ID=LOG2CNT,Number=1,Type=Float,Description="Log 2 count">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	test
21	14000001	.	<DIP>	<DUP>	1000	PASS	SVTYPE=DUP;END=46709983;SVLEN=32709983;BINS=60;SCORE=1;LOG2CNT=0.57	GT	0/1
X	1	.	<DIP>	<DEL>	1000	PASS	SVTYPE=DEL;END=156000000;SVLEN=156000000;BINS=274;SCORE=-1;LOG2CNT=-0.9	GT	0/1
Y	2500001	.	<DIP>	<DEL>	1000	PASS	SVTYPE=DEL;END=27000000;SVLEN=24500000;BINS=26;SCORE=-1;LOG2CNT=-1.04	GT	0/1

Using the lastest vcf-validator 0.10.0;

$ vcf_validator < SAMPLE.wf_cnv.vcf
[info] Reading from standard input...
[info] Summary report written to : stdin.errors_summary.1729547373906.txt
[info] According to the VCF specification, the input file is not valid

gives

$ cat stdin.errors_summary.1729547373906.txt
According to the VCF specification, the input file is not valid
Error: Reference is not a string of bases. This occurs 3 time(s), first time in line 14.
Warning: A valid 'reference' entry is not listed in the meta section. This occurs 1 time(s), first time in line 14.

The problematic line is:

$ sed -n '14p' SAMPLE.wf_cnv.vcf
21	14000001	.	<DIP>	<DUP>	1000	PASS	SVTYPE=DUP;END=46709983;SVLEN=32709983;BINS=60;SCORE=1;LOG2CNT=0.57	GT	0/1

Section 5.6 of the VCF v4.4 specification says:


5.6 Representing copy number variation

To encode copy number variation, VCF uses <CNV>, <DEL> and <DUP> symbolic structural variant alleles, CN INFO and FORMAT fields.

Allele specific copy number is specified through a <CNV> ALT allele for each distinct allelic copy number. INFO CN defines the allele specific copy number with FORMAT CN defining the overall copy number for that sample. POS and INFO SVLEN specify the genomic interval over which the copy number is defined. <DEL> and <DUP> copy number (SVCLAIM=D) alleles should be treated as <CNV> alleles that implicitly define INFO CN=0 and CN=2, respectively. As with all symbolic structural variants, the starting position of the interval is the base immediately after POS. For example, a region on chr1 from position 101 to 130 (both inclusive) with allele-specific copy numbers of 1 and 2 can be represented as follows:

chr1 100 . T <CNV>,<CNV> . . END=130;SVLEN=30,30;CN=1,2 GT:CN 1/2:3

All <CNV> alleles in the same VCF record should have the same SVLEN. To eliminate genotype ambiguity, copy number ALT alleles should not be mixed with other ALT alleles. When only copy number ALT alleles are present in a VCF record, GT=0 is equivalent to a <CNV> ALT allele with INFO CN of 1 and should be treated identically.

If only total copy number is known, the copy number of the segment should be defined with a single <CNV> ALT allele with a missing INFO CN field. In the above example this corresponds to the following:

chr1 100 . T <CNV> . . END=130;SVLEN=30 GT:CN .:3

The granularity of copy number representation is explicitly not defined in these specifications. Copy number segmentation can be base-pair accurate with even 1bp changes deletions resulting in new copy number segments, be at a highly granular megabase level of resolution, or anywhere in between. When the bounds of a copy number segment is not known precisely, this should be encoded in the CIPOS and CILEN INFO fields.


Help needed

I have very little time to work on this; can some please show and explain what output QDNAseq should give instead of the current:

chr1 100 . T <CNV>,<CNV> . . END=130;SVLEN=30,30;CN=1,2 GT:CN 1/2:3

to meet the VCF specifications?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants