Changelog v2.3 viralrecon #15
Major enhancements
Included strand-bias annotation for ivar
NGS data are prone to certain types of artifact variant calls, and strand bias is a clear example. In a typical case, all but one of the variant-supporting reads are on the reverse strand, whereas reference-supporting reads are equally represented on both strands. This gives rise to a false-positive scenario known as strand bias [1].
Most current variant callers support strand-bias filtering, but iVar still lacks this functionality (andersen-lab/ivar#5).
This viralrecon release adds that functionality: the artifact is taken into account while converting the iVar variants tsv file to vcf format inside the `ivar_variants_to_vcf.py` script. A Fisher exact test is performed, and the `SB` filter annotation tags variants with a significant strand-bias p-value (< 0.05). In addition, a new INFO field carries the p-value (e.g. `SB_pvalue=1e-05`).
Input tsv:
Output vcf:
The Fisher exact test is based on a contingency table, as described in the GATK documentation [2]:
Code - contingency legend:
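To illustrate the test described above, here is a minimal pure-Python sketch. It assumes the 2x2 table holds forward and reverse read counts for the reference and alternate alleles, derived from iVar's `REF_DP`, `REF_RV`, `ALT_DP` and `ALT_RV` columns. The function names and the `alpha` parameter are hypothetical, and the actual `ivar_variants_to_vcf.py` implementation may differ (for example, by relying on a statistics library).

```python
from math import comb


def fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher exact p-value for a 2x2 strand-count table."""
    row_ref = ref_fwd + ref_rev
    row_alt = alt_fwd + alt_rev
    col_fwd = ref_fwd + alt_fwd
    total = row_ref + row_alt

    def table_prob(x):
        # Hypergeometric probability of x reference reads on the forward strand
        return comb(row_ref, x) * comb(row_alt, col_fwd - x) / comb(total, col_fwd)

    observed = table_prob(ref_fwd)
    lowest = max(0, col_fwd - row_alt)
    highest = min(row_ref, col_fwd)
    probs = [table_prob(x) for x in range(lowest, highest + 1)]
    # Sum every table that is no more likely than the observed one
    # (small relative tolerance guards against floating-point ties)
    p_value = sum(p for p in probs if p <= observed * (1 + 1e-7))
    return min(p_value, 1.0)


def strand_bias_filter(ref_dp, ref_rv, alt_dp, alt_rv, alpha=0.05):
    """Return a (FILTER label, p-value) pair from iVar-style depth columns.

    Assumption: forward depth is total depth minus reverse depth.
    """
    p_value = fisher_exact_two_sided(
        ref_dp - ref_rv, ref_rv, alt_dp - alt_rv, alt_rv
    )
    return ("SB" if p_value < alpha else "PASS"), p_value


# Reference reads split evenly across strands, alternate reads almost all reverse
label, p_value = strand_bias_filter(ref_dp=100, ref_rv=50, alt_dp=20, alt_rv=19)
print(label, f"SB_pvalue={p_value:.3g}")
```

Exact binomial coefficients keep the sketch dependency-free, but they become slow at very high depth; a production script would more likely call a library routine.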
Strand-bias filtering is not recommended for every type of experiment. Amplicon data are prone to strand-bias artifacts because of the PCR-based enrichment procedure, and these artifacts do not necessarily imply a greater probability of a false positive. Moreover, amplicon experiments normally generate deep-coverage data that do not need this type of filtering. That is why `ivar_variants_to_vcf.py` provides a new option, `--ignore-strand-bias`, for skipping the Fisher test; this parameter is set by default when `--protocol amplicon` is used.
Consecutive variants called by iVar belonging to the same codon are now collapsed in one line in order to fix annotation
During variant analysis of SARS-CoV-2, some complex variants change an entire codon, such as the triplet nucleotide change in the B.1.1.7 VOC. Variant callers report these as three separate nucleotide changes instead of a single change spanning the three nucleotides, which leads to a wrong amino acid annotation. This is also a known problem in iVar (andersen-lab/ivar#92), which this viralrecon release fixes, again through the `ivar_variants_to_vcf.py` script.
Input tsv file with three variant lines and wrong `aa` annotation:
Output vcf with three variants belonging to the same codon merged in just one line:
Fixed annotation with snpeff:
As with the strand-bias implementation, the script comes with the parameter `--ignore-merge-codons` if you want the previous `ivar_variants_to_vcf.py` behaviour.
Script logic for consecutive and same codon variants detection.
The script `ivar_variants_to_vcf.py` iterates through the .tsv file, reading each line. It saves the information from each line in a dictionary, which is filled with all the informative fields for up to 3 positions. Once the dictionary holds three positions, the script checks whether they are consecutive and evaluates whether they belong to the same codon. The dictionary acts as a queue of size three and is always evaluated when it is full.
Once the dict is full we evaluate as follows:
![image](https://user-images.githubusercontent.com/3480206/153198106-39deb63b-7c17-4dd1-8bbd-506e3afea072.png)
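The queue-of-three idea can be sketched roughly as follows. This is a simplified illustration, not the script's actual code: it handles only single-nucleotide substitutions, takes a single hypothetical `cds_start` coordinate to decide codon membership, and uses made-up helper names. The exact decision rules in the diagram above may differ.

```python
from collections import deque


def codon_index(pos, cds_start):
    # Assumption: one coding sequence starting at cds_start (1-based, in frame)
    return (pos - cds_start) // 3


def same_codon_run(variants, cds_start):
    """True if the variants sit at consecutive positions inside one codon."""
    positions = [v["POS"] for v in variants]
    consecutive = all(b - a == 1 for a, b in zip(positions, positions[1:]))
    one_codon = len({codon_index(p, cds_start) for p in positions}) == 1
    return consecutive and one_codon


def merge(variants):
    """Collapse consecutive SNVs into one multi-nucleotide record."""
    return {
        "POS": variants[0]["POS"],
        "REF": "".join(v["REF"] for v in variants),
        "ALT": "".join(v["ALT"] for v in variants),
    }


def collapse_codon_variants(variants, cds_start):
    """Walk position-sorted variants through a queue holding at most three."""
    queue = deque()
    merged = []

    def evaluate():
        items = list(queue)
        # Try to merge three variants first, then two; otherwise emit one
        for size in (3, 2):
            if len(items) >= size and same_codon_run(items[:size], cds_start):
                merged.append(merge(items[:size]))
                for _ in range(size):
                    queue.popleft()
                return
        merged.append(queue.popleft())

    for variant in variants:
        queue.append(variant)
        if len(queue) == 3:  # evaluate only when the queue is full
            evaluate()
    while queue:  # flush whatever remains at the end of the file
        evaluate()
    return merged


calls = [
    {"POS": 23403, "REF": "A", "ALT": "G"},
    {"POS": 28280, "REF": "G", "ALT": "C"},
    {"POS": 28281, "REF": "A", "ALT": "T"},
    {"POS": 28282, "REF": "T", "ALT": "A"},
]
# 28274 is used here as an illustrative coding-sequence start
print(collapse_codon_variants(calls, cds_start=28274))
```

Evaluating only when the queue is full means at most three records are held in memory at any time, which is why a fixed-size queue is enough for codon-level merging.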
Option to generate consensus with BCFTools / BEDTools using iVar variants
Another new feature is that viralrecon now lets you choose which software to use for variant calling (iVar or Bcftools) and for consensus genome generation (iVar or Bcftools), so the two can be combined (nf-core#246).
Previous viralrecon versions used iVar by default for both variant calling and consensus genome generation. This combination had some drawbacks related to known iVar issues (andersen-lab/ivar#103, andersen-lab/ivar#97, andersen-lab/ivar#85).
Now, viralrecon performs variant calling using iVar, filters those variants as explained above for strand bias and merged codons, and finally generates the consensus genome with BCFTools / BEDTools from the filtered variants called by iVar. This produces the following differences in the final consensus fasta files:
This is fixed when creating the consensus with iVar filtered variants:
First sequence is reference, second sequence is the consensus generated by Bcftools and third sequence is consensus generated with iVar.
iVar's tsv file will look like this:
The deletion has an allele frequency below the 0.75 threshold set in the consensus filter, yet it is added to the iVar consensus; it is not added to the Bcftools consensus.
First sequence is reference, second sequence is the consensus generated by Bcftools and third sequence is consensus generated with iVar.
iVar's .tsv file will look like this:
It was supposed to be a deletion, not an N nucleotide, and the N does not appear when creating the consensus with Bcftools.
As explained in iVar's manual, if one base is not enough to match a given frequency, then an ambiguous nucleotide is called at that position, which means low-frequency variants are included. Example:
These variants are at 0.3 AF, so the reference nucleotide AF does not reach the minimum 0.75 AF, and both alleles are included in the consensus as ambiguous nucleotides:
First sequence is reference, second sequence is the consensus generated by Bcftools and third sequence is consensus generated with iVar.
The iVar consensus introduces R (A or G) at position 27665 and S (G or C) at position 27666, when only the reference nucleotide should be included. This is fixed when creating the consensus with Bcftools.
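The difference between the two behaviours can be modelled with a small sketch. It is a simplified reading of the rule quoted from iVar's manual, set against a plain frequency filter on the variants. Neither function reproduces iVar's or Bcftools' real code, and the function names and the example frequencies (reference `A` at 0.7, alternate `G` at 0.3) are hypothetical.

```python
# IUPAC nucleotide codes keyed by the set of bases they represent
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def ambiguity_call(freqs, threshold=0.75):
    """Add bases, most frequent first, until the threshold is reached.

    Simplified model of the manual's rule: when one base alone does not
    reach the threshold, the call becomes an ambiguous nucleotide.
    """
    chosen = set()
    cumulative = 0.0
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    for base, freq in ranked:
        if cumulative >= threshold:
            break
        chosen.add(base)
        cumulative += freq
    return IUPAC[frozenset(chosen)]


def filtered_variant_call(ref_base, freqs, threshold=0.75):
    """Keep the reference unless an alternate allele reaches the threshold."""
    for base, freq in freqs.items():
        if base != ref_base and freq >= threshold:
            return base
    return ref_base


freqs = {"A": 0.7, "G": 0.3}  # hypothetical frequencies at one position
print(ambiguity_call(freqs))              # ambiguous call
print(filtered_variant_call("A", freqs))  # reference retained
```

With these inputs the first function yields an ambiguity code because 0.7 alone falls short of 0.75, while the second keeps the reference because the 0.3 AF variant never passes the filter.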
When there are deletions in iVar's tsv file with low allele frequency, the reference should be included, but iVar introduces Ns instead:
First sequence is reference, second sequence is the consensus generated by Bcftools and third sequence is consensus generated with iVar.
The tsv file looks like this:
In the consensus, the reference nucleotides should be included as with Bcftools.
New variants and lineage report table
viralrecon now provides a new variants report table that unifies variant calling, annotation and, if desired, lineage. This table can be really useful for variant inspection, co-infections or metagenomics data such as sewage SARS-CoV-2 sequencing.
Pipeline validation and benchmarking
The pipeline has been validated on 54 SARS-CoV-2 samples sequenced with the ARTIC amplicon scheme v4. These samples have a mixed composition of SARS-CoV-2 lineages including B.1.1.7, AY.* and BA.*, which are known to have problematic deletions and triplets.
Bibliography:
[1] Koboldt, D.C. Best practices for variant calling in clinical sequencing. Genome Med 12, 91 (2020).
[2] GATK Team. Fisher's Exact Test (2020).
Special acknowledgement for this documentation to:
@svarona
@ErikaKvalem
@Alema91
@saramonzon
@drpatelh