Call SNPs from ENCODE DCC Hi-C pipeline (Juicer2). Original version: 7/25/2021. Current version: 09/05/2021.
This is a wrapper script to use GATK4 to call SNPs from Hi-C alignment data as generated by the ENCODE DCC Hi-C pipeline (aka Juicer2).
./run-gatk-after-juicer2.sh [options] <path_to_aligned_dedupped_bam>
ENCODE USAGE (hg38):
./run-gatk-after-juicer2.sh -r [path_to_GCA_000001405.15_GRCh38_no_alt_analysis_set.fna] --gatk-bundle [path_to_GATK_bundle_for_hg38] [path_to_mega_aligned_dedup.bam]
Note: The script creates GCA_000001405.15_GRCh38_no_alt_analysis_set.dict and GCA_000001405.15_GRCh38_no_alt_analysis_set.interval_list unless they are already available in the reference folder. To create them once and reuse them across multiple runs, execute:
gatk CreateSequenceDictionary R=GCA_000001405.15_GRCh38_no_alt_analysis_set.fna O=GCA_000001405.15_GRCh38_no_alt_analysis_set.dict
gatk ScatterIntervalsByNs R=GCA_000001405.15_GRCh38_no_alt_analysis_set.fna OT=ACGT N=500 O=GCA_000001405.15_GRCh38_no_alt_analysis_set.interval_list
path_to_aligned_dedupped_bam
Path to the bam file containing deduplicated alignments of Hi-C reads (output by Juicer2).
-h|--help
Shows help.
-r|--reference <path_to_fasta>
Path to the reference fasta that was used to process the Hi-C data. Required.
-c|--coverage [approx_clean_coverage_in_X_of_genomes]
Target coverage in terms of "clean" alignment data. If the bam coverage exceeds the target, the reads will be subsampled before SNP calling. Set to 0 to disable. Default: 30.
-f|--fraction [fraction_of_input_bam_to_be_used]
Fraction of the alignment data to keep when subsampling. Overrides the target coverage parameter. Default: 1.0.
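The relationship between the coverage target and the subsampling fraction can be sketched as follows. This is an illustration only, not the script's actual code: the function name subsample_fraction is hypothetical, and the sketch assumes the fraction is simply the target coverage divided by the estimated coverage, capped at 1.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the wrapper script): derive a subsampling
# fraction from an estimated coverage and a target coverage, both in X.
subsample_fraction() {
    local estimated="$1" target="$2"
    awk -v e="$estimated" -v t="$target" 'BEGIN {
        # A target of 0 disables subsampling; data at or below the
        # target is kept in full.
        if (t == 0 || e <= t) f = 1; else f = t / e
        printf "%.4f\n", f
    }'
}

subsample_fraction 60 30   # bam has twice the target coverage: prints 0.5000
subsample_fraction 20 30   # below target, keep all reads: prints 1.0000
```

Passing -f explicitely sets the fraction and, as described above, takes precedence over any value derived from -c.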
--gatk-bundle <path_to_GATK_bundle>
Shortcut to use GATK resources for base recalibration and variant recalibration when processing human data. For non-human data, pass known vcf files explicitly, if available, with the --known-sites-for-base-recalibration, --known-sites-for-snp-recalibration and --known-sites-for-indel-recalibration options.
--known-sites-for-base-recalibration <path_to_vcf_file_w_known_polymorphisms>
Use set of variants from a given vcf file for base recalibration. Multiple vcf files can be passed with multiple option invocations.
--known-sites-for-snp-recalibration <path_to_vcf_file_w_known_polymorphisms>
Use the set of variants from a given vcf file for SNP recalibration. Multiple vcf files can be passed with multiple option invocations. All databases are assigned prior=10.0.
--known-sites-for-indel-recalibration <path_to_vcf_file_w_known_polymorphisms>
Use the set of variants from a given vcf file for indel recalibration. Multiple vcf files can be passed with multiple option invocations. All databases are assigned prior=2.0.
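For non-human data, a call that repeats the known-sites options might be assembled as below. All file names are placeholders chosen for illustration; only the option names come from this help text. The sketch prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Every path here is a placeholder, not a real file.
cmd=(./run-gatk-after-juicer2.sh
     -r reference.fasta
     --known-sites-for-base-recalibration known_a.vcf
     --known-sites-for-base-recalibration known_b.vcf
     --known-sites-for-snp-recalibration known_a.vcf
     --known-sites-for-indel-recalibration known_indels.vcf
     aligned_dedup.bam)

# Print the command for review; run it with "${cmd[@]}" once the paths are real.
printf '%s ' "${cmd[@]}"; echo
```

Here --known-sites-for-base-recalibration is given twice to pass two vcf files, as the option descriptions allow.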
-q|--mapq [min_mapping_quality]
Ignore reads with mapping quality less than [min_mapping_quality]. Overrides the default GATK MappingQualityReadFilter. Default: 10.
--min-base-quality-score [min_base_quality_score]
Minimum base quality required to consider a base for variant calling. Passed on to GATK. Default: 10.
--restriction-site-file <path_to_restriction_site_file>
Pass the restriction site file when processing a Hi-C experiment generated with a known restriction enzyme, to exclude regions immediately around the cut sites from variant analysis. Default: ignore option.
--exclusion-interval [num_in_bp]
Use this option to tweak the padding around the restriction site position. Default: 5bp.
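To illustrate what padding around cut sites means, the sketch below turns a restriction site file into padded intervals. It is not the script's own implementation. It assumes a Juicer-style input with one line per chromosome ("<chrom> <pos1> <pos2> ..."), treats every field after the first as a cut-site position, and uses a simple position-minus-padding to position-plus-padding interval; the function name sites_to_exclusion_intervals is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "<chrom> <pos1> <pos2> ..." lines on stdin and
# print one tab-separated "<chrom> <start> <end>" interval per position.
sites_to_exclusion_intervals() {
    local pad="${1:-5}"   # padding in bp; 5 mirrors the documented default
    awk -v pad="$pad" '{
        for (i = 2; i <= NF; i++) {
            start = $i - pad
            if (start < 0) start = 0   # do not run off the chromosome start
            printf "%s\t%d\t%d\n", $1, start, $i + pad
        }
    }'
}

printf 'chr1 3 100 250\n' | sites_to_exclusion_intervals 5
```

With a padding of 5, the site at position 100 yields the interval 95 to 105, and the site at position 3 is clipped to start at 0.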
-t|--threads [num]
Indicate how many threads to use. Default: half of available cores as calculated by parallel --number-of-cores.
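The default thread count described above could be computed along these lines. This is a sketch, not the script's code: the fallback to getconf and the floor of one thread are assumptions added for robustness.

```shell
#!/usr/bin/env bash
# Ask GNU parallel for the core count; fall back to getconf if parallel is
# missing or does not support the option (the fallback is an assumption).
cores=$(parallel --number-of-cores 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null)

# Use half of the cores, but never fewer than one thread.
threads=$(( cores / 2 ))
[ "$threads" -lt 1 ] && threads=1

echo "threads: $threads"
```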
--from-stage [pipeline_stage]
Fast-forward to a particular stage of the pipeline. The pipeline_stage argument can be one of "prep", "sort", "recalibrate_bases", "genotype", "recalibrate_variants" or "cleanup".
--to-stage [pipeline_stage]
Exit after a particular stage of the pipeline. The pipeline_stage argument can be one of "prep", "sort", "recalibrate_bases", "genotype", "recalibrate_variants" or "cleanup".
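How a --from-stage/--to-stage pair selects part of the pipeline can be sketched as below. The stage names are taken from the option descriptions; the assumptions are that stages run in the order listed and that both bounds are inclusive. The helper names stage_index and stages_to_run are hypothetical and do not come from the script.

```shell
#!/usr/bin/env bash
# Stage names in the order they are listed in the help text.
stages=(prep sort recalibrate_bases genotype recalibrate_variants cleanup)

# Print the index of a stage name; return non-zero if the name is unknown.
stage_index() {
    local i
    for i in "${!stages[@]}"; do
        [ "${stages[$i]}" = "$1" ] && { echo "$i"; return 0; }
    done
    return 1
}

# Print the stages that would run for a given from/to pair (inclusive).
stages_to_run() {
    local from to i
    from=$(stage_index "$1") || return 1
    to=$(stage_index "$2") || return 1
    for (( i = from; i <= to; i++ )); do
        echo "${stages[$i]}"
    done
}

stages_to_run recalibrate_bases genotype
```

Under these assumptions, the call above lists recalibrate_bases and genotype, which corresponds to passing --from-stage recalibrate_bases --to-stage genotype to the wrapper.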