aidenlab/hic2gatk

hic2gatk

Call SNPs from the output of the ENCODE DCC Hi-C pipeline (Juicer2). Original version: 07/25/2021. Current version: 09/05/2021.

DESCRIPTION:

This is a wrapper script that uses GATK4 to call SNPs from Hi-C alignment data generated by the ENCODE DCC Hi-C pipeline (aka Juicer2).

USAGE:

./run-gatk-after-juicer2.sh [options] <path_to_aligned_dedupped_bam>

ENCODE USAGE (hg38):

./run-gatk-after-juicer2.sh -r [path_to_GCA_000001405.15_GRCh38_no_alt_analysis_set.fna] --gatk-bundle [path_to_GATK_bundle_for_hg38] [path_to_mega_aligned_dedup.bam]

Note: The script creates GCA_000001405.15_GRCh38_no_alt_analysis_set.dict and GCA_000001405.15_GRCh38_no_alt_analysis_set.interval_list unless they are already available in the reference folder. To create them once and reuse them across multiple runs, execute:

gatk CreateSequenceDictionary R=GCA_000001405.15_GRCh38_no_alt_analysis_set.fna O=GCA_000001405.15_GRCh38_no_alt_analysis_set.dict
gatk ScatterIntervalsByNs R=GCA_000001405.15_GRCh38_no_alt_analysis_set.fna OT=ACGT N=500 O=GCA_000001405.15_GRCh38_no_alt_analysis_set.interval_list
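The "unless already available" behaviour described in the note can be sketched as a small helper that runs the two commands only when their outputs are missing. This is an illustration, not the wrapper's actual code: ensure_reference_aux is a hypothetical name, and it assumes gatk is on PATH and that the outputs sit next to the FASTA with the final extension replaced.

```shell
# Sketch only: ensure_reference_aux is a hypothetical helper, not part of hic2gatk.
# It creates <prefix>.dict and <prefix>.interval_list next to the FASTA if missing.
ensure_reference_aux() {
    local fasta="$1"
    local prefix="${fasta%.*}"   # strip the final extension (.fna, .fa, .fasta)

    # Sequence dictionary: skip if a non-empty file already exists.
    if [ ! -s "${prefix}.dict" ]; then
        gatk CreateSequenceDictionary R="$fasta" O="${prefix}.dict" || return 1
    fi

    # Interval list of ACGT stretches, same parameters as in the note above.
    if [ ! -s "${prefix}.interval_list" ]; then
        gatk ScatterIntervalsByNs R="$fasta" OT=ACGT N=500 \
            O="${prefix}.interval_list" || return 1
    fi
}

# Usage (assumes gatk is on PATH):
# ensure_reference_aux GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
```

Running the helper a second time is a no-op, so it is safe to call before every run.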

ARGUMENTS:

path_to_aligned_dedupped_bam

Path to the bam file containing deduplicated alignments of Hi-C reads (output by Juicer2).

OPTIONS:

-h|--help

Shows help.

-r|--reference <path_to_fasta>

Path to the reference fasta that was used to process the Hi-C data. Required.

-c|--coverage [approx_clean_coverage_in_X_of_genomes]

Target coverage in terms of "clean" alignment data. If the bam coverage exceeds the target, the reads are subsampled before SNP calling. Set to 0 to disable. Default: 30.

-f|--fraction [fraction_of_input_bam_to_be_used]

Fraction of alignment data to be used for subsampling. Overrides the target coverage parameter. Default: 1.0.
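One way to picture how -c and -f interact is the sketch below. It assumes the subsampling fraction is simply target coverage divided by estimated coverage, capped at 1; the wrapper's actual formula and the way it estimates "clean" coverage are not described here. resolve_fraction is a hypothetical helper name.

```shell
# Sketch only: resolve_fraction is hypothetical; the target/estimated formula is an assumption.
# Usage: resolve_fraction TARGET_COVERAGE ESTIMATED_COVERAGE [EXPLICIT_FRACTION]
resolve_fraction() {
    local target="$1" estimated="$2" explicit="${3:-}"

    # An explicit -f value overrides the coverage-based calculation.
    if [ -n "$explicit" ]; then
        printf '%s\n' "$explicit"
        return
    fi

    # Target 0 disables subsampling; coverage at or below target keeps all reads.
    awk -v t="$target" -v e="$estimated" 'BEGIN {
        if (t <= 0 || e <= t) printf "%.3f\n", 1.0
        else                  printf "%.3f\n", t / e
    }'
}

resolve_fraction 30 75        # prints 0.400
resolve_fraction 30 20        # prints 1.000
resolve_fraction 0 75         # prints 1.000
resolve_fraction 30 75 0.25   # prints 0.25
```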

GATK CONTROL:

--gatk-bundle <path_to_GATK_bundle>

Shortcut to use GATK resources for base recalibration and variant recalibration when processing human data. For non-human data, pass known vcf files explicitly, if available, with the --known-sites-for-base-recalibration, --known-sites-for-snp-recalibration and --known-sites-for-indel-recalibration options.

--known-sites-for-base-recalibration <path_to_vcf_file_w_known_polymorphisms>

Use set of variants from a given vcf file for base recalibration. Multiple vcf files can be passed with multiple option invocations.

--known-sites-for-snp-recalibration <path_to_vcf_file_w_known_polymorphisms>

Use set of variants from a given vcf file for SNP recalibration. Multiple vcf files can be passed with multiple option invocations. All databases are assigned prior=10.0.

--known-sites-for-indel-recalibration <path_to_vcf_file_w_known_polymorphisms>

Use set of variants from a given vcf file for indel recalibration. Multiple vcf files can be passed with multiple option invocations. All databases are assigned prior=2.0.
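For non-human data, "multiple option invocations" means repeating the option once per vcf file. The sketch below only prints such a command line rather than running it. print_known_sites_cmd is a hypothetical helper, and all paths are placeholders; the same pattern applies to --known-sites-for-snp-recalibration and --known-sites-for-indel-recalibration.

```shell
# Sketch only: print_known_sites_cmd is hypothetical and prints, not runs, the command.
# Usage: print_known_sites_cmd REFERENCE BAM VCF...
# Paths containing spaces are not handled in this simplified version.
print_known_sites_cmd() {
    reference="$1"
    bam="$2"
    shift 2

    cmd="./run-gatk-after-juicer2.sh -r $reference"
    # Repeat the option once per known-sites file.
    for vcf in "$@"; do
        cmd="$cmd --known-sites-for-base-recalibration $vcf"
    done
    printf '%s %s\n' "$cmd" "$bam"
}

# Placeholder paths, for illustration only.
print_known_sites_cmd ref/genome.fa aligned/dedup.bam vcf/a.vcf.gz vcf/b.vcf.gz
```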

-q|--mapq [min_mapping_quality]

Ignore reads with mapping quality less than [min_mapping_quality]. Overrides the default GATK MappingQualityReadFilter. Default: 10.

--min-base-quality-score [min_base_quality_score]

Minimum base quality required to consider a base for variant calling. Passed on to GATK. Default: 10.

HIC-SPECIFIC CONTROL:

--restriction-site-file <path_to_restriction_site_file>

Pass the restriction site file when processing a Hi-C experiment generated with a known restriction enzyme to exclude regions immediately around the cut site from variant analysis. Default: ignore option.

--exclusion-interval [num_in_bp]

Use this option to tweak the padding around the restriction site position. Default: 5bp.
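To illustrate what the exclusion amounts to, the sketch below turns restriction site positions into padded intervals. It makes several assumptions that are not confirmed by this README: that the restriction site file uses the Juicer-style layout (one line per chromosome, the name followed by space-separated 1-based positions), that the padding is symmetric, and that BED is a convenient output. sites_to_bed is a hypothetical helper; the wrapper may build its exclusion list differently.

```shell
# Sketch only: sites_to_bed is hypothetical; input layout and symmetric padding are assumptions.
# Usage: sites_to_bed [PAD_BP] < restriction_site_file
# Emits BED (0-based, half-open) covering PAD_BP on each side of every site.
sites_to_bed() {
    awk -v pad="${1:-5}" '{
        for (i = 2; i <= NF; i++) {
            start = $i - 1 - pad          # convert 1-based site to 0-based start
            if (start < 0) start = 0      # clamp at the chromosome start
            printf "%s\t%d\t%d\n", $1, start, $i + pad
        }
    }'
}

# Toy input: three sites on chr1 with the default 5 bp padding.
printf 'chr1 3 100 250\n' | sites_to_bed 5
```

Overlapping intervals are not merged here, and interval ends are not clamped to the chromosome length.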

WORKFLOW CONTROL:

-t|--threads [num]

Indicate how many threads to use. Default: half of available cores as calculated by parallel --number-of-cores.

--from-stage [pipeline_stage]

Fast-forward to a particular stage of the pipeline. The pipeline_stage argument can be "prep", "sort", "recalibrate_bases", "genotype", "recalibrate_variants", "cleanup".

--to-stage [pipeline_stage]

Exit after a particular stage of the pipeline. The argument can be "prep", "sort", "recalibrate_bases", "genotype", "recalibrate_variants", "cleanup".
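The combined effect of --from-stage and --to-stage can be sketched as a selection over the ordered stage list. This assumes the stages run in the order listed above and that both bounds are inclusive; stages_to_run is a hypothetical helper, not the wrapper's own logic.

```shell
# Sketch only: stages_to_run is hypothetical; stage order and inclusive bounds are assumptions.
STAGES="prep sort recalibrate_bases genotype recalibrate_variants cleanup"

# Usage: stages_to_run [FROM_STAGE] [TO_STAGE]
# Prints the stages that would run, one per line.
stages_to_run() {
    from="${1:-prep}"
    to="${2:-cleanup}"
    active=0
    for stage in $STAGES; do
        if [ "$stage" = "$from" ]; then active=1; fi
        if [ "$active" -eq 1 ]; then printf '%s\n' "$stage"; fi
        if [ "$stage" = "$to" ]; then break; fi
    done
}

# Equivalent of: --from-stage genotype --to-stage recalibrate_variants
stages_to_run genotype recalibrate_variants
```

With no arguments the helper lists all six stages, which matches running the pipeline end to end.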
