Skip to content

Latest commit

 

History

History
23 lines (17 loc) · 3.15 KB

README.md

File metadata and controls

23 lines (17 loc) · 3.15 KB

NeuroGAP downsampling project

This repository contains the code generated and run to systematically compare low-coverage versus GWAS accuracy and sensitivity. Starting data consisted of high coverage whole genomes (target: 30X) from the 5 study sites included in this project. Reads or variants were then downsampled to correspond with a depth of coverage or sites from a GWAS array. Variant call set quality was assessed and performance was compared with respect to the high coverage genomes as a "truth" set.


Processing steps

For sequencing data:

For array data:

For both:

  • Run BEAGLE using this workflow (beagle-refine-impute.wdl) and with corresponding example inputs (beagle-refine-impute.json). Differences between sequencing vs array analyses:
  • Use a 2-step approach for the low-coverage data consisting of genotyping refinement using genotype likelihoods rather than hard genotype calls (fields called GL rather than GT in VCF parlance). After genotypes are refined, use the hard calls to run imputation. BEAGLE v4.1 was used for refinement, and BEAGLE v5.1 was used for imputation. Different reference panels were used for these steps, as refinement is much more computationally intensive (refinement reference panel subsetting script here: refine_reference.py).
  • No point to refinement for array analyses, just run imputation using BEAGLE v5.1

Analytical steps

  • In addition to standard variant call metrics from GATK (gathered routinely as part of the beagle-refine-impute.wdl and standard GATK workflows), gather concordance and sensitivity metrics (relative to full dataset) as in variant_info.py