This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants, with the option of filtering variants with GATK hard filters or VQSR, or of calling variants with DeepVariant instead.
Improvements over the original pipeline (https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling.git):
- Updated software versions;
- Use of GATK Spark tools where possible;
- More robust configuration of the human genome version;
- Added the GATK VQSR filtering option;
- Added DeepVariant as an alternative variant caller.
$ mamba create -c conda-forge -c bioconda --name call_variants snakemake snakedeploy
$ source activate call_variants
$ mkdir -p path/to/project-workdir
$ cd path/to/project-workdir
$ snakedeploy deploy-workflow https://github.com/bioinformatics-ua/GDI_Pipeline.git . --branch main
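## check your glibc version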
$ ldd --version
Snakedeploy will create two folders, workflow and config. The workflow folder contains the deployed workflow as a Snakemake module, and config contains the configuration files which will be modified in the next step in order to adapt the workflow to your needs. Later, when executing the workflow, Snakemake will automatically find the main Snakefile in the workflow subfolder.
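After deployment, the project directory should look roughly like this (only the files referenced in this README are shown; the deployed workflow contains additional files):

path/to/project-workdir/
├── workflow/
│   └── Snakefile        # declares the deployed workflow as a module
└── config/
    ├── config.yaml      # workflow configuration
    ├── samples.tsv      # sample sheet
    └── units.tsv        # unit sheet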
To configure this workflow, modify config/config.yaml according to your needs (a sketch of the relevant entries is shown after the list below):
- To run GATK with hard filtering, set params: -> algorithm: "gatk" and filtering: -> vqsr: false
- To run GATK with VQSR, set params: -> algorithm: "gatk" and filtering: -> vqsr: true
- To run DeepVariant, set params: -> algorithm: "deepvariant"
- Set vqsr_resources: -> path: "<vqsr directory here>" to the directory containing the VQSR resource files.
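A minimal sketch of the relevant config/config.yaml entries, assuming the key names and nesting quoted above (your copy of the file contains additional settings, and the VQSR path is a placeholder):

params:
  algorithm: "gatk"              # or "deepvariant"
filtering:
  vqsr: false                    # true enables the GATK VQSR filter
vqsr_resources:
  path: "<vqsr directory here>"  # directory with the VQSR resource files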
Add samples to config/samples.tsv. Only the column sample is mandatory, but any additional columns can be added.
For each sample, add one or more sequencing units (runs, lanes or replicates) to the unit sheet config/units.tsv. For each unit, define the platform and either one (column fq1) or two (columns fq1, fq2) FASTQ files (these can point to anywhere in your system); see the example below.
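A hypothetical example of the two sheets (tab-separated in the real files; sample names, unit names and FASTQ paths are made up, and a unit column is assumed to distinguish multiple units of the same sample):

config/samples.tsv:
sample
A
B

config/units.tsv:
sample  unit  platform  fq1                       fq2
A       1     ILLUMINA  fastq/A_L001_R1.fastq.gz  fastq/A_L001_R2.fastq.gz
B       1     ILLUMINA  fastq/B_L001_R1.fastq.gz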
The pipeline will jointly call all samples that are defined, following the GATK best practices.
If you want to try the workflow with some example data, go to the fastq directory and run download_some_samples_to_test.sh. The config/samples.tsv and config/units.tsv config files are already set up for these samples.
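For example, from the project directory:

$ cd fastq
$ bash download_some_samples_to_test.sh
$ cd ..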
To run the workflow locally on your server only:
$ snakemake --jobs 20 --use-conda
To run with the Sun Grid Engine (SGE) profile:
$ snakemake --profile sge --jobs 20 --use-conda
To run with the SLURM profile:
$ snakemake --profile slurm --jobs 20 --use-conda
To add a cluster environment for Snakemake you need to install a profile in your home path. This is the case for SGE:
$ cookiecutter https://github.com/Snakemake-Profiles/sge.git
## add the queue name in the next file
$ vi ~/.config/snakemake/sge/cluster.yaml
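A hypothetical fragment of that file; the exact key name for the queue depends on the profile version, so treat this only as an illustration and check the profile's README:

__default__:
  queue: "all.q"   # hypothetical key and queue name; replace with your SGE queue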