From 5e0e08428bc37080a553ba3327c58c13a021169f Mon Sep 17 00:00:00 2001 From: Bernt Popp Date: Sat, 9 Sep 2023 22:16:59 +0200 Subject: [PATCH] add README --- analyses/calling/README.md | 61 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 61 insertions(+) create mode 100644 analyses/calling/README.md diff --git a/analyses/calling/README.md b/analyses/calling/README.md new file mode 100644 index 0000000..451e6ec --- /dev/null +++ b/analyses/calling/README.md @@ -0,0 +1,61 @@ +# MuTect2 Variant Calling Pipeline + +This pipeline is designed to perform variant calling using GATK's MuTect2. It is implemented using Snakemake and allows for a high degree of customization through a configuration file. + +## Requirements + +- Snakemake +- Conda +- GATK (installed through Conda in the pipeline) + +## Installation + +1. Clone this repository to your local system. +2. Ensure you have all the required software installed. + +## Configuration + +Before running the pipeline, you need to configure the following files: + +### `config.yaml` + +This file contains the settings for the pipeline. Here are the details of the settings that you can configure: + +- `final_bam_folder`: The folder containing the final BAM files. +- `final_bam_file_extension`: (Optional) The extension of the BAM files (default: ".bam"). +- `output_folder`: The folder where the output files will be stored. +- `reference_unpacked`: The path to the reference genome file. +- `panel_of_normals`: The path to the Panel of Normals file. +- `af_only_gnomad`: The path to the allele frequency only gnomAD file. +- `mutect_scatter_by_chromosome`: (Optional) Set to `True` to enable scattering by chromosome, `False` otherwise (default: `False`). + +### `calling_metadata.tsv` + +This file contains the metadata for the analyses to be run. It should contain the following columns: + +- `sample1`: The name of the first sample (tumor sample). +- `sample2`: (Optional) The name of the second sample (normal sample). Leave empty for tumor-only analyses. +- `bam1_file_basename`: The basename of the BAM file for the first sample. +- `bam2_file_basename`: (Optional) The basename of the BAM file for the second sample. Leave empty for tumor-only analyses. +- `individual1`: The identifier for the first individual. +- `individual2`: (Optional) The identifier for the second individual. Leave empty for tumor-only analyses. +- `analysis`: The type of analysis to be performed (e.g., "To" for tumor-only). + +## Running the Pipeline + +To run the pipeline, use the following command: + +```sh +sbatch run_mutect2_calling.sh +``` + +The run_mutect2_calling.sh shell script contains the Snakemake command to run the workflow with the appropriate settings and resource allocations. +You may need to edit this script to specify the number of cores and other resources based on your system's configuration. + +## Output +The pipeline produces the following outputs in the output_folder specified in the config.yaml: + +- `variant_calls`: A folder containing the VCF files with the variant calls. +- `logs`: A folder containing the log files for the MuTect2 runs. + +Each VCF file is named with the format `__.vcf.gz`. \ No newline at end of file