add README

scholl-lab · Sep 9, 2023 · 5e0e084
1 parent 3104321
commit 5e0e084
Showing 1 changed file with 61 additions and 0 deletions.
diff --git a/analyses/calling/README.md b/analyses/calling/README.md
@@ -0,0 +1,61 @@
+# MuTect2 Variant Calling Pipeline
+
+This pipeline performs variant calling with GATK's MuTect2. It is implemented in Snakemake and can be customized through a configuration file.
+
+## Requirements
+
+- Snakemake
+- Conda
+- GATK (installed through Conda in the pipeline)
+
+## Installation
+
+1. Clone this repository to your local system.
+2. Ensure you have all the required software installed.
+
+## Configuration
+
+Before running the pipeline, you need to configure the following files:
+
+### `config.yaml`
+
+This file contains the pipeline settings. You can configure the following:
+
+- `final_bam_folder`: The folder containing the final BAM files.
+- `final_bam_file_extension`: (Optional) The extension of the BAM files (default: ".bam").
+- `output_folder`: The folder where the output files will be stored.
+- `reference_unpacked`: The path to the reference genome file.
+- `panel_of_normals`: The path to the Panel of Normals file.
+- `af_only_gnomad`: The path to the allele-frequency-only gnomAD file.
+- `mutect_scatter_by_chromosome`: (Optional) Set to `True` to enable scattering by chromosome, `False` otherwise (default: `False`).
+
+### `calling_metadata.tsv`
+
+This file contains the metadata for the analyses to be run. It should contain the following columns:
+
+- `sample1`: The name of the first sample (tumor sample).
+- `sample2`: (Optional) The name of the second sample (normal sample). Leave empty for tumor-only analyses.
+- `bam1_file_basename`: The basename of the BAM file for the first sample.
+- `bam2_file_basename`: (Optional) The basename of the BAM file for the second sample. Leave empty for tumor-only analyses.
+- `individual1`: The identifier for the first individual.
+- `individual2`: (Optional) The identifier for the second individual. Leave empty for tumor-only analyses.
+- `analysis`: The type of analysis to be performed (e.g., "To" for tumor-only).
+
+## Running the Pipeline
+
+To run the pipeline, use the following command:
+
+```sh
+sbatch run_mutect2_calling.sh
+```
+
+The `run_mutect2_calling.sh` shell script contains the Snakemake command that runs the workflow with the appropriate settings and resource allocations.
+You may need to edit this script to specify the number of cores and other resources based on your system's configuration.
+
+## Output
+The pipeline produces the following outputs in the `output_folder` specified in `config.yaml`:
+
+- `variant_calls`: A folder containing the VCF files with the variant calls.
+- `logs`: A folder containing the log files for the MuTect2 runs.
+
+Each VCF file is named with the format `<individual1>_<analysis>_<chromosome>.vcf.gz`.
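
The naming rule in the README's Output section can be illustrated with a short Python sketch. This is not the pipeline's actual code: the helpers `read_metadata` and `expected_vcf_names` are hypothetical, the chromosome list and the sample row are made up, and it assumes `calling_metadata.tsv` is tab-separated with a header row naming the documented columns.

```python
import csv
import io

# Assumed example chromosomes; the real pipeline may use a different set.
EXAMPLE_CHROMOSOMES = ["chr1", "chr2", "chrX"]


def read_metadata(handle):
    """Parse a tab-separated metadata table with a header row into dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def expected_vcf_names(row, chromosomes):
    """Build names following <individual1>_<analysis>_<chromosome>.vcf.gz."""
    return [
        f"{row['individual1']}_{row['analysis']}_{chrom}.vcf.gz"
        for chrom in chromosomes
    ]


# Made-up tumor-only row: sample2, bam2_file_basename and individual2 are empty.
EXAMPLE_TSV = (
    "sample1\tsample2\tbam1_file_basename\tbam2_file_basename\t"
    "individual1\tindividual2\tanalysis\n"
    "S1_tumor\t\tS1_tumor.final\t\tIND1\t\tTo\n"
)

rows = read_metadata(io.StringIO(EXAMPLE_TSV))
for row in rows:
    for name in expected_vcf_names(row, EXAMPLE_CHROMOSOMES):
        print(name)
```

Looking columns up by name rather than by position means the sketch does not depend on the column order in the metadata file, which the README does not specify.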