add README
berntpopp committed Sep 9, 2023
1 parent 3104321 commit 5e0e084
Showing 1 changed file with 61 additions and 0 deletions.
61 changes: 61 additions & 0 deletions analyses/calling/README.md
@@ -0,0 +1,61 @@
# MuTect2 Variant Calling Pipeline

This pipeline performs somatic variant calling with GATK's MuTect2. It is implemented in Snakemake and is customized through a configuration file.

## Requirements

- Snakemake
- Conda
- GATK (installed through Conda in the pipeline)

## Installation

1. Clone this repository to your local system.
2. Ensure Snakemake and Conda are installed. GATK is installed through Conda by the pipeline itself.

## Configuration

Before running the pipeline, you need to configure the following files:

### `config.yaml`

This file contains the pipeline settings. The following keys can be configured:

- `final_bam_folder`: The folder containing the final BAM files.
- `final_bam_file_extension`: (Optional) The extension of the BAM files (default: ".bam").
- `output_folder`: The folder where the output files will be stored.
- `reference_unpacked`: The path to the reference genome file.
- `panel_of_normals`: The path to the Panel of Normals file.
- `af_only_gnomad`: The path to the allele-frequency-only gnomAD file.
- `mutect_scatter_by_chromosome`: (Optional) Set to `True` to enable scattering by chromosome, `False` otherwise (default: `False`).
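
The settings above can be sketched as a small check that fills in the documented defaults. This is only an illustration: `resolve_config` is a hypothetical helper, not part of the pipeline, all paths are placeholders, and the sketch assumes that every setting not marked optional is required.

```python
# Hypothetical helper for illustration; the pipeline's Snakefile may
# read and validate config.yaml differently.

# Assumption: every setting not marked "(Optional)" above is required.
REQUIRED_KEYS = [
    "final_bam_folder",
    "output_folder",
    "reference_unpacked",
    "panel_of_normals",
    "af_only_gnomad",
]

# Defaults as documented above for the optional settings.
OPTIONAL_DEFAULTS = {
    "final_bam_file_extension": ".bam",
    "mutect_scatter_by_chromosome": False,
}


def resolve_config(raw):
    """Return a copy of `raw` with the documented defaults filled in."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise KeyError("missing required config keys: " + ", ".join(missing))
    resolved = dict(OPTIONAL_DEFAULTS)
    resolved.update(raw)  # explicit settings override the defaults
    return resolved


# Placeholder paths, chosen only for this example.
example = {
    "final_bam_folder": "results/final_bam",
    "output_folder": "results/calling",
    "reference_unpacked": "resources/reference.fasta",
    "panel_of_normals": "resources/pon.vcf.gz",
    "af_only_gnomad": "resources/af-only-gnomad.vcf.gz",
}

config = resolve_config(example)
print(config["final_bam_file_extension"])
print(config["mutect_scatter_by_chromosome"])
```

Leaving out the two optional keys, as here, yields the defaults `".bam"` and `False`.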

### `calling_metadata.tsv`

This tab-separated file contains the metadata for the analyses to be run. It should contain the following columns:

- `sample1`: The name of the first sample (tumor sample).
- `sample2`: (Optional) The name of the second sample (normal sample). Leave empty for tumor-only analyses.
- `bam1_file_basename`: The basename of the BAM file for the first sample.
- `bam2_file_basename`: (Optional) The basename of the BAM file for the second sample. Leave empty for tumor-only analyses.
- `individual1`: The identifier for the first individual.
- `individual2`: (Optional) The identifier for the second individual. Leave empty for tumor-only analyses.
- `analysis`: The type of analysis to be performed (e.g., "To" for tumor-only).
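
As a rough sketch of how such a file might be read, the snippet below parses a single tumor-only row. The sample, BAM, and individual names are made up, `is_tumor_only` is a hypothetical helper, and the sketch assumes one analysis per row with a header line naming the columns.

```python
import csv
import io

# Column names as documented above.
COLUMNS = [
    "sample1", "sample2",
    "bam1_file_basename", "bam2_file_basename",
    "individual1", "individual2",
    "analysis",
]

# One made-up tumor-only row: the three optional columns are left empty.
EXAMPLE_TSV = (
    "\t".join(COLUMNS) + "\n"
    + "tumorA\t\ttumorA_final\t\tpatient1\t\tTo\n"
)


def is_tumor_only(row):
    """Hypothetical check: a row is tumor-only when all normal fields are empty."""
    normal_fields = ("sample2", "bam2_file_basename", "individual2")
    return all(not row[field] for field in normal_fields)


def read_metadata(handle):
    """Parse the TSV into a list of dicts and verify the expected columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [col for col in COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return list(reader)


rows = read_metadata(io.StringIO(EXAMPLE_TSV))
for row in rows:
    print(row["sample1"], row["analysis"], is_tumor_only(row))
```

Reading by column name rather than position means the sketch does not depend on the column order in the file.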

## Running the Pipeline

To run the pipeline, submit the wrapper script to the Slurm scheduler:

```sh
sbatch run_mutect2_calling.sh
```

The `run_mutect2_calling.sh` shell script contains the Snakemake command that runs the workflow with the appropriate settings and resource allocations.
You may need to edit this script to set the number of cores and other resources for your system.

## Output

The pipeline produces the following outputs in the `output_folder` specified in `config.yaml`:

- `variant_calls`: A folder containing the VCF files with the variant calls.
- `logs`: A folder containing the log files for the MuTect2 runs.

Each VCF file is named with the format `<individual1>_<analysis>_<chromosome>.vcf.gz`.
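
The naming scheme can be illustrated with a short sketch. `expected_vcf_path` is a hypothetical helper, the chromosome label `chr1` and the folder `results/calling` are placeholders, and placing the file directly inside `variant_calls` is an assumption about the layout.

```python
from pathlib import PurePosixPath


def expected_vcf_path(output_folder, individual1, analysis, chromosome):
    """Build <output_folder>/variant_calls/<individual1>_<analysis>_<chromosome>.vcf.gz.

    Hypothetical helper: it mirrors the documented file name format and
    assumes the VCF sits directly inside the variant_calls folder.
    """
    file_name = f"{individual1}_{analysis}_{chromosome}.vcf.gz"
    return PurePosixPath(output_folder) / "variant_calls" / file_name


# Placeholder values for illustration only.
vcf_path = expected_vcf_path("results/calling", "patient1", "To", "chr1")
print(vcf_path)
```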
