snakemake-qiaseq-mirna

snakemake-qiaseq-mirna is a Snakemake pipeline for the quantification of known miRNAs and piRNAs from sequencing data that is obtained from sequencing libraries prepared with Qiagen's QIAseq miRNA Library Kit.

Authors

Jouni Kujala, University of Eastern Finland, Institute of Clinical Medicine

Usage

Step 1: Install dependencies

This pipeline has two dependencies that must be installed to your system before running this pipeline. These dependencies are

We recommend you to install Conda package management system and then install Snakemake with it. After the succesfull installation of Conda, you can create virtual environment and run Snakemake within them. Use of virtual environments is a recommended way to run pipeline because it isolates pipeline from your operating system and helps to avoid problems with conflicting packages. This can be achieved by running following commands in terminal;

# Install Mamba package manager.
conda install -c conda-forge mamba

# Create virtual environment with Snakemake.
mamba create -c conda-forge -c bioconda -n [name of the environment] snakemake

# Activate virtual environment.
conda activate [name of the environment]

Pipeline utilizes a number of bioinformatic tools that are installed by Snakemake itself. If interested, you can find complete list of used tools from configuration file workflow/envs/env.yaml.

Step 2: Obtain a copy of this pipeline

First step is obviously to obtain a copy of this pipeline to your system. This can be achieved in a number of ways;

Clone this repository by running following commands in your terminal;

# Define directory to which the repository should be cloned.
cd [path to directory]  

# Clone repository
git clone https://github.com/jounikuj/snakemake-qiaseq-mirna.git

Download this repository as zipped file, place it to desired location and unzip it.

Step 3: Set up configuration

Although this pipeline is designed to be as portable as possible, there are some things that must be provided by the user before running this pipeline.

Place your input files to fastq/ directory. Input files are expected to be in .fastq.gz format.
Modify file samples.tsv according to your needs. This file specifies which .fastq.gz file belongs to which sample. File should be provided in following format;
```
> head -n 5 config/samples.tsv

sample  fastq
sampleA fastq/sampleA_S1_L001_R1_001.fastq.gz
sampleA fastq/sampleA_S1_L002_R1_001.fastq.gz
sampleB fastq/sampleA_S2_L001_R1_001.fastq.gz
sampleB fastq/sampleA_S2_L002_R1_001.fastq.gz
...
```
Please note that pipeline renames files according to sample list. For example, .fastq.gz files sampleA_S1_L001_R1_001.fastq.gz and sampleA_S1_L002_R1_001.fastq.gz will produce filenames like sampleA.bam because indicated .fastq.gz files are associated with sampleA in the sample list.
Check file config.yaml. This file contains various parameters such as used adapter sequences and source URLs for used miRNA and piRNA sequences. You are free to try different settings for your data but it is always a good option to start with the defaults.

Step 4: Execute pipeline

After completing the configuration you are ready to proceed to pipeline execution.

Test your configuration by running Snakemake in dryrun mode by executing a following command in your terminal;
```
snakemake --dryrun
```
If no error messages are shown in --dryrun mode, you are ready to proceed into pipeline execution. Execute pipeline by running following command in your terminal;
```
snakemake --use-conda --cores [available cores]
```
Please note that --use-conda parameter is required to run pipeline as it allows Snakemake to install all required dependencies. Snakemake should now print status messages to console and start running. You can follow the execution of pipeline from these messages.

Pipeline description

This section aims to describe the workflow of this pipeline.

Step 1: Read trimming and UMI detection

Raw reads contain a number of different sequences that must be removed before the actual analysis. For this purpose we will use cutadapt and umi_tools. Due to structure of reads prepared with the QIAseq miRNA Library Kit the read trimming and UMI detection is done in sequential order.

First we remove the possible remnants of Illumina Universal Adapter sequence AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC from the 3' end of the read. At this step, reads shorter than 20 bp are discarded from further analysis.
A unique molecular identifier (UMI) of 12 random basepairs is removed from from the 3' end of the read and added to read header. Each RNA molecule is tagged with an unique UMI before sequencing library is amplified with a PCR. Each PCR product originating from the same RNA molecule is expected to have same UMI and this information can be later used to reduce the sequencing errors and quantitative bias caused by PCR amplification.
Adapter sequence GTTCAGAGTTCTACAGTCCGACGATC is trimmed from the 5' end of the reads.
Adapter sequence AACTGTAGGCACCATCAAT is trimmed from the 3' end of the read. By default, pipeline performs some quality-trimming for reads and discards reads shorter than 20 bp.

Step 2: Read alignment

Pipeline downloads FASTA sequences of known miRNAs and piRNAs from miRBase and piRBase. Pipeline uses bowtie2 with --very-sensitive-local parameter to map reads to FASTA sequences. Alignment is done in sequential order:

Reads are first mapped to known miRNA sequences. Unaligned reads are collected to new .fastq.gz file and forwarded to next step.
Unaligned reads are mapped to known piRNA sequences. Again, unaligned reads are collected to new .fastq.gz file sampleA_unaligned.fastq.gz etc.

After each step, aligned reads are converted to BAM files to recude the pipeline's memory usage.

Step 3: Deduplication

Aligned reads are merged to one .bam file to reduce the number of processed files. We use umi_tools to deduplicate raw reads on the basis of their UMI sequences stored in the read header. Reads with the same UMI are collapsed into one read to reflect the state of the sample before library amplification.

Step 4: Read counting

Aligned reads are calculated with samtools idxstats and merged into one file. Read counts are calculated both for the raw reads and deduplicated reads - you can freely choose which one you use in your downstream analyses. Resulting count table should look like this;

> head -n 5 results/counts.tsv

Molecule       sampleA sampleB sampleC
hsa-let-7a-1   0       0       4 
hsa-let-7a-2   0       4       0
hsa-let-7a-3   4       4       0
hsa-let-7b     12      4       4
...

Please note that read counts are not normalized.

Citing this pipeline

At the moment, this pipeline has not been published anywhere. If you use this pipeline in your research, just cite the URL of this repository until we can provide a better option.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
.test		.test
fastq		fastq
workflow		workflow
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
Snakefile		Snakefile
config.yaml		config.yaml
samples.tsv		samples.tsv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

snakemake-qiaseq-mirna

Authors

Usage

Step 1: Install dependencies

Step 2: Obtain a copy of this pipeline

Step 3: Set up configuration

Step 4: Execute pipeline

Pipeline description

Step 1: Read trimming and UMI detection

Step 2: Read alignment

Step 3: Deduplication

Step 4: Read counting

Citing this pipeline

About

Releases

Packages

Languages

License

mywanuo/snakemake-qiaseq-mirna

Folders and files

Latest commit

History

Repository files navigation

snakemake-qiaseq-mirna

Authors

Usage

Step 1: Install dependencies

Step 2: Obtain a copy of this pipeline

Step 3: Set up configuration

Step 4: Execute pipeline

Pipeline description

Step 1: Read trimming and UMI detection

Step 2: Read alignment

Step 3: Deduplication

Step 4: Read counting

Citing this pipeline

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages