RUN pipeline

I prepared 2 pipelines: bash command and Jupyter notebook. The bash command can run directly on a linux workstation with over 80GB of free memory. The Jupyter notebook pipeline is for clusters using PBS to submit jobs(qsub). See the details below.

Run bash pipeline for MeCP2 iCLIP data

Prerequisites

Enviroments

Python3 (Including package Numpy, Pandas, Scipy, pysam, pyBigWig)
cutadapt
fastqc
STAR
bedtools
CLAM (MUST >=v1.2.1)
wigToBigWig
Perl
Homer (including genome files for mm10 and hg19)
samtools

Hardware

Free RAM > 80GB (required)
CPU with more than 20 threads (recommanded)

Files

Human/mouse genome annotation (hg19.gtf/mm10.gtf)
Human/mouse genome sequence (hg19.fasta/mm10.gtf)
Human/mouse genome size (chr_size.txt)
Human/mouse genome index for STAR (generated by STAR using genome sequences & annotations)
rRNA/tRNA annotations for hg19 and mm10

File organization

Three directories should be provide:

script_folder: the full path of subdirectory 'scripts' in this project (i.e.: ~/workspace/MeCP2_iCLIP/scripts)
data_folder: the full path you wish to store all generated data files. This folder should contains 1 subfolder named raw_fq and put unzipped fq files in it (Undetermined_S0_L003_R1_001.fastq)
genome_folder: the full path of your genome files, this folder should organized like (i.e.: ~/workspace/genome):
genome/
├── raw
│   ├── hg19
│   │   ├── chr_size.txt
│   │   ├── hg19.fasta
│   │   ├── hg19.gtf
│   │   └── rRNA_tRNA.bed
│   └── mm10
│   ├── chr_size.txt
│   ├── mm10.fasta
│   ├── mm10.gtf
│   ├── mm10.trim.fasta
│   └── rRNA_tRNA.bed
└── star
├── hg19
│   └── ... files of STAR index for hg19
└── mm10
└── ... files of STAR index for mm10

Run Jupyter notebook pipeline for MeCP2 iCLIP data

Prepare everything as bash pipeline, remember to run next cell only after all jobs submitted by the previous cell completed, this simple script DOES NOT check job status.

Pipeline description

Overview

Step by Step

Demultiplex. NO mismatch allowed in barcode sequence.
Adapter triming.
STAR mapping.
rRNA/tRNA masking. Avoid contamination from rRNAs and tRNAs, generally, this step filtered out ~50% of mapped reads.
PCR duplication removal. Reads mapped to the same locations and carries same UMI were regard as PCR duplication. ~80% mapped reads removed.
CLAM preprocess. This step separates multi-mapped reads and uniquely mapped reads.
CLAM realign. This step realigned multi-mapped reads.
Peakcalling. As no control group available, use permutation peakcalling instead of negative biomial based peak calling for a better results. Tested peakcalling with bin-width of 100, 150 and 200. Bin-width=200 gave us better results (more consistent between different groups).
Motif detection.

Results

Mapping statistic

Motif detection

For all 6 groups, we can see motifs similar to C[ACU]GUG. I'll suggest this motif as MeCP2 binding motif.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
results		results
scripts		scripts
.gitignore		.gitignore
README.md		README.md
pipeline.sh		pipeline.sh
pipline.ipynb		pipline.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RUN pipeline

Run bash pipeline for MeCP2 iCLIP data

Prerequisites

Enviroments

Hardware

Files

File organization

Run Jupyter notebook pipeline for MeCP2 iCLIP data

Pipeline description

Overview

Step by Step

Results

Mapping statistic

Motif detection

About

Releases

Packages

Languages

wkdeng/MeCP2_iCLIP

Folders and files

Latest commit

History

Repository files navigation

RUN pipeline

Run bash pipeline for MeCP2 iCLIP data

Prerequisites

Enviroments

Hardware

Files

File organization

Run Jupyter notebook pipeline for MeCP2 iCLIP data

Pipeline description

Overview

Step by Step

Results

Mapping statistic

Motif detection

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages