I prepared 2 pipelines: bash command and Jupyter notebook. The bash command can run directly on a linux workstation with over 80GB of free memory. The Jupyter notebook pipeline is for clusters using PBS to submit jobs(qsub). See the details below.
- Python3 (Including package Numpy, Pandas, Scipy, pysam, pyBigWig)
- cutadapt
- fastqc
- STAR
- bedtools
- CLAM (MUST >=v1.2.1)
- wigToBigWig
- Perl
- Homer (including genome files for mm10 and hg19)
- samtools
- Free RAM > 80GB (required)
- CPU with more than 20 threads (recommanded)
- Human/mouse genome annotation (hg19.gtf/mm10.gtf)
- Human/mouse genome sequence (hg19.fasta/mm10.gtf)
- Human/mouse genome size (chr_size.txt)
- Human/mouse genome index for STAR (generated by STAR using genome sequences & annotations)
- rRNA/tRNA annotations for hg19 and mm10
Three directories should be provide:
- script_folder: the full path of subdirectory 'scripts' in this project (i.e.: ~/workspace/MeCP2_iCLIP/scripts)
- data_folder: the full path you wish to store all generated data files. This folder should contains 1 subfolder named raw_fq and put unzipped fq files in it (Undetermined_S0_L003_R1_001.fastq)
- genome_folder: the full path of your genome files, this folder should organized like (i.e.: ~/workspace/genome):
genome/
├── raw
│ ├── hg19
│ │ ├── chr_size.txt
│ │ ├── hg19.fasta
│ │ ├── hg19.gtf
│ │ └── rRNA_tRNA.bed
│ └── mm10
│ ├── chr_size.txt
│ ├── mm10.fasta
│ ├── mm10.gtf
│ ├── mm10.trim.fasta
│ └── rRNA_tRNA.bed
└── star
├── hg19
│ └── ... files of STAR index for hg19
└── mm10
└── ... files of STAR index for mm10
Prepare everything as bash pipeline, remember to run next cell only after all jobs submitted by the previous cell completed, this simple script DOES NOT check job status.
- Demultiplex. NO mismatch allowed in barcode sequence.
- Adapter triming.
- STAR mapping.
- rRNA/tRNA masking. Avoid contamination from rRNAs and tRNAs, generally, this step filtered out ~50% of mapped reads.
- PCR duplication removal. Reads mapped to the same locations and carries same UMI were regard as PCR duplication. ~80% mapped reads removed.
- CLAM preprocess. This step separates multi-mapped reads and uniquely mapped reads.
- CLAM realign. This step realigned multi-mapped reads.
- Peakcalling. As no control group available, use permutation peakcalling instead of negative biomial based peak calling for a better results. Tested peakcalling with bin-width of 100, 150 and 200. Bin-width=200 gave us better results (more consistent between different groups).
- Motif detection.
For all 6 groups, we can see motifs similar to C[ACU]GUG. I'll suggest this motif as MeCP2 binding motif.