APAlyzer-QSr is a automatic toolkit performing APA analysis using Quant-Seq REV data. In general it uses PolyA DB3 to clean the PASs indetified from the sequencing data.
- Ruijia Wang
- Bin Tian
If you plan to use the APAlyzer-QSr for-profit, you will need to purchase a license. Please contact [email protected] for more information.
Python 3 and R
Genomes that covered by PolyA DB3 including mm9, hg19 and rn5.
numpy pandas pylab scipy HTSeq matplotlib
All these can be install through
pip install packagename
GenomicRanges Biostrings dplyr GenomicAlignments
All these can be install through
install.packages(packagename)
or
BiocManager::install(packagename)
Although the PASs indetification and clean are using bam files as input, this toolkit also provide scripts to handel the raw fastq files from Quan-Seq REV. The following packages are used for QC, trimming and mapping:
https://www.bioinformatics.babraham.ac.uk/projects/fastqc
https://sourceforge.net/projects/bbmap/
https://github.com/alexdobin/STAR
https://lomereiter.github.io/sambamba
All the reference files used for fastq trimming and PAS clean are stored in REF/ folder
The toolkit contains a shell scripts can simply run all steps in one shot aftering setting the path. You can first put all the fastq files under rootdir/project/rawfastq/, then run:
./raw.pip.example.sh
And a expample of path setting in the shell:
THREADS=24
scrdir=/xxx/APAlyzer_qrev/scripts/
project=project1
rootdir=/xxx/my_rootdir/
geno=mm9
genodir=/xxx/genome/mm9_star/
refdir=/xxx/REF/
8)Define the analysis design; 'YES' or 'NO'; setting whether the analysis design is single sample or multiple replicates
Reps='YES'
samplefile=/xxx/Samples/sample_list.txt # Only need when Reps='YES'
treats="COM1_NT COM2_NT"
controls="COM1_TRT COM2_TRT"
Using the setting above, we can analysis a mouse mm9 dataset, and compare the APA in "COM1_TRT" vs "COM1_NT", and "COM2_TRT" vs "COM2_NT".
Sample file are only need when Reps='YES'. This file is a two-column table contain the information of the sample and group. For instance, in the above case, the sample file should looks like:
Sample | Group |
---|---|
NT1_rep1 | COM1_NT |
NT1_rep2 | COM1_NT |
TRT1_rep1 | COM1_TRT |
TRT1_rep2 | COM1_TRT |
NT2_rep1 | COM2_NT |
NT2_rep2 | COM2_NT |
TRT2_rep1 | COM2_TRT |
TRT2_rep2 | COM2_TRT |
File | Folder | Note |
---|---|---|
QC file | rootdir/project/qccheck/ | fastq QC results |
*.clipped.fastq | rootdir/project/fastq/ | trimmed fq file |
*.sorted.bam | rootdir/project/rawsam/ | mapping bam file |
mapping.summary.txt | rootdir/project/tbl/ | mapping summary |
*.pA2gene_usage.DRPM.fix.tbl | rootdir/project/tbl/ | PAS expression profile |
3mostAPA.*.DRPM.fix.tbl | rootdir/project/tbl/3UTR/ | 3'UTR APA results |
UPS.*.cut2.tbl | rootdir/project/tbl/UPS/ | Upstream APA results |
*.png | rootdir/project/plot/ | Scatter plots of 3'UTR and UPS APA |
PAS.nuc_freq.diff_vs_NC.xlsx | rootdir/project/tbl/ | nucleotide frequency (+/- 150 nt from PAS) of the sites that are changing vs same number of sites that do not change |
*.motif_enrichment.diff_vs_NC.xlsx | rootdir/project/plot/ | motif enrichment (4-mers and 6-mers) within +/- 150 nts of the PAS of the sites that are changing vs sites that do not change |
Column | Description |
---|---|
pAid | ID of each PAS, shown as chromosome:Position:Strand |
chromosome | chromosome ID of PAS |
pA_pos | genomic position of PAS |
strand | strand information of PAS |
GENEID | Entrez Gene ID |
gene_symbol | gene symbol |
gene_desc | gene name description |
gene_Biotype | type of the gene, protein-coding or various types of ncRNAs |
LOCATION | Specific PAS annotation for ncRNAs, including 5'exon, 3'exon, singel(S) exon, other exon and intron |
region | PAS location in annotated genes, including 5'UTR, CDS, intron, 3'UTR and various types of ncRNAs. |
pAtype_1 | PAS location in annotated genes, including 5'UTR, CDS, intron, 3'UTR and various types of ncRNAs. For PASs in 3'UTRs, they are further divided into First (F), Middle (M), and Last (L). If there is only one PAS in 3'UTR, it is called S. |
ext | Whether PAS is located on an extended 3' end region beyond RefSeq/Ensembl annotations, the extrension region is annotated by polyA_DB3, Yes/No |
num_* | reads count columns of each sample |
DRPM_* | normalized expression columns count columns of each sample |
Column | Description |
---|---|
gene_symbol | gene symbol |
chromosome | chromosome ID of PAS |
strand | strand information of PAS |
pA_pos_pA1 | genomic position of proximal PAS |
pA_pos_pA2 | genomic position of distal PAS |
pvalue | p-value of alternative polyadenylation |
num_* | reads count columns of each sample |
DRPM_* | normalized expression columns count columns of each sample |
RE_* | Relative expression of each gene in each sample, RE=distal/proximal |
Log2Ratio_pA1 | log2FC of proximal PAS between two groups |
Log2Ratio_pA2 | log2FC of distal PAS between two groups |
Delta_RA | Delta Relative abundance |
RED | Delta Relative expression, RED=Log2Ratio_pA2 - Log2Ratio_pA1 |
pA1.pAutype | alternative polyadenylation pattern, 'UP' for shortening, 'DN' for lengthening, 'NC' for no change |
4. PAS Region definitions in 'PAS.nuc_freq.diff_vs_NC.xlsx' and '*.motif_enrichment.diff_vs_NC.xlsx':
Region | Range | PAS |
---|---|---|
prx_region_1 | -150 ~ -51 | proximal PAS |
prx_region_2 | -50 ~ -1 | proximal PAS |
prx_region_3 | +1 ~ +50 | proximal PAS |
prx_region_4 | +51 ~ +150 | proximal PAS |
dis_region_1 | -150 ~ -51 | distal PAS |
dis_region_2 | -50 ~ -1 | distal PAS |
dis_region_3 | +1 ~ +50 | distal PAS |
dis_region_4 | +51 ~ +150 | distal PAS |
python scripts/step1_2_3.QC_and_mapping.qcREV.py \
--rootdir ROOTPATH \
--project PRJNAME \
--genodir GENOMEPATH \
--refdir REFPATH \
--threads NUM
python scripts/step4.STAR_log_summarizer.qcREV.py --rootdir ROOTPATH --project PRJNAME
python scripts/step5.LAP.hunter.qcREV.py --rootdir ROOTPATH --project PRJNAME
Rscript scripts/step6_1.LAP2PAS.R $rootdir $project $refdir $geno
Rscript scripts/step6_2.Quant_R.pas2gene.builder.R $rootdir $project $refdir $geno
python scripts/step7_1.qcREV.3UTR.APA.tbler.norep.py --project $project --rootdir $rootdir --genome $geno --control $controls --treatment $treats
python scripts/step7_2.qcREV.UR.APA.tbler.norep.py --project $project --rootdir $rootdir --genome $geno --control $controls --treatment $treats
Rscript scripts/step7_1.qcREV.3UTR.APA.tbler.reps.R $rootdir $project $samplefile $geno "$treats" "$controls"
Rscript scripts/step7_2.qcREV.UR.APA.tbler.reps.R $rootdir $project $samplefile $geno "$treats" "$controls"
python scripts/step8_1.qcREV.3UTR.scatter.plotter.py --project $project --rootdir $rootdir --control $controls --treatment $treats
python scripts/step8_2.qcREV.UPS.scatter.plotter.py --project $project --rootdir $rootdir --control $controls --treatment $treats
Rscript $scrdir/step9.ACGT_FREQ.R $rootdir/$project/tbl/ $geno
Rscript $scrdir/step10.motif_enrichment.R $rootdir/$project/tbl/ $geno