The Statistical Utility for RBP Functions (SURF) is an integrative analysis framework to identify alternative splicing (AS), alternative transcription initiation (ATI), and alternative polyadenylation (APA) events regulated by individual RBPs and to elucidate the protein-RNA interactions governing these events. We used SURF to analyze data from 104 RBPs (K562 cells, available from ENCODE).
A detailed vignette is available here.
You can install the development version of surf from GitHub with:
# install.packages("devtools")
devtools::install_github("fchen365/surf")
SURF supports several kinds of event-centric analysis of alternative transcriptional regulation (ATR). Depending on the data you provide, SURF can perform four tasks.
No. | Data | Format | Task |
---|---|---|---|
1 | genome annotation | any (gtf, gff, …) | parse ATR events |
2 | + RNA-seq | alignment (bam) | detect differential ATR events |
3 | + CLIP-seq | alignment (bam) | detect functional association |
4 | + external RNA-seq | summarized table | differential transcriptional activity |
The four tasks of the SURF pipeline are streamlined. Once you have the data in hand (see the following sub-section), each step can be performed with a single function:
library(surf)
event <- parseEvent(anno_file) # task 1
drr <- drseq(event, rna_seq_sample) # task 2
far <- faseq(drr, clip_seq_sample) # task 3
dar <- daseq(far, getRankings(exprMat), ext_sample) # task 4
Here, anno_file, rna_seq_sample, clip_seq_sample, and ext_sample are data descriptions, and exprMat is a table of external transcriptome quantification (e.g., TCGA, GTEx, …).
Describing your data is straightforward. Simply follow the examples below.
For task 1, the path to the annotation file will do.
anno_file <- "gencode.v24.annotation.filtered.gtf"
For task 2, surf needs to know where the alignment files (bam) are and the experimental condition for differential analysis (e.g., RBP “knock-down” and “wild-type” control).
rna_seq_sample <- data.frame(
  row.names = c('sample1', 'sample2', 'sample3', 'sample4'),
  bam = paste0("rna-seq/bam/sample", 1:4, ".bam"),
  condition = c('knock-down', 'knock-down', 'wild-type', 'wild-type'),
  stringsAsFactors = FALSE
)
Similarly, for task 3, surf needs to know where the alignment files (bam) are and the experimental condition (e.g., “IP” and the input control “SMI”).
clip_seq_sample <- data.frame(
  row.names = c('sample5', 'sample6', 'sample7'),
  bam = paste0('clip-seq/bam/', 5:7, '.bam'),
  condition = c('IP', 'IP', 'SMI'),
  stringsAsFactors = FALSE
)
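Before passing a sample description to the pipeline, it can help to sanity-check it. The following is a minimal sketch in base R, not part of surf: check_sample_sheet() is a hypothetical helper, and the requirement of exactly two distinct conditions is an assumption drawn from the two-group examples above (knock-down vs. wild-type, IP vs. SMI).

```r
# Hypothetical helper (not part of surf): sanity-check a sample description.
check_sample_sheet <- function(sample_sheet, check_files = TRUE) {
  # The examples in this README use a `bam` and a `condition` column.
  required <- c("bam", "condition")
  missing_cols <- setdiff(required, colnames(sample_sheet))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Assumption: a two-group comparison needs exactly two distinct conditions.
  n_cond <- length(unique(sample_sheet$condition))
  if (n_cond != 2) {
    stop("expected 2 conditions, found ", n_cond)
  }
  # Optionally warn about alignment files that do not exist on disk.
  if (check_files) {
    absent <- sample_sheet$bam[!file.exists(sample_sheet$bam)]
    if (length(absent) > 0) {
      warning("file(s) not found: ", paste(absent, collapse = ", "))
    }
  }
  invisible(TRUE)
}

# Toy description mirroring the RNA-seq example; file checks are skipped
# because the bam paths here are placeholders.
toy_sample <- data.frame(
  row.names = c("sample1", "sample2", "sample3", "sample4"),
  bam = paste0("rna-seq/bam/sample", 1:4, ".bam"),
  condition = c("knock-down", "knock-down", "wild-type", "wild-type"),
  stringsAsFactors = FALSE
)
ok <- check_sample_sheet(toy_sample, check_files = FALSE)
```

With real data, leaving check_files at its default flags mistyped bam paths before a long-running step starts.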
Finally, for task 4, surf assumes that you have transcriptome quantification summarized in a table exprMat, whose rows correspond to genomic features (e.g., genes, transcripts, …) and whose columns correspond to samples. You can use any measure you prefer (e.g., TPM, RPKM, …). Then, let surf know the sample group (condition) of each column:
ext_sample <- data.frame(
  row.names = colnames(exprMat),
  condition = rep(c('TCGA', 'GTEx'), c(173, 337))
)
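Note that rep() assigns groups by column order, so the group sizes must add up to the number of columns in exprMat (173 + 337 = 510 in the example above). The sketch below illustrates the same construction on a small toy matrix; toy_expr, group_sizes, and the simulated values are made up for illustration and are not part of surf.

```r
set.seed(1)

# Toy expression table: 3 features (rows) by 5 samples (columns).
toy_expr <- matrix(
  rexp(15),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("s", 1:5))
)

# Hypothetical group sizes: the first 2 columns form one group,
# the last 3 columns the other.
group_sizes <- c(TCGA = 2, GTEx = 3)

# The sizes must account for every column, otherwise data.frame() fails
# or samples end up with the wrong label.
stopifnot(sum(group_sizes) == ncol(toy_expr))

toy_ext_sample <- data.frame(
  row.names = colnames(toy_expr),
  condition = rep(names(group_sizes), group_sizes),
  stringsAsFactors = FALSE
)
print(toy_ext_sample)
```

If the columns of your table are not already ordered by group, build the condition vector from sample metadata instead of relying on rep().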
Chen, F., Keleş, S. SURF: integrative analysis of a compendium of RNA-seq and CLIP-seq datasets highlights complex governing of alternative transcriptional regulation by RNA-binding proteins. Genome Biol 21, 139 (2020). doi:10.1186/s13059-020-02039-7