This repository contains a collection of R code, RMarkdown documents, and data/annotations which have been used for several different Host-pathogen RNA-Seq analyses performed by our lab.
The main components of this repository are organized as follows:
data/
- Data and annotations generated in our lab, or accessed from open-access journal articles of interest.R/
- R functions for differential expression analysis, plotting, etc. which are generally useful across multiple analyses.Rmd/
- RMarkdown documents each performing some specific task, e.g. RNA-Seq count filtering.
The data/
section is further sub-divided by species for which the resources
are relevant, while the R/
and Rmd/
folders are divided by the function or
task to which they relate.
The RMarkdown documents have been designed in such a way so that each document performs a single function, or some small set of functions, and can be imported as knitr child documents into a larger manuscript-specific analysis.
Individual projects/manuscripts can then build on top of these modular .Rmd
child documents, picking-and-choosing the components which are relevant to
those analyses and extending them with experiment-specific functionality.
The code and functionality within this repository is largely geared towards two types of analysis:
- Differential Expression Analysis
- Co-expression Network Analysis
In addition to these two broad types of analysis, the repo also includes code for handling:
- Batch adjustment and normalization
- Visualization of RNA-Seq sample relationships
- RNA-Seq count filtering
- Enrichment analysis (GO, KEGG, ConsensusPathDB, LeishCyc, ENCODE)
To make use of this functionality you must:
- Create a parent Rmarkdown document
- Clone this repository to an accessible location
- Define a
CONFIG
list variable containing the analysis settings - Use
knit_child()
or thechild=
chunk option to load individual Rmarkdown child documents in the desired order. - Use Rstudio or the
render()
function in the rmarkdown package to render this document into HTML, PDF, etc.
For example, to perform a basic differential expression analysis one might
create a parent manuscript.Rmd
document which contains:
---
title: "Example Differential Expression Manuscript"
---
```{r child='shared/Rmd/init/header_de.Rmd'}
```
```{r child='analysis_settings_v1.0.Rmd'}
```
```{r child='shared/Rmd/init/load_settings_de.Rmd'}
```
Methods
=======
```{r child='shared/Rmd/init/load_counts.Rmd'}
```
```{r child='shared/Rmd/init/load_host_annotations.Rmd'}
```
```{r child='shared/Rmd/init/create_expression_set.Rmd'}
```
```{r child='shared/Rmd/main/filter_counts.Rmd'}
```
```{r child='shared/Rmd/main/gene_visualization.Rmd'}
```
```{r child='shared/Rmd/main/data_prep_de.Rmd'}
```
```{r child='shared/Rmd/main/differential_expression.Rmd'}
```
```{r child='shared/Rmd/results/go_enrichment_de.Rmd'}
```
```{r child='shared/Rmd/results/kegg_enrichment_de.Rmd'}
```
```{r child='shared/Rmd/results/save_output_de.Rmd'}
```
Version Information
===================
```{r}
sessionInfo()
```
In the above example, all but one of the child documents which get imported are contained within this repository.
The one file which is not part of this repo is the file called "analysis_settings_v1.0.Rmd".
This is where all of your experiment-specific settings and desired analysis parameters can be specified.
This includes things like:
- the name of the host and pathogen species being analyzed
- data pre-processing and normalization settings
- p-value cutoffs
- etc.
All of these settings are specified in a single global list variable named CONFIG
,
for example:
```{r}
config <- list(
# general
'pathogen' = 'l. major',
'host' = 'h. sapiens',
'target' = 'host',
'analysis_name' = 'hsapiens-inf-with-lmajor-1.0',
# sample variables of interest
'sample_id' = 'sample_id',
'condition' = 'condition',
'batch' = 'replicate',
# sample metadata
'samples' = tbl_df(read.csv('hsapiens_samples.csv'))
# input count tables
"input_count_tables" = file.path("counts/mmusculus/*.count"),
# normalization / data transformation
'de_cpm' = true,
'de_log2' = true,
'de_voom' = true,
'de_quantile_normalize' = true,
'de_batch_adjust' = 'combat',
# annotations
"organism_db" = "homo.sapiens",
"orgdb_key" = "ensembl",
"organism_genome" = "hg38",
"biomart_dataset" = "hsapiens_gene_ensembl",
# differential expression
"contrast_column" = 'condition',
"de_comparisons" = list(c('uninfected', 'infected')),
)
```
Where possible, reasonable default options have been chosen so that it is not necessary to set every single option, but this is not always possible.
For a complete list of settings, you can find the default configuration options
listed in Rmd/settings/shared.Rmd
, Rmd/settings/coex_network.Rmd
, and
Rmd/settings/differential_expression.Rmd
.
You can also refer to the manuscript specific repositories which show actual example of how this code is being used.
The Bioconductor Genomics Docker container has all the required libraries installed and ready to go.
docker pull denom/bioconductor_docker_genomics:latest
docker run -d \
-v $(pwd):/home/rstudio/data \
-e PASSWORD=bioc \
-p 8787:8787 \
denom/bioconductor_docker_genomics:latest
For questions or suggestions related to this analysis, you are welcome to other submit fixes to Github directly, or contact me at [email protected].