---
title: "R Notebook"
output:
  html_document:
    df_print: paged
---
This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Ctrl+Shift+Enter*.
```{r}
# Simulate a small count matrix: 5 genes x 10 samples (5 "cond", 5 "ctrl")
dat <- matrix(sample(0:1000, 10 * 5), nrow = 5, ncol = 10)
colnames(dat) <- c(paste0("cond", 1:5), paste0("ctrl", 1:5))
rownames(dat) <- paste0("gene", 1:5)
```
```{r}
# Define the two sample groups and build the sample metadata table
condition <- factor(c(rep("cond", 5), rep("ctrl", 5)))
table(condition)
coldata <- data.frame(row.names = colnames(dat), condition)

# Run the standard DESeq2 workflow on the simulated counts
library(DESeq2)
dds <- DESeqDataSetFromMatrix(dat, DataFrame(coldata), design = ~condition)
dds <- DESeq(dds)
res <- results(dds)
res <- res[order(res$padj), ]
summary(res)
write.csv(res, file = "res.csv")
```
Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Ctrl+Alt+I*.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Ctrl+Shift+K* to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike *Knit*, *Preview* does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
---
title: "Assignment 1: Data Set Selection and Initial Processing"
output: html_document
---
Note: many code chunks were adapted from Prof. Isserlin's BCB420 lectures (especially low-count filtering, plotting, and group definition) and Dr. Boris Steipe's online BCH441 modules (particularly package setup and data download).
## Download the Data Set
I chose GEO dataset [GSE224681](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE224681) for this assignment. First, install the packages required to download data sets from GEO.
```{r prepares, echo=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("GEOquery", quietly = TRUE)) {
  BiocManager::install("GEOquery")
}
```
Now download the supplementary file (only if it has not been downloaded already):
```{r}
datafile <- "GSE224681_counts.tsv.gz"
# Download the supplementary counts file only if it is not already present
if (!file.exists(datafile)) {
  files <- GEOquery::getGEOSuppFiles("GSE224681", makeDirectory = FALSE)
}
data <- read.delim(datafile, header = TRUE, check.names = FALSE, row.names = 1)
# data <- read.table(datafile, header = TRUE, row.names = 1, sep = "\t")  # equivalent alternative
dim(data)
```
```{r}
data[1:10,1:4]
```
We see read counts for 63925 genes across 6 columns. Each column corresponds to one sample, so we have 6 samples: two replicate groups of size 3, with and without cisplatin treatment.
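As a quick additional check (not part of the original workflow), we can also look at the original GEO column names and the library sizes, i.e. the total counts per sample, assuming all columns are numeric counts:

```{r}
# Original sample identifiers as provided in the GEO supplementary file,
# and total read counts (library sizes) per sample
colnames(data)
colSums(data)
```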
## Initial Assessment
First, we rename the samples to reflect their treatment groups:
```{r}
new_data <- data
# rename data colnames
colnames(new_data) <- c(paste0("noCisplatin.", 1:3), paste0("cisplatin.", 1:3))
new_data[1:10,]
```
```{r}
rownames(new_data)[1:10]
```
To get a sense of the distribution and quality of the data, we plot the counts for all samples (with and without cisplatin treatment) and compute some summary statistics:
```{r}
dataToGraph <- new_data
# add a pseudocount so log2 evaluates
boxplot(log2(dataToGraph + 0.0001), las = 2, ylab="log2 counts", main = "Expression distribution, pre-filtering", cex.axis=0.7)
```
```{r}
summary(log2(dataToGraph + 0.0001))
```
```{r}
length(unique(rownames(new_data))) == nrow(new_data)
```
The code above returns TRUE, which shows that all gene IDs used as rownames in the data are unique; that is, we do not have duplicate expression rows for any gene.
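If duplicates had been present, a quick way to list them (shown here only as a hypothetical check, since the result is empty for this data set) would be:

```{r}
# List any duplicated row identifiers; returns character(0) when all are unique
rownames(new_data)[duplicated(rownames(new_data))]
```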
Let's also define the 2 groups of interest within our data:
```{r}
# 2 groups
# 1. noCisplatin
# 2. cisplatin
samples <- data.frame(Treatment = c(rep("noCisplatin",3), rep("cisplatin",3)))
rownames(samples) <- colnames(new_data)
samples
```
## Cleaning the Data
We use edgeR to remove uninformative, lowly expressed genes by filtering out rows whose counts are too low across samples.
```{r, echo=FALSE}
if (!requireNamespace("edgeR", quietly = TRUE)) {
  BiocManager::install("edgeR")
}
if (!requireNamespace("limma", quietly = TRUE)) {
  BiocManager::install("limma")
}
```
For our experiment, the size of the smallest group of replicates is n = 3. Use edgeR::cpm to convert the counts to CPM, then keep only genes with at least 1 CPM in at least 3 samples:
```{r}
cpm_data <- edgeR::cpm(new_data)
rownames(cpm_data) <- rownames(new_data)
# Keep genes with CPM >= 1 in at least n = 3 samples
filtered_data <- new_data[rowSums(cpm_data >= 1) >= 3, ]
dim(filtered_data)
```
We were able to filter out...
```{r}
# Number of genes removed by the low-count filter
nrow(data) - nrow(filtered_data)
```
...50690 uninformative features.
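As an aside, edgeR also provides filterByExpr(), which applies a comparable group-aware filtering rule; below is a sketch of how one could cross-check the manual CPM filter. This is optional and not part of the original workflow, and its default thresholds differ, so the number of retained genes is expected to be close but not identical.

```{r}
# Optional cross-check with edgeR's built-in filtering rule.
# Its default thresholds differ from the manual "CPM >= 1 in >= 3 samples" rule,
# so the count of retained genes will be similar but not identical.
keep <- edgeR::filterByExpr(as.matrix(new_data), group = samples$Treatment)
sum(keep)
```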
## Data Normalization
Now that we've filtered out low counts, let's see what the distributions for all our samples look like:
```{r}
boxplot(log2(edgeR::cpm(filtered_data) + 0.0001), ylab = "log2 CPM", main = "Expression distribution, pre-normalization", cex = 0.5, cex.axis=0.7, las=2)
```
The boxplot shows 6 sets of outliers, one per sample. These are zero counts, which appear as extreme low values because of the pseudocount of 0.0001 I added so that log2 returns a finite value instead of -Inf. Let's see how many of these outliers there are:
```{r}
idx <- 1:6
zeroes <- vector(length = 6)
# Count the number of zero counts in each column, i.e. for each sample
for (i in idx) {
  zero_idx <- which(filtered_data[, i] == 0)
  zeroes[i] <- length(zero_idx)
}
zeroes
sum(zeroes)
```
Each element of this vector is the number of zero counts in one sample; they sum to 137 outliers in total. Setting the y-axis limits to exclude these outliers:
```{r}
boxplot(log2(edgeR::cpm(filtered_data) + 0.0001), ylab = "log2 CPM", main = "Expression distribution, pre-normalization", cex = 0.5, cex.axis=0.7, las=2, ylim=c(-5,15))
```
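As an aside, the per-sample zero counts computed by the loop above can also be obtained with a single vectorized expression, which is a handy cross-check:

```{r}
# Equivalent vectorized count of zero entries per sample
colSums(filtered_data == 0)
```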
Normalize the data with edgeR TMM:
```{r}
filtered_matrix <- as.matrix(filtered_data)
# Build a DGEList labelled by treatment group, then compute TMM normalization factors
d <- edgeR::DGEList(counts = filtered_matrix, group = samples$Treatment)
d <- edgeR::calcNormFactors(d)
norm_data <- edgeR::cpm(d)
boxplot(log2(norm_data + 0.0001), ylab = "log2 CPM",
        main = "Expression distribution, post-normalization",
        cex = 0.5, cex.axis = 0.7, las = 2, ylim = c(-5, 15))
```
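We can also inspect the TMM scaling factors that calcNormFactors computed; values close to 1 indicate the libraries were already fairly comparable in composition:

```{r}
# Library sizes and TMM normalization factors for each sample
d$samples
```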
Use MDS plots to visualize how well the samples separate by treatment group. The first plot uses the filtered data before normalization, and the second uses the TMM-normalized data.
```{r}
limma::plotMDS(filtered_data, labels=rownames(samples), col=c("red","blue")[factor(samples$Treatment)], main="Sample feature clustering, pre-normalization")
```
```{r}
limma::plotMDS(d, labels=rownames(samples), col=c("red","blue")[factor(samples$Treatment)], main="Sample feature clustering, post-normalization")
```
We see much clearer separation between different groups (cisplatin vs noCisplatin), as well as tighter clustering between replicates of the same treatment group. This is a good outcome and means our normalization was effective.
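Another optional way to confirm that replicates agree with each other (not part of the original workflow) is the pairwise Spearman correlation of the normalized samples; replicates within a treatment group should correlate most strongly:

```{r}
# Pairwise Spearman correlation between samples on normalized CPM values
round(cor(norm_data, method = "spearman"), 3)
```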
## Mapping to HUGO Symbols
The supplementary file reveals that the rownames in the data are Ensembl gene IDs. We must map these Ensembl IDs to HUGO symbols; for this we need the org.Hs.eg.db annotation package and its functions.
```{r, message=FALSE}
if (!requireNamespace("clusterProfiler", quietly = TRUE)) {
  BiocManager::install("clusterProfiler")
}
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  BiocManager::install("org.Hs.eg.db")
}
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}
library(org.Hs.eg.db)
```
```{r}
# Remove the version number from each Ensembl gene ID
gene.ens.id <- rownames(filtered_data)
gene.ens.id <- stringr::str_replace(gene.ens.id, "\\..*", "")
rownames(filtered_data) <- gene.ens.id

# Map from Ensembl gene IDs to HUGO symbols
gene.symbols <- clusterProfiler::bitr(geneID = gene.ens.id,
                                      fromType = "ENSEMBL",
                                      toType = "SYMBOL",
                                      OrgDb = org.Hs.eg.db)
gene.symbols <- gene.symbols[!duplicated(gene.symbols$ENSEMBL), ]
gene.symbols[1:10, ]
```
```{r}
# Remove all unmapped rows
mapped_indices <- rownames(filtered_data) %in% gene.symbols$ENSEMBL
sum(!mapped_indices)
mapped_data <- filtered_data[mapped_indices, ]
# Duplicated HUGO symbols
table(duplicated(gene.symbols$SYMBOL))
# Merge the data based on the same HUGO symbols
# Set HUGO symbols as rownames of new dataframe
mapped_symbols <- gene.symbols$SYMBOL[match(rownames(mapped_data), gene.symbols$ENSEMBL)]
mapped_data <- aggregate(mapped_data, by=list(mapped_symbols), FUN=sum)
mapped_data <- tibble::column_to_rownames(mapped_data,'Group.1')
# Are all HUGO symbols remaining unique? This must return TRUE
nrow(mapped_data) == length(unique(mapped_symbols))
```
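Because duplicated symbols were merged by summing their counts, the aggregation should not change the total count mass. A small optional sanity check, expected to return TRUE:

```{r}
# Summing rows that share a HUGO symbol should preserve the total counts
sum(mapped_data) == sum(filtered_data[mapped_indices, ])
```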
```{r}
# Final gene coverage of dataset
dim(mapped_data)
```
```{r}
# Number of rows deleted during mapping?
nrow(filtered_data) - nrow(mapped_data)
```
`mapped_data` is the final output dataset of this workflow.
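To reuse the cleaned matrix in later work, it can be saved to disk; the file name below is just a suggested convention, not part of the original workflow:

```{r}
# Save the final mapped count matrix for downstream analysis
# (the file name is an arbitrary choice)
saveRDS(mapped_data, file = "GSE224681_mapped_counts.rds")
```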
## Interpretation
**What are the control and test conditions of the dataset?**
The experiment treats tumor-associated macrophages (TAMs) with cisplatin and compares the induced expression changes with a control group. In this case, the experimental group is TAMs treated with cisplatin, and the control group is TAMs that received no treatment.
**Why is the dataset of interest to me?**
Tumor resistance to chemotherapy and metastatic relapse account for more than 90% of cancer-specific mortality. Tumor-associated macrophages (TAMs) can process chemotherapeutic agents and impair their action, yet little is known about the direct effects of chemotherapy on TAMs. The study explored the role of cisplatin in TAMs. The dataset comes from a project studying genes associated with chemotherapy resistance, and I thought it would be interesting to explore the mechanism of chemotherapy resistance in TAMs.
**Were there expression values that were not unique for specific genes? How did you handle them?**
Yes. A quick test demonstrated that, after mapping, 4 HUGO symbols were duplicated (i.e. multiple Ensembl IDs mapped to the same symbol), which would have produced non-unique expression rows. I merged the rows sharing the same HUGO symbol by summing their counts.
**Were there expression values that could not be mapped to current HUGO symbols?**
Yes, there were. 1034 expression values were unable to be mapped to HUGO symbols. I removed the unmapped rows.
**How many outliers were removed?**
I noticed 6 sets of zero-count outliers (137 values in total) in my data. I didn't remove any because their presence was too consistent across the samples to simply be measurement error. However, I did reframe the boxplots to exclude these leftover low counts to better see the rest of the data.
**How did you handle replicates?**
I split my data into 2 groups of 3: one group per treatment condition, each containing 3 replicates. I kept each replicate separate, but made sure that replicates from the two treatment groups are clearly differentiated in their names. Finally, I used n = 3 (the size of the smallest group of replicates) as the threshold for filtering out low-count, uninformative features, as recommended by edgeR.
**What is the final coverage of your dataset?**
The final coverage of my dataset is 12197 features across 6 samples.
Summarizing the dimensions of the data across the entire workflow:
```{r}
# Dimensions of the data at each stage of the workflow
dim_summary <- list(initial = dim(data),
                    renamed_samples = dim(new_data),
                    removed_low_counts = dim(filtered_data),
                    mapped = dim(mapped_data))
dim_summary
```