-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathA2_Jun_Hyuk_Park.Rmd
440 lines (355 loc) · 17.6 KB
/
A2_Jun_Hyuk_Park.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
---
author: "Jun Hyuk Park"
date: "2023-03-14"
output:
html_document:
toc: yes
toc_depth: 3
always_allow_html: true
title: 'A2_Jun_Hyuk_Park: Differential Gene expression and Preliminary ORA'
---
# Introduction
In assignment 1, I have cleaned the dataset and applied normalization on the bulk RNAseq dataset of juvenile myelomonocytic leukemia. I downloaded this dataset from GEO with id GSE198919. After filteration and normalization, 118 samples and 18608 genes will be analysed.
First, download require packages for this assignment.
```{r Download required packages}
# In this assignmented, I needed knitr, edgeR, ComplexHeatmap,
# circlize, Biobase, grid, gprofiler2 and magick.
required_packages <- c("knitr", "edgeR", "ComplexHeatmap",
"circlize", "Biobase", "grid", "gprofiler2",
"magick")
for(package in required_packages) {
if(!requireNamespace(package, quietly=TRUE)) {
install.pacakges(package)
}
}
```
# Differential Gene Expression
Read the normalized expression data from A1.
```{r read normalized expression data from A1}
# Normalized expression data from A1 is saved as
# Assignments/Assignments2/normalized_a1_expression_data.tsv"
normalized_expression_data <- read.table(file=file.path(getwd(),
"Assignments", "Assignment2",
"normalized_a1_expression_data.tsv"),
header=TRUE, sep="\t", stringsAsFactors=FALSE,
check.names=FALSE)
```
See the overview of the normalized expression data.
```{r Overview of normalized expression data, fig.align="center"}
# Use knitr::kable to show normalized expression data
knitr::kable(normalized_expression_data[1:10, 1:5],
caption="Table 1. Overview of normalized expression data",
format="html")
```
Create a MDS plot of normalized expression data, differentiating between JMML samples(red) and control samples(black).
```{r Plot with colors in MDS labeled as samples, fig.cap="Figure 1. MDS plot of normalzied expression data. Clustered JMML and control samples", fig.align="center"}
# Depending on the types of the samples, assign the colour.
# If it is JMML, assign red. If it is control, assign black.
pat_colors <- c()
for(i in 1:ncol(normalized_expression_data)) {
if(grepl("LMMJ", colnames(normalized_expression_data)[i])) {
pat_colors[i] <- "red"
} else {
pat_colors[i] <- "black"
}
}
# MDS plot.
limma::plotMDS(normalized_expression_data,
col=pat_colors, cex=0.7)
```
```{r Plot with colors in MDS labeled as single letter, fig.cap="Figure 2. MDS plot of normalization data with single letter. Better view of clustered JMML and control samples.", fig.align='center'}
# If sample is JMML, assign L, if it is control, assing C.
pat_labels <- c()
for(i in 1:ncol(normalized_expression_data)) {
if(grepl("LMMJ", colnames(normalized_expression_data)[i])) {
pat_labels[i] <- "L"
} else {
pat_labels[i] <- "C"
}
}
limma::plotMDS(normalized_expression_data, col=pat_colors, labels=pat_labels)
```
Define the samples groups. In French, JMML is leucémie myélomonocytaire juvénile(LMMJ). Therefore, the samples that are labeled as LMMJ are JMML samples.
```{r Samples labeling, fig.align='center'}
# Create a data frame of groups of each sample.
samples <- data.frame(
lapply(colnames(normalized_expression_data), FUN=function(x) {
temp <- unlist(strsplit(x, split="\\."))
if(grepl("LMMJ", x)) {
return(c(paste(tail(temp, n=2)[1], tail(temp, n=2)[2], sep=" "), "JMML"))
} else {
return(c(paste(tail(temp, n=2)[1], tail(temp, n=2)[2], sep=" "), "Control"))
}
}
))
colnames(samples) <- colnames(normalized_expression_data)
rownames(samples) <- c("cell", "cell_type")
samples <- data.frame(t(samples))
# Overview of the group.
knitr::kable(head(samples, n=10,
caption="Table 2. Grouping of samples by \
their types(JMML, Control). Each sample was \
assigned to one of the groups."), format="html")
```
Design a model with grouping in samples.
```{r Model design, fig.align='center'}
model_design <- model.matrix(~samples$cell_type)
knitr::kable(model_design[1:5,],
caption="Table 3. Overview of model design",
type="html")
```
Fit the model into the normalized expression data.
```{r Fit the model to the expression data}
# Create a matrix of normalized expression data.
expression_matrix <- as.matrix(normalized_expression_data)
rownames(expression_matrix) <- rownames(normalized_expression_data)
colnames(expression_matrix) <- colnames(normalized_expression_data)
# Create an expression set.
minimalSet <- Biobase::ExpressionSet(assayData=expression_matrix)
# Fit our model into the expression set.
fit <- limma::lmFit(minimalSet, model_design)
```
Conduct hypothesis testing on the fitted model. Use BH hypothesis correcting method.
```{r Calculate top hits, fig.align='center'}
# Fit using eBayes method.
fit2 <- limma::eBayes(fit, trend=TRUE)
# Topfit
BH_topfit <- limma::topTable(fit2,
coef=ncol(model_design),
adjust.method="BH",
number=nrow(expression_matrix))
output_hits <- BH_topfit
output_hits <- output_hits[order(output_hits$P.Value),]
knitr::kable(head(output_hits, n=10),
caption="Table 4. Overview of gene expression \
hypothesis testing.")
```
5241 genes were signficiantly differentially expressed. I used 0.05 as threshold because 0.05 is strong threshold enough to conclude that gene is expressed differentially in most of experiments and cases.
```{r Get the number of genes that is significantly differentiated}
length(which(output_hits$P.Value < 0.05))
```
Use various hypothesis correction methods such as holm, hochberg, hommel, bonferroni, BY and fdr. Number of genes passed correction by hypothesis correction method are,
```{r Test in various hypothesis correction methods}
# Use varous hypothesis correction methods such as
# holm, hochberg, hommel, bonferroni, BY and fdr.
holm_topfit <- limma::topTable(fit2,
coef=ncol(model_design),
adjust.method="holm",
number=nrow(expression_matrix))
hochberg_topfit <- limma::topTable(fit2,
coef=ncol(model_design),
adjust.method="hochberg",
number=nrow(expression_matrix))
hommel_topfit <- limma::topTable(fit2,
coef=ncol(model_design),
adjust.method="hommel",
number=nrow(expression_matrix))
bonferroni_topfit <- limma::topTable(fit2,
coef=ncol(model_design),
adjust.method="bonferroni",
number=nrow(expression_matrix))
BY_topfit <- limma::topTable(fit2,
coef=ncol(model_design),
adjust.method="BY",
number=nrow(expression_matrix))
fdr_topfit <- limma::topTable(fit2,
coef=ncol(model_design),
adjust.method="fdr",
number=nrow(expression_matrix))
sprintf("BH hypothesis correction: %s", length(which(BH_topfit$adj.P.Val < 0.05)))
sprintf("holm hypothesis correction: %s", length(which(holm_topfit$adj.P.Val < 0.05)))
sprintf("hochberg hypothesis correction: %s", length(which(hochberg_topfit$adj.P.Val < 0.05)))
sprintf("hommel hypothesis correction: %s", length(which(hommel_topfit$adj.P.Val < 0.05)))
sprintf("bonferroni hypothesis correction: %s", length(which(bonferroni_topfit$adj.P.Val < 0.05)))
sprintf("BY hypothesis correction: %s", length(which(BY_topfit$adj.P.Val < 0.05)))
sprintf("fdr hypothesis correction: %s", length(which(fdr_topfit$adj.P.Val < 0.05)))
```
```{r MA plot where gene of interests coloured, fig.cap="Figure 3. MA plot of normalized expression data. The genes of interests are coloured in red", fig.align="center"}
# Color the genes that are differentially expressed.
plot_ma_col <- rep("black", nrow(normalized_expression_data))
for(i in which(BH_topfit$adj.P.Val < 0.05)) {
plot_ma_col[i] <- "red"
}
status <- rep("Control", nrow(normalized_expression_data))
status[which(BH_topfit$adj.P.Val < 0.05)] <- "Gene of interests"
values <- c("Gene of interests")
col <- c("red")
# Plot MA plot.
limma::plotMA(log2(normalized_expression_data[,]), status=status, values=values, col=col, cex=0.35, main="MA plot of normalized expression data", ylab="Expression log-ratio by samples")
```
Visualize the expression levels using a heatmap.
```{r Create a heatmap, out.height="150%", fig.align="center", dpi=200, fig.cap="Figure 4. Normalized expression data Heatmap. Visualize the expression level varying by genes and samples"}
# Create a RNAseq heatmap of normalized expression data.
heatmap_matrix <- normalized_expression_data[,]
rownames(heatmap_matrix) <- rownames(normalized_expression_data)
colnames(heatmap_matrix) <- colnames(normalized_expression_data)
# Scale the values in heatmap_matrix
heatmap_matrix <- t(scale(t(heatmap_matrix)))
# Assign colours of the cells in heatmap. The lowest expression is blue
# and the highest expression is red.
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix), 0,
max(heatmap_matrix)), c("blue", "white", "red"))
# Create a heatmap using ComplexHeatmap::Heatmap
current_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix[,]),
name="Heatmap of normalized dataset",
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
column_names_gp = grid::gpar(
fontsize=2.5),
column_names_rot = 45,
use_raster=FALSE,
heatmap_legend_param=list(
title="Gene expression level")
)
ComplexHeatmap::draw(current_heatmap, heatmap_legend_side="bottom")
```
Annotate the heatmap with division into JMML and control samples. Visualize the top hits.
```{r Create an annotated heatmap, out.height="150%", fig.align="center", dpi=200, fig.cap="Figure 5. Annotated and clustered heatmap. Can see more distinct contrast between JMML samples and control samples."}
# Find top hits with P-value < 0.05.
top_hits <- rownames(output_hits)[output_hits$P.Value<0.05]
# Filter with the genes of interests and scale it.
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
heatmap_matrix_tophits <- heatmap_matrix_tophits[,
c(grep(colnames(heatmap_matrix_tophits),
pattern = "LMMJ"),
which(!grepl(colnames(heatmap_matrix_tophits),
pattern = "LMMJ")))]
heatmap_col <- circlize::colorRamp2(c(min(heatmap_matrix_tophits), 0,
max(heatmap_matrix_tophits)), c("blue", "white", "red"))
class_definitions <- c()
for(i in 1:ncol(heatmap_matrix_tophits)) {
if(grepl("LMMJ", colnames(heatmap_matrix_tophits)[i])) {
class_definitions[i] <- "JMML"
} else {
class_definitions[i] <- "Control"
}
}
# Create an annotation of the heatmap.
ha_colours <- c("orange", "green")
names(ha_colours) <- c("JMML", "Control")
ha <- ComplexHeatmap::HeatmapAnnotation(df=data.frame(
type = class_definitions),
col = list(type = ha_colours))
# Create a heatmap.
current_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits),
name="Heatmap of normalized dataset",
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = FALSE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
top_annotation=ha,
column_names_gp = grid::gpar(fontsize=2.4),
column_names_rot = 45,
use_raster=FALSE,
heatmap_legend_param=list(title="Gene expression level")
)
ComplexHeatmap::draw(current_heatmap, heatmap_legend_side="bottom",
annotation_legend_side="bottom")
```
# Thresholded over-representation analysis
I used glmQLFit method to see which genes are upregulated or downregulated.
```{r glmQLFit to identify upregulated and downregulated genes}
# Gene expression anaylsis
d <- edgeR::DGEList(counts=normalized_expression_data,group=samples$cell_type)
d <- edgeR::estimateDisp(d, model_design)
fit <- edgeR::glmQLFit(d, model_design)
# Conduct glmQLFTest on the dataset using model design.
qlf.pos_vs_neg <- edgeR::glmQLFTest(fit, coef="samples$cell_typeJMML")
```
5997 genes passed the threshold p-value < 0.05
```{r See how many genes are differentially expressed with p-value}
qlf_output_hits <- edgeR::topTags(qlf.pos_vs_neg, sort.by="PValue",
n=nrow(normalized_expression_data))
length(which(qlf_output_hits$table$PValue < 0.05))
```
3490 genes passed the hypothesis correction with p-value < 0.05.
```{r See how many genes passed the hypothesis correction}
length(which(qlf_output_hits$table$FDR < 0.05))
```
Here is the overview of differential expression test using glmQLFit method.
```{r Overview of differential expression}
knitr::kable(head(qlf.pos_vs_neg$table, n=10),
catpion="Table 5. Overview of differential \
expression test result. The test has shown \
some genes were significantly expressed differentially",
format="html")
```
In this over representation analysis, I will use gene ontology biological process, molecular function and cellular component dataset.
```{r See annotation source and version of them}
sources <- c("GO:BP", "GO:MF", "GO:CC")
gprofiler2::get_version_info(organism = "hsapiens")$sources[sources]
```
Get the list of the genes that were upregulated and the p-value of differential gene expression is < 0.05. On this gene list, conduct over representation analysis using gost method from gprofiler2 R package. Plot the result of the anaylsis.
```{r "Differentially expressed genes, over representation analysis, plot", fig.cap="Figure 6. Over representation analysis result by gprofiler. Various related annotationa data were found.", fig.align="center"}
# Conduct a gprofiler query on the gene list.
whole_genes <- rownames(qlf_output_hits$table)[
which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 0)]
whole_gene_gost_result <- gprofiler2::gost(
query=whole_genes, sources=sources, )
gprofiler2::gostplot(whole_gene_gost_result, capped=FALSE, interactive=TRUE)
```
1043 gene sets were returned with threshold p-value 0.05.
```{r See how many gene sets were returned}
nrow(whole_gene_gost_result$result)
```
Differentiate between upregulated genes and downregulated genes.
```{r Differentiate between upregulated and downregulated genes}
# Differentiate between upregulated and downregulated genes
upregulated_genes <- rownames(qlf_output_hits$table)[
which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 0)]
downregulated_genes <- rownames(qlf_output_hits)[
which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC < 0)]
```
Overview of upregulated genes.
```{r upregulated genes}
head(upregulated_genes, n=10)
```
Overview of downregulated genes.
```{r downregulated genes}
head(downregulated_genes, n=10)
```
Thresholded gene set enrichment analysis on upregulated genes.
```{r Run gProfiler gost on the upregualted genes, fig.cap="Figure 7. Gprofiler result on upregulated genes. Several annotation data were found", fig.align="center"}
upregulated_gost_result <- gprofiler2::gost(query=upregulated_genes, organism="hsapiens", sources=sources)
(upregulated_gost_result_plot <- gprofiler2::gostplot(upregulated_gost_result, capped = FALSE, interactive=TRUE))
```
Overview of lowest p-value query result in gene set enrichment analysis on upregulated genes.
```{r Head of upregulated gost result}
knitr::kable(head(upregulated_gost_result$result[
order(upregulated_gost_result$result$p_value),], n=5),
caption="Table 6. Top 5 annotation terms found in the over \
reprepresentation analysis on upregualted genes.",
format="html")
```
Thresholded gene set enrichment analysis on downregulated genes.
```{r gprofiler gost quey on downregulated genes, fig.cap="Figure 8. Gprofiler result on downregulated genes. Several annotation data were found ", fig.align="center"}
downregulated_gost_result <- gprofiler2::gost(query=downregulated_genes, organism="hsapiens", sources=sources)
(downregulated_plot <- gprofiler2::gostplot(downregulated_gost_result, capped = FALSE, interactive=TRUE))
```
Overview of lowest p-value query result in gene set enrichment analysis on downregulated genes.
```{r Overview of lowest p-value query result, fig.align="center"}
knitr::kable(head(downregulated_gost_result$result[
order(downregulated_gost_result$result$p_value),], n=5),
caption="Table 7. Top 5 annotation terms found in the over \
representation analysis on down regualted genes")
```
# Interpretation
**1. Do the over-representation results support conclusions or mechanism discussed in the original paper?**
The over-representation results support the original paper. The authors of the original paper said that there was hypermethylation in JMML samples resulting less expressions of genes.<sup>[1]</sup> My over representation analysis result showed less gene expressions in some genes and it corresponds to the findings in the origianl paper.
**2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.**
Yes, there are some evidences that support some of the results that I see. Ras gene family are know to be oncogenes of JMML.<sup>[2]</sup> Its subfamilies such as HRAS, KRAS, NRAS were found in the gene list of differentially expressed genes.
# Reference
[1] Cavé H, Strullu M, Caye A, Arfeuille C. Transcriptome and methylation analysis of JMML patients compare to healthy prenatal and postnatal samples. Dec 2022. APHP Hôpital Robert Debré.
[2] Bos JL. ras oncogenes in human cancer: a review. Cancer Res. Sep 1989;49(17):4682-9. Erratum in: Cancer Res 1990 Feb 15;50(4):1352.