Assignment #2

Assignment 2 - Differential Gene expression and Preliminary ORA

Objective: Investigate rank genes according to differential expression then perform enrichment analysis with threshold.

Time estimated: 10 h; taken 15 h;

Date started: 2023-3-7 ; completed: 2023-3-14

Introduction

As documented in Assignment #1, we used the normalized and mapped dataset with source:

https://www-ncbi-nlm-nih-gov/geo/query/acc.cgi?acc=GSE104406	Aging Human Hematopoietic Stem Cells Manifest Profound Epigenetic Reprogramming of Enhancers That May Predispose to Leukemia (RNA-Seq of HSCe)

To avoid repetitive calculation, the normalized count and the sample matrix was exported and imported using the following code:

write.csv(samples, "A2_data/GSE104406_samples",row.names = TRUE)
write.csv(normalized_counts_annot, "GSE104406_finalized_normalized_counts",row.names = TRUE)

Then imported with read.table()

Error: the column name of normalized count was raw integer instead of gene id Solution: add row.names = 1

normalized_count_data <- read.table("A2_data/GSE104406_finalized_normalized_counts",
                                    header = TRUE, row.names = 1,
                                    sep = ",",stringsAsFactors = FALSE,
                                    check.names=FALSE)

Differential Gene Expression

We design the models.
- According to the result from assignment 1, gender is suspected to have potential affects, to procure a better model this factor is also considered aside from age group.
- We use lm() instead of glmQLFit() because our dispersion is low, and we want to also account for the gender factor
Issue: young group is mapped to indicator variable 1, not as expected 0. Solution: use reveal(as.factor(), ref = “Young”) to map
inspect with edgeR
- As we did in Assignment 1, we applyed TMM(weighted trimmed mean of M-values) as the recommanded method for normalizing RNA seq dataset to make samples more consistent, and we filter out the weakly expressed genes, thus no further filtering or normalization is needed
p-value and threshold
- We found an abnormally big fold change(100), thus apply log2().
- For purpose of keeping as much valid gene as possible, we use a threshold p-value of 0.05, which is the most common one used for statistical testing.
Volcano

Error in if (output_hits$P.Value >= 0.05 | output_hits$logFC == 1) { : the condition has length > 1 Solution: Used package EnhancedVolcano
Heatmap
- find number of colors with colorRamp2() in package ComplexHeatmap

Thresholded over-representation analysis

Split top hits to up and down regulated genes by removing abs()

up_hits <- output_hits$hgnc_symbol[output_hits$adj.P.Val < 0.05 & output_hits$logFC > 1]

First inspect in multiple databases with package gprofiler2
Use GO:BP as we used in the G:Profiler with the same reason
Visualize top pathways with geom_point() , also include precision and p-values using colors
Combine the lists
Search for the relationship using NCBI and UofT library using the names of the pathways and aging. Quickly screen through the abstract and discussion of the papers and find supporting evidences of the interelationship.

💡 Conclusion and outlook: Successfully perform differential gene expression analysis and enrichment analysis. Found evidences supporting that aging can affect the gene expression hence through the pathways impair functions.

Shield:

This work is licensed under a Creative Commons Attribution 4.0 International License.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assignment #2

Assignment 2 - Differential Gene expression and Preliminary ORA

Introduction

Differential Gene Expression

Thresholded over-representation analysis

Clone this wiki locally