Skip to content

Assignment #2

Lola-W edited this page Mar 15, 2023 · 2 revisions

Assignment 2 - Differential Gene expression and Preliminary ORA


Objective: Investigate rank genes according to differential expression then perform enrichment analysis with threshold.

Time estimated: 10 h; taken 15 h;

Date started: 2023-3-7 ; completed: 2023-3-14


Introduction

  • As documented in Assignment #1, we used the normalized and mapped dataset with source:

    https://www-ncbi-nlm-nih-gov/geo/query/acc.cgi?acc=GSE104406 Aging Human Hematopoietic Stem Cells Manifest Profound Epigenetic Reprogramming of Enhancers That May Predispose to Leukemia (RNA-Seq of HSCe)
  • To avoid repetitive calculation, the normalized count and the sample matrix was exported and imported using the following code:

    write.csv(samples, "A2_data/GSE104406_samples",row.names = TRUE)
    write.csv(normalized_counts_annot, "GSE104406_finalized_normalized_counts",row.names = TRUE)

    Then imported with read.table()

    Error: the column name of normalized count was raw integer instead of gene id Solution: add row.names = 1

    normalized_count_data <- read.table("A2_data/GSE104406_finalized_normalized_counts",
                                        header = TRUE, row.names = 1,
                                        sep = ",",stringsAsFactors = FALSE,
                                        check.names=FALSE)

Differential Gene Expression

  1. We design the models.

    • According to the result from assignment 1, gender is suspected to have potential affects, to procure a better model this factor is also considered aside from age group.
    • We use lm() instead of glmQLFit() because our dispersion is low, and we want to also account for the gender factor

    Issue: young group is mapped to indicator variable 1, not as expected 0. Solution: use reveal(as.factor(), ref = “Young”) to map

  2. inspect with edgeR

    • As we did in Assignment 1, we applyed TMM(weighted trimmed mean of M-values) as the recommanded method for normalizing RNA seq dataset to make samples more consistent, and we filter out the weakly expressed genes, thus no further filtering or normalization is needed
  3. p-value and threshold

    • We found an abnormally big fold change(100), thus apply log2().
    • For purpose of keeping as much valid gene as possible, we use a threshold p-value of 0.05, which is the most common one used for statistical testing.
  4. Volcano

    Error in if (output_hits$P.Value >= 0.05 | output_hits$logFC == 1) { : the condition has length > 1 Solution: Used package EnhancedVolcano

  5. Heatmap

    • find number of colors with colorRamp2() in package ComplexHeatmap

Thresholded over-representation analysis

  • Split top hits to up and down regulated genes by removing abs()

    up_hits <- output_hits$hgnc_symbol[output_hits$adj.P.Val < 0.05 & output_hits$logFC > 1]
  • First inspect in multiple databases with package gprofiler2

  • Use GO:BP as we used in the G:Profiler with the same reason

  • Visualize top pathways with geom_point() , also include precision and p-values using colors

  • Combine the lists

  • Search for the relationship using NCBI and UofT library using the names of the pathways and aging. Quickly screen through the abstract and discussion of the papers and find supporting evidences of the interelationship.

💡 Conclusion and outlook: Successfully perform differential gene expression analysis and enrichment analysis. Found evidences supporting that aging can affect the gene expression hence through the pathways impair functions.