Assignment #1

Lola-W edited this page Feb 15, 2023 · 1 revision

Assignment #1 - Data set selection and initial Processing.


Objective: Data exploration on expression dataset

Time estimated: 10 h; taken 12 h;

Date started: 2023-02-10; completed: 2023-02-14


Select an Expression Data Set

Clean the data

  1. Download the data with GEO2R and inspect the metadata

    kable(data.frame(head(Meta(gse))), format = "html")
  2. Assess data quality for the control and test conditions

    Issue: Error in cpm.default(abs(raw_dat[, 2:21])) : library sizes should be finite and non-negative
    Solution: Inspected the data with summary(raw_dat) instead of table(raw_dat), found a row of NA values, and removed it.
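The fix described above can be sketched in base R on a toy table. This is only an illustration: the data frame, its column names, and its values are invented, and the actual raw_dat has 20 sample columns.

```r
# Toy count table: first column holds gene IDs, the rest are samples.
# All names and values here are made up for illustration.
raw_dat <- data.frame(
  gene = c("ENSG00000000003", "ENSG00000000005", NA, "ENSG00000000419"),
  s1   = c(120, 0, NA, 35),
  s2   = c(98, 3, NA, 41),
  stringsAsFactors = FALSE
)

# summary() reports the number of NA's in each numeric column,
# which makes a row of missing values easy to spot.
print(summary(raw_dat[, -1]))

# Flag rows in which any count is missing or not finite.
counts  <- as.matrix(raw_dat[, -1])
bad_row <- rowSums(!is.finite(counts)) > 0

# Keep only the rows with complete, finite counts.
clean_dat <- raw_dat[!bad_row, ]

# Library sizes (column sums) are now finite and non-negative,
# which is what cpm() requires.
lib_sizes <- colSums(clean_dat[, -1])
```

Checking the library sizes after filtering confirms that the condition named in the error message is satisfied before cpm() is called again.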

Apply Normalization

  • Method?
    • Plots: before and after normalization, placed side by side using fig.align
    • Used four types of plots
      • The MDS plot suggests that some of the differences between samples may be caused by gender; this should be taken into account in future analysis.
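A rough sketch of the before/after comparison and the MDS check, in base R only. The normalization method is not recorded above, so simple library-size scaling to counts per million stands in for it here, and classical MDS (cmdscale) stands in for whichever MDS plot was used. The simulated counts, sample names, and gender labels are all assumptions made for illustration.

```r
# Simulated counts: 200 genes x 6 samples with different sequencing depths.
# The data, sample names and gender labels are invented for this sketch.
set.seed(1)
depth  <- c(1, 1.5, 0.7, 2, 1.2, 0.9)
counts <- sapply(depth, function(d) rnbinom(200, mu = 50 * d, size = 2))
colnames(counts) <- paste0("sample", 1:6)
gender <- factor(c("F", "M", "F", "M", "F", "M"))

# Library-size scaling to counts per million, then log2 with an offset of 1.
log_cpm <- function(x) log2(t(t(x) / colSums(x)) * 1e6 + 1)
before  <- log2(counts + 1)
after   <- log_cpm(counts)

# Two panels in one row: before on the left, after on the right.
old_par <- par(mfrow = c(1, 2))
boxplot(before, las = 2, main = "Before normalization",
        ylab = "log2(count + 1)")
boxplot(after, las = 2, main = "After normalization",
        ylab = "log2(CPM + 1)")
par(old_par)

# Classical MDS on sample-to-sample distances, coloured by gender.
mds <- cmdscale(dist(t(after)), k = 2)
plot(mds, col = as.integer(gender), pch = 19,
     xlab = "Dimension 1", ylab = "Dimension 2", main = "MDS of samples")
legend("topright", legend = levels(gender),
       col = seq_along(levels(gender)), pch = 19)
```

If samples of the same gender cluster together in such a plot, gender is a candidate covariate for the later differential expression model.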

Map to HUGO symbols

  • Map rows to HUGO gene symbols
    1. Search for the human dataset, whose gene IDs start with ENSG

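      library(biomaRt)
      library(knitr)
      library(kableExtra)  # row_spec() and %>%

      ensembl <- useMart("ensembl")
      datasets <- listDatasets(ensembl)  # table of available datasets
      kable(head(datasets[grep(datasets$dataset,
                        pattern = "sapiens"),]),format = "html")

      # Select the human dataset, then search its attributes for ENSG and HGNC
      ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
      kable(searchAttributes(mart = ensembl, 'ensembl|hgnc')[1:12,] ,
            format="html") %>%
        row_spec(c(1,11), background = "yellow")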
    2. Unmapped rows:

      1 alignment_not_unique
      10 ENSG00000108264
      • Two types of missing mapping: either the ID is not a valid Ensembl ID (i.e. a side note rather than meaningful data), or the Ensembl ID and its corresponding HGNC symbol both exist but are not included in our mapping database
      • alignment_not_unique should have been removed earlier, during data cleaning
    • No rows that map to more than one symbol
    • Multiple rows that map to the same symbol: keep all of them, because a one-to-one mapping cannot be expected
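The mapping checks listed above can be sketched in base R as follows. The row IDs, the mapping table, and the symbols GENE1 and GENE2 are invented for illustration; in practice the mapping table would come from a biomaRt query, and its column names are assumed here.

```r
# Hypothetical inputs: row IDs from the expression table and a mapping
# table with one Ensembl ID and one HGNC symbol per row.
row_ids <- c("ENSG00000000001", "ENSG00000000002", "ENSG00000000003",
             "ENSG00000108264", "alignment_not_unique")
id_map <- data.frame(
  ensembl_gene_id = c("ENSG00000000001", "ENSG00000000002",
                      "ENSG00000000003"),
  hgnc_symbol     = c("GENE1", "GENE2", "GENE2"),
  stringsAsFactors = FALSE
)

# Rows with no entry in the mapping table.
unmapped <- setdiff(row_ids, id_map$ensembl_gene_id)

# Of those, entries that are not Ensembl gene IDs at all (side notes).
not_ensembl <- unmapped[!grepl("^ENSG[0-9]{11}$", unmapped)]

# IDs that map to more than one symbol.
symbols_per_id <- tapply(id_map$hgnc_symbol, id_map$ensembl_gene_id,
                         function(s) length(unique(s)))
multi_symbol_ids <- names(symbols_per_id)[symbols_per_id > 1]

# Symbols shared by more than one ID.
ids_per_symbol <- tapply(id_map$ensembl_gene_id, id_map$hgnc_symbol,
                         function(i) length(unique(i)))
shared_symbols <- names(ids_per_symbol)[ids_per_symbol > 1]
```

Separating not_ensembl from the rest of unmapped distinguishes the two kinds of missing mapping: invalid entries that belong in data cleaning, and valid IDs that the mapping table simply lacks.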

💡 Conclusion and outlook: Selected and cleaned data, normalized and mapped to HUGO symbols