3 Assignment 1

Arianne Beauregard edited this page Apr 7, 2023 · 2 revisions

Assignment 1 - Data set selection and initial Processing

Time

Estimated time: 12h

Actual time: About 12h over several days

Steps

Select a dataset

  • Used the GEOmetadb package to find a dataset
  • Tried the getSQLiteFile() function, but it failed with an error
  • Manually downloaded the database instead from https://gbnci.cancer.gov/geo/GEOmetadb.sqlite.gz
  • Followed the GEOmetadb vignette and the Lecture 3 notes to query it
  • Dataset chosen: GSE155955
    • Borrego, S. L., Fahrmann, J., Hou, J., Lin, D. W., Tromberg, B. J., Fiehn, O., & Kaiser, P. (2021). Lipid remodeling in response to methionine stress in MDA-MBA-468 triple-negative breast cancer cells. Journal of lipid research, 62, 100056. https://doi.org/10.1016/j.jlr.2021.100056
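
Because the manually downloaded GEOmetadb.sqlite is an ordinary SQLite file, the search step can be sketched in any language with a SQLite driver. The following is a minimal sketch in Python rather than the R code actually used for the assignment. It assumes the usual GEOmetadb tables (gse, gse_gpl, gpl) and the technology value 'high-throughput sequencing'; the function name find_series and the keyword filter are hypothetical.

```python
import sqlite3

# Assumed GEOmetadb layout: gse (series), gpl (platforms), gse_gpl (link table).
QUERY = """
SELECT DISTINCT gse.gse, gse.title, gse.submission_date
FROM gse
JOIN gse_gpl ON gse_gpl.gse = gse.gse
JOIN gpl ON gse_gpl.gpl = gpl.gpl
WHERE gpl.technology = 'high-throughput sequencing'
  AND gpl.organism = 'Homo sapiens'
  AND gse.title LIKE ?
ORDER BY gse.submission_date DESC
"""


def find_series(conn, keyword):
    """Return (accession, title, submission_date) for human RNA-seq series
    whose title contains the keyword (hypothetical helper)."""
    return conn.execute(QUERY, (f"%{keyword}%",)).fetchall()
```

With the decompressed file in the working directory, a call might look like `find_series(sqlite3.connect("GEOmetadb.sqlite"), "breast cancer")`; the candidate series still need to be inspected by hand before choosing one.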

Downloading dataset

  • Used GEOquery package

Mapping to HUGO Symbols

  • Dataset included gene symbols and Entrez IDs
  • The dataset contained a few duplicate symbols; on inspection, some of the symbols came from HUGO and others from different sources (e.g., OMIM)
  • Decided to map to both Entrez IDs and HUGO symbols (where a HUGO symbol is available; otherwise the official symbol on NCBI)
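
The mapping decision above can be illustrated with a small sketch. It is not the R code used in the assignment: the function names, the lookup table hgnc_by_entrez, and the gene identifiers are all hypothetical, and the fallback assumes that the symbol shipped with the dataset is the official NCBI symbol.

```python
from collections import Counter


def duplicated(symbols):
    """Return the symbols that occur more than once, sorted."""
    counts = Counter(symbols)
    return sorted(s for s, n in counts.items() if n > 1)


def resolve_symbols(rows, hgnc_by_entrez):
    """Map (entrez_id, symbol) rows to (entrez_id, symbol, source).

    Prefer the HUGO (HGNC) symbol looked up by Entrez ID; otherwise keep the
    dataset's symbol, assumed here to be the official NCBI symbol.
    """
    resolved = []
    for entrez_id, symbol in rows:
        if entrez_id in hgnc_by_entrez:
            resolved.append((entrez_id, hgnc_by_entrez[entrez_id], "HGNC"))
        else:
            resolved.append((entrez_id, symbol, "NCBI"))
    return resolved


# Placeholder identifiers, not real genes.
rows = [("1001", "ABC"), ("1002", "ABC"), ("1003", "XYZ")]
hgnc_by_entrez = {"1001": "ABC1"}
resolved = resolve_symbols(rows, hgnc_by_entrez)
```

Keeping the Entrez ID alongside the symbol means rows stay uniquely identifiable even when two rows end up with the same symbol.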

Normalization

  • Dataset had ERCC RNA controls
  • Followed convention from here (in supplementary materials)
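
For illustration only, one simple way to use spike-in controls is to scale each sample by its total ERCC count relative to the mean across samples. This sketch is an assumption about what spike-in scaling can look like, not necessarily the convention followed in the assignment; the function names and the count table layout are hypothetical.

```python
def is_spike_in(gene_id):
    """ERCC spike-in IDs look like 'ERCC-00002'; the hyphen keeps human
    genes such as ERCC1 from being mistaken for controls."""
    return gene_id.startswith("ERCC-")


def ercc_scaling_factors(counts):
    """counts maps gene_id -> list of per-sample counts.

    Returns one factor per sample: its ERCC total divided by the mean
    ERCC total across samples.
    """
    spike_rows = [values for gene, values in counts.items() if is_spike_in(gene)]
    if not spike_rows:
        raise ValueError("no ERCC spike-in rows found")
    totals = [sum(column) for column in zip(*spike_rows)]
    if any(total == 0 for total in totals):
        raise ValueError("a sample has zero ERCC counts")
    mean_total = sum(totals) / len(totals)
    return [total / mean_total for total in totals]


def normalize_endogenous(counts, factors):
    """Divide endogenous gene counts by the per-sample factors and drop
    the spike-in rows themselves."""
    return {
        gene: [value / factor for value, factor in zip(values, factors)]
        for gene, values in counts.items()
        if not is_spike_in(gene)
    }
```

This kind of scaling relies on the same amount of spike-in having been added to every sample, which is exactly the sort of assumption Evans et al. (2018) recommend checking before choosing a normalization method.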

Submission

To check that the notebook compiles in the course Docker image:

docker run --rm -it -v ${PWD}:/home/rstudio/projects --user rstudio risserlin/bcb420-base-image /usr/local/bin/R -e "rmarkdown::render('/home/rstudio/projects/ArianneChristina_Beauregard/Assignment1/Assignment1.nb.html',output_file='/home/rstudio/projects/test.html')" > processing_output_filename

References

Evans C, Hardin J, Stoebel DM. Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Brief Bioinform. 2018 Sep 28;19(5):776-792. doi: 10.1093/bib/bbx008. PMID: 28334202; PMCID: PMC6171491. (for selecting normalization)