3 Assignment 1

Assignment 1 - Data set selection and initial Processing

Time

Estimated time: 12h

Actual time: About 12h over several days

Steps

Select a dataset

Used the GEOmetadb package to find a dataset
Tried the getSQLiteFile() function, but it failed with an error
Manually downloaded the database instead from https://gbnci.cancer.gov/geo/GEOmetadb.sqlite.gz
Followed vignette from GEOmetadb and lecture3 notes
Dataset chosen: GSE155955
- Borrego, S. L., Fahrmann, J., Hou, J., Lin, D. W., Tromberg, B. J., Fiehn, O., & Kaiser, P. (2021). Lipid remodeling in response to methionine stress in MDA-MBA-468 triple-negative breast cancer cells. Journal of Lipid Research, 62, 100056. https://doi.org/10.1016/j.jlr.2021.100056
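
The search step can be sketched roughly as follows, assuming the manually downloaded GEOmetadb.sqlite.gz has been decompressed to GEOmetadb.sqlite in the working directory. The table and column names (gse, gse_gpl, gpl, technology, organism, submission_date) follow the GEOmetadb schema; the helper name build_query and the filter values are illustrative assumptions, not the exact search that led to GSE155955.

```r
# Sketch of a GEOmetadb search; the filters below are illustrative assumptions.
db_file <- "GEOmetadb.sqlite"  # assumed path of the decompressed download

# Build a query for human high-throughput sequencing series submitted
# after a given date (hypothetical helper, not part of GEOmetadb).
build_query <- function(since = "2015-01-01") {
  paste(
    "SELECT DISTINCT gse.gse, gse.title, gse.submission_date",
    "FROM gse",
    "JOIN gse_gpl ON gse_gpl.gse = gse.gse",
    "JOIN gpl ON gse_gpl.gpl = gpl.gpl",
    "WHERE gpl.technology = 'high-throughput sequencing'",
    "AND gpl.organism LIKE '%Homo sapiens%'",
    sprintf("AND gse.submission_date > '%s'", since),
    "ORDER BY gse.submission_date DESC"
  )
}

# Only query when the database file and the RSQLite package are available.
if (file.exists(db_file) && requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), db_file)
  candidates <- DBI::dbGetQuery(con, build_query())
  print(head(candidates))
  DBI::dbDisconnect(con)
}
```

Querying the local SQLite file directly sidesteps the getSQLiteFile() failure, since nothing needs to be downloaded from within R.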

Downloading dataset

Used GEOquery package
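
A minimal sketch of the download step, assuming the data of interest are the series' supplementary files. GEOquery's getGEOSuppFiles() is a real function; the wrapper fetch_supp_files and the caching behaviour are assumptions added for illustration.

```r
# Sketch: fetch the supplementary files for the chosen series with GEOquery.
# fetch_supp_files is a hypothetical wrapper, not a GEOquery function.
fetch_supp_files <- function(gse = "GSE155955", dest_dir = getwd()) {
  target <- file.path(dest_dir, gse)
  if (!dir.exists(target)) {
    # getGEOSuppFiles() creates a <gse>/ subdirectory under baseDir by default
    GEOquery::getGEOSuppFiles(gse, baseDir = dest_dir)
  }
  # Return the paths of whatever files are in the series directory
  list.files(target, full.names = TRUE)
}

# Needs network access and the GEOquery package, so it is not run here:
# supp_files <- fetch_supp_files()
```

Skipping the download when the directory already exists avoids fetching the files again each time the notebook is rendered. How the files are then read depends on their format.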

Mapping to HUGO Symbols

Dataset included gene symbols and Entrez IDs
The dataset contained a few duplicate symbols; on checking, I found that some of the symbols were from HUGO and some were from other sources (e.g., OMIM)
Decided to map to both Entrez IDs and HUGO symbols (using the HUGO symbol if available; otherwise, the official symbol on NCBI)
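
The duplicate check can be illustrated on a toy annotation table. All identifiers and symbols below are placeholders, not values from GSE155955, and the column names are assumptions.

```r
# Toy annotation table; IDs and symbols are placeholders, not real data.
annot <- data.frame(
  entrez_id = c("101", "102", "103", "104"),
  symbol    = c("GENE_A", "GENE_B", "GENE_C", "GENE_C"),
  stringsAsFactors = FALSE
)

# Symbols that appear on more than one row
dup_symbols <- unique(annot$symbol[duplicated(annot$symbol)])

# Entrez IDs stay unique even when symbols collide, so use them as row keys
stopifnot(!anyDuplicated(annot$entrez_id))
rownames(annot) <- annot$entrez_id

print(dup_symbols)
print(annot[annot$symbol %in% dup_symbols, ])
```

Keeping both identifiers means rows with a shared symbol remain distinguishable by Entrez ID.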

Normalization

Dataset had ERCC RNA controls
Followed convention from here (in supplementary materials)
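
As a rough illustration of how spike-in controls can be used, the sketch below scales each sample by its total ERCC counts. This is one simple control-based approach on made-up numbers and is not necessarily the convention followed in the notebook; the "ERCC-" row-name prefix is an assumption about how the controls are labelled.

```r
# Toy count matrix: rows are features, columns are samples (values made up).
counts <- matrix(
  c(100, 200,
     50, 100,
     10,  20,
     30,  60),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("GENE_A", "GENE_B", "ERCC-00002", "ERCC-00003"),
                  c("sample1", "sample2"))
)

# Separate spike-in rows from endogenous genes by the assumed "ERCC-" prefix
is_ercc <- grepl("^ERCC-", rownames(counts))
ercc_counts <- counts[is_ercc, , drop = FALSE]
gene_counts <- counts[!is_ercc, , drop = FALSE]

# One scaling factor per sample from the spike-in totals,
# centred so that the factors average to 1
ercc_totals  <- colSums(ercc_counts)
size_factors <- ercc_totals / mean(ercc_totals)

# Divide each sample (column) by its factor
normalized <- sweep(gene_counts, 2, size_factors, "/")
print(size_factors)
print(normalized)
```

Here sample2 has twice the spike-in signal of sample1, so its gene counts are scaled down accordingly and both samples end up with the same normalized values.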

Submission

To check that the notebook compiles:

docker run --rm -it -v ${PWD}:/home/rstudio/projects --user rstudio risserlin/bcb420-base-image /usr/local/bin/R -e "rmarkdown::render('/home/rstudio/projects/ArianneChristina_Beauregard/Assignment1/Assignment1.Rmd',output_file='/home/rstudio/projects/test.html')" > processing_output_filename

References

Evans C, Hardin J, Stoebel DM. Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions. Brief Bioinform. 2018 Sep 28;19(5):776-792. doi: 10.1093/bib/bbx008. PMID: 28334202; PMCID: PMC6171491. (for selecting normalization)
