-
Notifications
You must be signed in to change notification settings - Fork 0
A01. Data Selection and Initial Processing
- Find a bulk RNA-seq dataset from GEOmetadb that a) is from the last five years; b) involves native human cells/tissues; c) has large coverage; and d) is high quality
- Clean the data and map to HUGO symbols
- Normalize the data
1.5 hours
When I tried if(!file.exists("GEOmetadb.sqlite")) GEOmetadb::getSQLiteFile()
on my laptop, I obtained the following error message:
I think there is an issue with downloading the file, so instead I'm going to try downloading the file manually and putting it into the relevant location, as suggested by Prof. Isserlin on the discussion board.
Using a modification of the GEOmetadb code provided by Prof. Isserlin, I have chosen to use the following dataset: GSE109607.
After looking at this dataset, I found that the column names of the supplemental counts data is not informative of what cell line/treatment combination the counts refer to. I tried emailing the corresponding author, but got no reply.
I haven chosen the following dataset: GSE152418.
Calculating Summary Statistics
The summary statistics I've chosen to include in the report are: a) the number of samples in the data set; b) the number of genes; c) the number of genes that lack ENSG*** identifiers; and d) the number of unique genes in the dataset. Since the dataset only includes the ensembl IDs, I'm not sure how to identify non-coding RNAs. However, from the ensembl naming system, if the ensembl id starts with ENSG***, it should at least be a gene.
I also calculated numerical summaries (i.e. min, max, mean, median, quantiles), but it doesn't seem too useful other than the fact that the mins for all columns are greater than zero, which could indicate some sort of equipment error. I may include it as justification when I clean the data rather than in the summary statistics? To be decided as of now.
Mapping the data to HUGO symbols
First, I need to choose the parameters for the identifier mapping. Since my genes are human genes (and all have ENSG*** identifiers as seen in the summary statistics), I'm choosing to use the hsapiens_gene_ensembl dataset with the "ensembl" mart. For the filters, I had to choose between ensembl_gene_id
and ensembl_gene_id_version
, since I already knew they all have ENSG*** identifiers from the summary statistics. Thus, I used sum(grepl("\.", colorectal_countdata$Gene))
to see if any of the entries had a version. This resulted in a value of 0, thus I decided to go with ensembl_gene_id
as my filter. Finally, since we're trying to convert Human Ensembl Gene IDs to HGNC symbols, the attributes that were used are ensembl_gene_id
and ensembl_symbol
.
When mapping the unfiltered data to the HGNC symbols, there are 1036 missing HGNC symbols and 15 with duplicated symbols. After filtering for low counts using the cpm()
function from edgeR
, this has gone done to 161 missing symbols (approximately 0.98% of the total) and 1 duplicated symbol.
For the duplicated symbol, the HGNC symbol is PRODH. The two Ensembl IDs that map to it are ENSG00000277196 and ENSG00000133808. Looking at their respective Ensembl pages, they seem to be splice variants as they refer to different transcripts. Thus, I have chosen to combine the two into one entry to reflect the transcripts of PRODH as a whole.
For the missing symbols, my dataset does not provide gene names of their own (i.e. only Ensembl IDs). Since they only represent 0.98% of the dataset, I have decided to remove them as they represent a small fraction of the overall number of genes in the paper. While in lecture, the professor said in practice we could keep these despite being unable to do pathway analysis with them later on, the assignment specifies having non-missing and unique HGNC symbols and thus I have opted to remove them.
Update: I decided to change the version of Ensembl I am using. The paper specified that they used Ensembl Hg38.86, corresponding to the version from Oct 2016. Since this is not currently available, I am using Ensembl from May 2015 (version 80), as it is closest in date. After doing this, there are only 44 missing genes in the dataset, representing ~0.075% of the dataset, and nine duplicated genes.
Once again, since the number of missing genes is very small, I have decided to remove them from the analysis per the specifications of the assignment. For the duplicated genes, I do not want to discard RNA-8S5 because the paper specifically mentions ribosomal RNA. Since rRNA has multiple copies across the genome, this post suggests concatenating these. However, for the other duplications, I have decided to discard them.