Add reference-based gene symbol conversion #14

jashapiro · 2024-11-26T18:31:23Z

After a number of fits and starts with how best to do the gene identifier conversion, I ended up deciding the best way to handle things was to use a set of defined references where the conversion between Ensembl ids and gene symbols could be made uniform and simpler.

This PR includes two main parts:

- a notebook for downloading example datasets from 10x and compiling the reference gene table
- modifications to the conversion scripts to use those references

The table is constructed in a notebook in the data-raw directory, and that notebook also includes some exploration of the conversions. I could probably move the exploration out to a separate notebook and just have a simple script for constructing the table; let me know if you think that would be a better approach. I also store the table only for internal use; I suppose I could make it available outside the package as well, but I wasn't sure whether that would be worth the effort. Maybe it would?

The reference table is built using a 10x 2020 reference that corresponds to Ensembl 98 and the newly released 2024 reference based on Ensembl 110, as well as the default, which is the ScPCA reference. I also include both the "plain" gene symbol conversions and the results from make.unique().
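For anyone less familiar with make.unique(), here is a minimal sketch of how the "plain" and unique symbol columns differ. The gene symbols are invented for illustration and are not taken from any of the reference tables.

```r
# Toy gene symbols with duplicates (illustrative values only)
symbols <- c("GENE1", "GENE2", "GENE1", "GENE1")

# base::make.unique() leaves the first occurrence unchanged and
# appends ".1", ".2", ... to each later repeat
unique_symbols <- make.unique(symbols)
print(unique_symbols)
```

Because the suffix depends on the order in which duplicates appear, computing the unique values once over the whole reference keeps them stable no matter which subset of ids is converted later.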

For the conversion scripts, I kept the ability to convert based on an SCE object, but moved the primary conversion for Ensembl id lists to be based on the reference table. When converting an SCE object, I left the internal row data as the default source for conversion. There are options for using each of the different references as well as whether to make the results unique (which will always use the precomputed make.unique values). Let me know if you think there are other options that would be good to include, or any thoughts you might have on how we might change this for easier use.
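As a rough sketch of the table-based lookup described above (not the code in this PR): the table layout, the column names, the `lookup_symbols()` helper, and the fallback to the original id for unmatched entries are all assumptions made for illustration.

```r
# Hypothetical reference table; column names and values are assumptions
gene_table <- data.frame(
  gene_ids = c("ENSG00000000001", "ENSG00000000002", "ENSG00000000003"),
  symbol_ref = c("GENEA", "GENEB", "GENEA"),
  stringsAsFactors = FALSE
)
# Precompute the unique symbols once over the full table
gene_table$symbol_ref_unique <- make.unique(gene_table$symbol_ref)

# Hypothetical helper: look up symbols by Ensembl id, optionally
# returning the precomputed unique values instead of the plain ones
lookup_symbols <- function(ensembl_ids, table, use_unique = FALSE) {
  symbol_col <- if (use_unique) "symbol_ref_unique" else "symbol_ref"
  # match() gives NA for ids that are missing from the table
  symbols <- table[[symbol_col]][match(ensembl_ids, table$gene_ids)]
  # Assumed behavior: keep the original id when no symbol is found
  ifelse(is.na(symbols), ensembl_ids, symbols)
}

ids <- c("ENSG00000000003", "ENSG00000000001", "ENSG00000000099")
plain_symbols <- lookup_symbols(ids, gene_table)
unique_symbols <- lookup_symbols(ids, gene_table, use_unique = TRUE)
```

In this toy case the plain lookup returns "GENEA" for both matched ids, while the unique lookup distinguishes them as "GENEA.1" and "GENEA"; the unmatched id is passed through unchanged in both.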

I expect we will want to add a bit more to the docs about the different references and when you might favor one or another, but I wanted to get the code in its current state up for discussion.

sjspielman

I've done a super cursory review here with some initial thoughts, but I've done so at the end of the day and I very much plan to have a look with fresh eyes tomorrow! Some of my reviews can also be described as "how does someone without fresh eyes approach the notebook?"

And for now, one question:

> I also store the table only for internal use; I suppose I could make it available outside the package as well, but I wasn't sure whether that would be worth the effort. Maybe it would?

Wondering where else you might envision making it available... like a public S3 bucket?

sjspielman · 2024-11-26T18:55:46Z

R/convert-gene-ids.R

+#' simple conversion of Ensembl gene ids to gene symbols based on either the
+#' ScPCA reference gene list or a 10x reference gene list as used by Cell Ranger.


Or an SCE, it seems

sjspielman · 2024-11-26T19:00:59Z

R/convert-gene-ids.R

+  } else {
+    ensembl_ids <- rownames(sce)
+  }
+  if (!all(startsWith(ensembl_ids, "ENSG"))) {


If we want to keep this flexible for non-human in the future, this could be "ENS".
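A quick sketch of how the two prefixes behave, using made-up ids; the mouse-style ids are only there to illustrate a non-human Ensembl gene prefix.

```r
# Made-up ids for illustration: human gene ids start with "ENSG",
# mouse gene ids start with "ENSMUSG"
human_ids <- c("ENSG00000000001", "ENSG00000000002")
mouse_ids <- c("ENSMUSG00000000001", "ENSMUSG00000000002")

# The stricter prefix accepts only the human-style ids
human_strict <- all(startsWith(human_ids, "ENSG"))
mouse_strict <- all(startsWith(mouse_ids, "ENSG"))

# The looser prefix accepts both
mouse_loose <- all(startsWith(mouse_ids, "ENS"))

print(c(human_strict, mouse_strict, mouse_loose))
```

The trade-off is that "ENS" also matches other Ensembl id types, such as transcript ids, so it is a weaker sanity check.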

sjspielman · 2024-11-26T19:02:37Z