-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add reference-based gene symbol conversion #14
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done a super cursory review here with some initial thoughts, but I've done so at the end of the day and I very much plan to have look with fresh eyes tomorrow! Some of my reviews can also be described as "how does someone without fresh eyes approach the notebook?"
And for now, one question:
I also store the table only for internal use; I suppose I could make it available outside the package as well, but I wasn't sure whether that would be worth the effort. Maybe it would?
Wondering where else you might envision.. like a public S3 bucket?
#' simple conversion of Ensembl gene ids to gene symbols based on either the | ||
#' ScPCA reference gene list or a 10x reference gene list as used by Cell Ranger. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or an SCE, it seems
R/convert-gene-ids.R
Outdated
} else { | ||
ensembl_ids <- rownames(sce) | ||
} | ||
if (!all(startsWith(ensembl_ids, "ENSG"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to keep this flexible for non-human in the future, this could be "ENS"
.
data-raw/build_gene_reference.Rmd
Outdated
|
||
## Look at some stats for the tables | ||
|
||
We expect that many of the Ensembl IDs in the ScPCA data will not have corresponding values in the 10x references, but some of the 10x references have Ensembl IDs not present in the ScPCA data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We expect that many of the Ensembl IDs in the ScPCA data will not have corresponding values in the 10x references, but some of the 10x references have Ensembl IDs not present in the ScPCA data. | |
We expect that many of the Ensembl IDs in the ScPCA data will not have corresponding values in the 10x references, but some of the 10x references will have Ensembl IDs not present in the ScPCA data. |
data-raw/build_gene_reference.Rmd
Outdated
|
||
### BAC removal | ||
|
||
In Ensembl version 104, the version we are using for ScPCA, BAC-based gene IDs were removed and replaced with simply the Ensembl gene ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One does not simply replace with the Ensembl gene ID.
this is a recreational-for-fun review comment as I do not mind the use of "simply" 😉 😅
data-raw/build_gene_reference.Rmd
Outdated
``` | ||
|
||
|
||
## Look at some stats for the tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point later" I got a little bit lost in keeping in my brain "are we checking for scpca ids not in 10x or vice versa?" But, I think you've written everything clearly! Just my brain keeping track, albeit late in a long day....
Either way, maybe a slightly more informative header here (and/or subheaders below?) might help future brains stay on top of what's going on.
data-raw/build_gene_reference.Rmd
Outdated
) | ||
``` | ||
|
||
More than I expected, but it looks like most of these are cases where the gene symbol has been updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see about 900 rows here... how did you make this conclusion about "most cases"?
data-raw/build_gene_reference.Rmd
Outdated
``` | ||
|
||
More than I expected, but it looks like most of these are cases where the gene symbol has been updated. | ||
What we are really interested in is the cases where the process of making the gene symbol unique had different effects: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we are really interested in is the cases where the process of making the gene symbol unique had different effects: | |
What we are really interested in are the cases where the process of making the gene symbol unique had different effects: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate just a bit on "different effects"?
What I meant here was that the way I include the table |
Aha. It might be worth exporting since it is a useful piece of information but that said I can't immediately think of applications besides gene conversion as performed within rOpenScPCA here. Maybe something to circle back to if/when it becomes clear this information is otherwise helpful? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second round of review (or first complete round of review, depending on your perspective) has landed!
The table construction happens in a notebook in the
data-raw
, which also includes some exploration of the conversions. I could probably move the exploration out to a separate notebook and just have a simple script for constructing the table; let me know if you think that would be a better approach.
Given how "brief" the exploration steps are, I think this is fine. If the exploration was a big lift I'd definitely say split it up. But, I did leave another comment that maybe code could be firmed up with a download script, an exploration notebook, and a wrapper shell script. I don't really feel strongly though.
There are options for using each of the different references as well as whether to make the results unique (which will always use the precomputed
make.unique
values). Let me know if you think there are other options that would be good to include, or any thoughts you might have on how we might change this for easier use.
I think the usage is pretty straight-forward.
Something that might, but really only might!, bother me is that one explicitly indicates sce
as a reference for sce_to_symbols()
but instead that is implicit for ensembl_to_symbol()
if an SCE is provided. At a minimum, I'd print a warning/message if the reference is being ignored (which I left in an inline comment), but I'd also consider adding "sce"
to the default reference vector in ensembl_to_symbol()
. In this case, you'd have to tweak the opening stopifnot()
a smidge, but only a smidge.
Overall, I feel strongly about a message to be clear that reference is being ignored, but not as strongly about changing the argument options, so you should modify if/as you see fit.
One additional use case I can envision is if anyone wants to provide their own mapping. I can potentially see this happening if someone wants to precisely match Seurat's mapping, for example, but on the other hand that might be behavior we do not want to accommodate in rOpenScpCA
! Currently, the only way one would do that is to add that information to the SCE object of interest before running functions (or they'd have to write it themselves, which is fine for a more unique use case imo). So, an additional aspect here could be to allow for a TSV with a custom mapping of gene ids and gene symbols, but it's not something I think should be prioritized.
@sjspielman I updated this PR to split the script to build the data from the notebook to evaluate it. I believe I also addressed all of your comments, adding a message when an SCE is being used as the reference, fixing spacing, and updating docs. I also made the data an exported object, which means the contents are now documented as well. Should be ready for another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few more small comments, but this is about there! No big comments at this point.
R/data.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
data-raw/README.md
Outdated
It starts with the example ScPCA-formatted SCE object stored in test data, extracting the table of gene ids and gene symbols. | ||
This is combined with the reference information extracted from example 10x Genomics datasets. | ||
The full table is saved in `data/scpca_gene_reference.rda`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It starts with the example ScPCA-formatted SCE object stored in test data, extracting the table of gene ids and gene symbols. | |
This is combined with the reference information extracted from example 10x Genomics datasets. | |
The full table is saved in `data/scpca_gene_reference.rda`. | |
The script extracts the table of gene ids and gene symbols from the example ScPCA-formatted SCE object and combines it with the reference information extracted from example 10x Genomics datasets. | |
The full table is saved in `data/scpca_gene_reference.rda`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take it or leave it, your text is also fine.
actually though, should we say "example sce" or "test sce"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might also say that re-running this script will overwrite the data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the text
``` | ||
|
||
|
||
# Load the R data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Load the R data | |
# Load the R data | |
) | ||
``` | ||
|
||
This definitely happens more than I would have hoped, though my hope was probably unreasonable, as 14 cases is not bad. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
14 seems pretty good tbh, or maybe my hopes were unreasonable in the opposite direction 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add some text at the top of this notebook to explicitly state as a reminder that you're looking at 2020, not 2024, throughout, because 2020 wasn't using Ensembl yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, is it also worth adding a section to confirm that 2024 and scpca are already as similar as we expect, since we have a notebook anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't do that at this stage because I couldn't actually find any real datasets that use the new reference. I don't think this level of analysis is necessary at this time.
# read in SCE for testing | ||
sce <- readRDS(test_path("data", "scpca_sce.rds")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether it's starting to get worthwhile to have a setup.R
script that at least load the sce file? Separate PR, if at all, though.
Co-authored-by: Stephanie Spielman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sjspielman I believe I addressed your comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't do that at this stage because I couldn't actually find any real datasets that use the new reference. I don't think this level of analysis is necessary at this time.
data-raw/README.md
Outdated
It starts with the example ScPCA-formatted SCE object stored in test data, extracting the table of gene ids and gene symbols. | ||
This is combined with the reference information extracted from example 10x Genomics datasets. | ||
The full table is saved in `data/scpca_gene_reference.rda`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Just two things (but I don't need to see again!):
- I do feel fairly strongly that the notebook and or the
data-raw/README.md
need(s) to plainly state that the notebook is only comparing scpca and the 2020 10x reference, not 2024. Currently, the first notebook text makes it sound like you're setting up to do more comparisons ("but some of the 10x references have Ensembl IDs not present in the ScPCA data."), and similarly the README says "among" (vs between) when only 2 are being compared. - The notebook needs a new title since it is no longer setting up the conversion which was moved to the script.
data-raw/explore_gene_reference.Rmd
Outdated
@@ -0,0 +1,120 @@ | |||
--- | |||
title: "Gene ID Conversion data setup" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exploration, not setup
After a number of fits and starts with how best to do the gene identifier conversion, I ended up deciding the best way to handle things was to use a set of defined references where the conversion between Ensembl ids and gene symbols could be made uniform and simpler.
This PR includes two main parts:
The table construction happens in a notebook in the
data-raw
, which also includes some exploration of the conversions. I could probably move the exploration out to a separate notebook and just have a simple script for constructing the table; let me know if you think that would be a better approach. I also store the table only for internal use; I suppose I could make it available outside the package as well, but I wasn't sure whether that would be worth the effort. Maybe it would?The reference table is built using both a 10x 2020 reference that corresponds to Ensembl 98, and the newly release 2024 reference from Ensembl 110, as well as the default which is the ScPCA reference. I also include both the "plain" gene symbol conversions and the results from
make.unique()
For the conversion scripts, I kept the ability to convert based on an SCE object, but moved the primary conversion for ensembl id lists to be based on the reference table. In the case of converting an SCE object, I left the internal row rata as the default source for conversion). There are options for using each of the different references as well as whether to make the results unique (which will always use the precomputed
make.unique
values). Let me know if you think there are other options that would be good to include, or any thoughts you might have on how we might change this for easier use.I expect we will want to add a bit more to the docs about the different references and when you might favor one or another, but I wanted to get the code in its current state up for discussion.