Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reference-based gene symbol conversion #14

Merged
merged 16 commits into from
Dec 3, 2024
Merged

Conversation

jashapiro
Copy link
Member

After a number of fits and starts with how best to do the gene identifier conversion, I ended up deciding the best way to handle things was to use a set of defined references where the conversion between Ensembl ids and gene symbols could be made uniform and simpler.

This PR includes two main parts:

  • a notebook for downloading example datasets from 10x and compiling the reference gene table
  • modifications to the conversion scripts to use those references

The table construction happens in a notebook in the data-raw, which also includes some exploration of the conversions. I could probably move the exploration out to a separate notebook and just have a simple script for constructing the table; let me know if you think that would be a better approach. I also store the table only for internal use; I suppose I could make it available outside the package as well, but I wasn't sure whether that would be worth the effort. Maybe it would?

The reference table is built using both a 10x 2020 reference that corresponds to Ensembl 98, and the newly release 2024 reference from Ensembl 110, as well as the default which is the ScPCA reference. I also include both the "plain" gene symbol conversions and the results from make.unique()

For the conversion scripts, I kept the ability to convert based on an SCE object, but moved the primary conversion for ensembl id lists to be based on the reference table. In the case of converting an SCE object, I left the internal row rata as the default source for conversion). There are options for using each of the different references as well as whether to make the results unique (which will always use the precomputed make.unique values). Let me know if you think there are other options that would be good to include, or any thoughts you might have on how we might change this for easier use.

I expect we will want to add a bit more to the docs about the different references and when you might favor one or another, but I wanted to get the code in its current state up for discussion.

@jashapiro jashapiro requested a review from sjspielman November 26, 2024 18:31
Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a super cursory review here with some initial thoughts, but I've done so at the end of the day and I very much plan to have look with fresh eyes tomorrow! Some of my reviews can also be described as "how does someone without fresh eyes approach the notebook?"

And for now, one question:

I also store the table only for internal use; I suppose I could make it available outside the package as well, but I wasn't sure whether that would be worth the effort. Maybe it would?

Wondering where else you might envision.. like a public S3 bucket?

Comment on lines +6 to +7
#' simple conversion of Ensembl gene ids to gene symbols based on either the
#' ScPCA reference gene list or a 10x reference gene list as used by Cell Ranger.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or an SCE, it seems

} else {
ensembl_ids <- rownames(sce)
}
if (!all(startsWith(ensembl_ids, "ENSG"))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to keep this flexible for non-human in the future, this could be "ENS".


## Look at some stats for the tables

We expect that many of the Ensembl IDs in the ScPCA data will not have corresponding values in the 10x references, but some of the 10x references have Ensembl IDs not present in the ScPCA data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
We expect that many of the Ensembl IDs in the ScPCA data will not have corresponding values in the 10x references, but some of the 10x references have Ensembl IDs not present in the ScPCA data.
We expect that many of the Ensembl IDs in the ScPCA data will not have corresponding values in the 10x references, but some of the 10x references will have Ensembl IDs not present in the ScPCA data.

data-raw/build_gene_reference.Rmd Outdated Show resolved Hide resolved

### BAC removal

In Ensembl version 104, the version we are using for ScPCA, BAC-based gene IDs were removed and replaced with simply the Ensembl gene ID.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One does not simply replace with the Ensembl gene ID.

this is a recreational-for-fun review comment as I do not mind the use of "simply" 😉 😅

```


## Look at some stats for the tables
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point later" I got a little bit lost in keeping in my brain "are we checking for scpca ids not in 10x or vice versa?" But, I think you've written everything clearly! Just my brain keeping track, albeit late in a long day....

Either way, maybe a slightly more informative header here (and/or subheaders below?) might help future brains stay on top of what's going on.

)
```

More than I expected, but it looks like most of these are cases where the gene symbol has been updated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see about 900 rows here... how did you make this conclusion about "most cases"?

```

More than I expected, but it looks like most of these are cases where the gene symbol has been updated.
What we are really interested in is the cases where the process of making the gene symbol unique had different effects:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
What we are really interested in is the cases where the process of making the gene symbol unique had different effects:
What we are really interested in are the cases where the process of making the gene symbol unique had different effects:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate just a bit on "different effects"?

@jashapiro
Copy link
Member Author

And for now, one question:

I also store the table only for internal use; I suppose I could make it available outside the package as well, but I wasn't sure whether that would be worth the effort. Maybe it would?

Wondering where else you might envision.. like a public S3 bucket?

What I meant here was that the way I include the table scpca_gene_reference in the package means that it is an internal object. It could be accessed as rOpenScPCA:::scpca_gene_reference but not with only :: or by loading the package with library().

@sjspielman
Copy link
Member

What I meant here was that the way I include the table scpca_gene_reference in the package means that it is an internal object. It could be accessed as rOpenScPCA:::scpca_gene_reference but not with only :: or by loading the package with library().

Aha. It might be worth exporting since it is a useful piece of information but that said I can't immediately think of applications besides gene conversion as performed within rOpenScPCA here. Maybe something to circle back to if/when it becomes clear this information is otherwise helpful?

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second round of review (or first complete round of review, depending on your perspective) has landed!

The table construction happens in a notebook in the data-raw, which also includes some exploration of the conversions. I could probably move the exploration out to a separate notebook and just have a simple script for constructing the table; let me know if you think that would be a better approach.

Given how "brief" the exploration steps are, I think this is fine. If the exploration was a big lift I'd definitely say split it up. But, I did leave another comment that maybe code could be firmed up with a download script, an exploration notebook, and a wrapper shell script. I don't really feel strongly though.

There are options for using each of the different references as well as whether to make the results unique (which will always use the precomputed make.unique values). Let me know if you think there are other options that would be good to include, or any thoughts you might have on how we might change this for easier use.

I think the usage is pretty straight-forward.

Something that might, but really only might!, bother me is that one explicitly indicates sce as a reference for sce_to_symbols() but instead that is implicit for ensembl_to_symbol() if an SCE is provided. At a minimum, I'd print a warning/message if the reference is being ignored (which I left in an inline comment), but I'd also consider adding "sce" to the default reference vector in ensembl_to_symbol(). In this case, you'd have to tweak the opening stopifnot() a smidge, but only a smidge.
Overall, I feel strongly about a message to be clear that reference is being ignored, but not as strongly about changing the argument options, so you should modify if/as you see fit.

One additional use case I can envision is if anyone wants to provide their own mapping. I can potentially see this happening if someone wants to precisely match Seurat's mapping, for example, but on the other hand that might be behavior we do not want to accommodate in rOpenScpCA! Currently, the only way one would do that is to add that information to the SCE object of interest before running functions (or they'd have to write it themselves, which is fine for a more unique use case imo). So, an additional aspect here could be to allow for a TSV with a custom mapping of gene ids and gene symbols, but it's not something I think should be prioritized.

tests/testthat/test-convert-gene-ids.R Outdated Show resolved Hide resolved
R/convert-gene-ids.R Show resolved Hide resolved
R/convert-gene-ids.R Outdated Show resolved Hide resolved
R/convert-gene-ids.R Outdated Show resolved Hide resolved
R/convert-gene-ids.R Outdated Show resolved Hide resolved
data-raw/build_gene_reference.nb.html Outdated Show resolved Hide resolved
@jashapiro
Copy link
Member Author

jashapiro commented Dec 2, 2024

@sjspielman I updated this PR to split the script to build the data from the notebook to evaluate it. I believe I also addressed all of your comments, adding a message when an SCE is being used as the reference, fixing spacing, and updating docs. I also made the data an exported object, which means the contents are now documented as well.

Should be ready for another look.

@jashapiro jashapiro requested a review from sjspielman December 2, 2024 19:52
Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few more small comments, but this is about there! No big comments at this point.

R/data.R Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 7 to 9
It starts with the example ScPCA-formatted SCE object stored in test data, extracting the table of gene ids and gene symbols.
This is combined with the reference information extracted from example 10x Genomics datasets.
The full table is saved in `data/scpca_gene_reference.rda`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It starts with the example ScPCA-formatted SCE object stored in test data, extracting the table of gene ids and gene symbols.
This is combined with the reference information extracted from example 10x Genomics datasets.
The full table is saved in `data/scpca_gene_reference.rda`.
The script extracts the table of gene ids and gene symbols from the example ScPCA-formatted SCE object and combines it with the reference information extracted from example 10x Genomics datasets.
The full table is saved in `data/scpca_gene_reference.rda`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take it or leave it, your text is also fine.

actually though, should we say "example sce" or "test sce"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might also say that re-running this script will overwrite the data

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the text

```


# Load the R data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Load the R data
# Load the R data

data-raw/explore_gene_reference.Rmd Outdated Show resolved Hide resolved
data-raw/explore_gene_reference.Rmd Outdated Show resolved Hide resolved
)
```

This definitely happens more than I would have hoped, though my hope was probably unreasonable, as 14 cases is not bad.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

14 seems pretty good tbh, or maybe my hopes were unreasonable in the opposite direction 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add some text at the top of this notebook to explicitly state as a reminder that you're looking at 2020, not 2024, throughout, because 2020 wasn't using Ensembl yet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, is it also worth adding a section to confirm that 2024 and scpca are already as similar as we expect, since we have a notebook anyways?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do that at this stage because I couldn't actually find any real datasets that use the new reference. I don't think this level of analysis is necessary at this time.

Comment on lines -1 to -2
# read in SCE for testing
sce <- readRDS(test_path("data", "scpca_sce.rds"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether it's starting to get worthwhile to have a setup.R script that at least load the sce file? Separate PR, if at all, though.

Copy link
Member Author

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjspielman I believe I addressed your comments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do that at this stage because I couldn't actually find any real datasets that use the new reference. I don't think this level of analysis is necessary at this time.

Comment on lines 7 to 9
It starts with the example ScPCA-formatted SCE object stored in test data, extracting the table of gene ids and gene symbols.
This is combined with the reference information extracted from example 10x Genomics datasets.
The full table is saved in `data/scpca_gene_reference.rda`.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the text

@jashapiro jashapiro requested a review from sjspielman December 3, 2024 14:19
Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
Just two things (but I don't need to see again!):

  • I do feel fairly strongly that the notebook and or the data-raw/README.md need(s) to plainly state that the notebook is only comparing scpca and the 2020 10x reference, not 2024. Currently, the first notebook text makes it sound like you're setting up to do more comparisons ("but some of the 10x references have Ensembl IDs not present in the ScPCA data."), and similarly the README says "among" (vs between) when only 2 are being compared.
  • The notebook needs a new title since it is no longer setting up the conversion which was moved to the script.

@@ -0,0 +1,120 @@
---
title: "Gene ID Conversion data setup"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exploration, not setup

@jashapiro jashapiro merged commit 6e85ffc into main Dec 3, 2024
2 checks passed
@jashapiro jashapiro deleted the jashapiro/dedup-symbols branch December 3, 2024 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants