-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add functions for Seurat conversion #15
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pre-review! All I did was read over the code to get an overall sense of what's here, and left very smol comments along the way. I still need to actually review & run the code next week.
#' @param dedup_method Method to handle duplicated gene symbols. If `unique`, | ||
#' the gene symbols will be made unique following standard Seurat procedures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a Seurat version to specify here, or do they all (to our knowledge) follow the same approach?
R/make-seurat.R
Outdated
} else { | ||
create_seurat_assay <- SeuratObject::CreateAssayObject | ||
sobj[["RNA"]] <- as(sobj[["RNA"]], "Assay") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see specifying "v4"
(or anything else) would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it wouldn't as match.arg()
will fail.
Co-authored-by: Stephanie Spielman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick look. Made a few quick updates.
R/make-seurat.R
Outdated
} else { | ||
create_seurat_assay <- SeuratObject::CreateAssayObject | ||
sobj[["RNA"]] <- as(sobj[["RNA"]], "Assay") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it wouldn't as match.arg()
will fail.
update with "latest" test data, be better with paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've now gone over the code much more carefully and run it in a variety of conditions! I think you've done a pretty thorough job here; I can't think of much more ground to cover given how much ground already has been covered! have a 🌮.
While running the code, I confirmed that only processed
(not (un)filtered
) objects work with the code as is now, and similarly merged objects are a no-go. We should probably:
- Add to docs that processed only & no merged
- Add some checks for this at the top of the function to bail early with an informative error if it's not a single processed library
I'm frankly not convinced we ever want to accommodate (un)filtered
, but merged I think would be useful to accommodate. If you agree, we should open that issue and come back to it.
Some specific feedback for your opening questions:
adding the spliced counts as a separate assay. The latter could maybe be an option, but it seemed easier to just do it for completeness.
Yeah, just do it.
I do a fair amount of pre-conversion to avoid some of the Seurat warnings about the names of various things; I feel fine about the PCA/UMAP conversion, as that is pretty clear and simple. However, I'm a bit less sure if I want to keep the gene name conversion where
_
is replaced by-
without a message. I like having the option that I added to the id conversion function, but I am of two minds about whether that kind of name change should result in a message to the user when converting a full object.
I think the message is reasonable since it mirrors the kind of message you'd get from Seurat if we didn't do that conversion ourselves. But, if we keep it, I'd make it indeed a message()
(not warning()
which it currently is), in part b/c that's also what Seurat does, but mostly because it's an expected change that happens when moving into Seurat-land which suggests message > warning.
I also need to add some tests to be sure
altExps
are converted as expected; I think they are, but have not formally tested it.
I gather this comment is now outdated given there are some tests! I did run the code with ScPCA CITE-seq and multiplexed SCEs successfully, so that's good! I did leave one in-line comment related to some of the behavior here too, but nothing major.
|
||
# Downloads a test data file from OpenScPCA and places it in the test directory | ||
|
||
setwd(here::here()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally I prefer to avoid setwd()
if we can, but given the role and scope of this script, I don't really mind it here. I see (from the build_gene_reference.R
script) you already had some experience which convinced you to use setwd()
rather than tossing it all into here::here()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, for some reason test_path
wasn't playing well with here()
; I tried to do some nesting but it wasn't working suggestions welcome.
R/sum-duplicate-genes.R
Outdated
#' @param sce a SingleCellExperiment object with duplicated row names | ||
#' @param normalize a logical indicating whether to normalize the expression | ||
#' values. Default is TRUE | ||
#' @param recalculate_reduced_dims a logical indicating whether to recalculate | ||
#' PCA and UMAP. If FALSE, the input reduced dimensions are copied over. If | ||
#' TRUE, the highly variable genes are also recalculated with the new values | ||
#' stored in metadata. Default is FALSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency with other docs...
#' @param sce a SingleCellExperiment object with duplicated row names | |
#' @param normalize a logical indicating whether to normalize the expression | |
#' values. Default is TRUE | |
#' @param recalculate_reduced_dims a logical indicating whether to recalculate | |
#' PCA and UMAP. If FALSE, the input reduced dimensions are copied over. If | |
#' TRUE, the highly variable genes are also recalculated with the new values | |
#' stored in metadata. Default is FALSE | |
#' @param sce a SingleCellExperiment object with duplicated row names. | |
#' @param normalize a logical indicating whether to normalize the expression | |
#' values. Default is TRUE. | |
#' @param recalculate_reduced_dims a logical indicating whether to recalculate | |
#' PCA and UMAP. If FALSE, the input reduced dimensions are copied over. If | |
#' TRUE, the highly variable genes are also recalculated with the new values | |
#' stored in metadata. Default is FALSE. |
@@ -85,10 +89,17 @@ ensembl_to_symbol <- function( | |||
) | |||
} | |||
if (!leave_na && any(missing_symbols)) { | |||
warning("Not all input ids have corresponding gene symbols, using input ids for missing values.") | |||
message("Not all input ids have corresponding gene symbols, using input ids for missing values.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
R/sum-duplicate-genes.R
Outdated
#' Genes with the same name are merged by summing their raw expression counts. | ||
#' If requested, the log-normalized expression values are recalculated, otherwise | ||
#' this is left blank. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add a smidge more here about why this function is helpful, since it's exported (which I agree with)
sobj[[alt_exp_name]] <- create_seurat_assay(counts = counts(alt_exp)) | ||
if ("logcounts" %in% assayNames(alt_exp)) { | ||
sobj[[alt_exp_name]]$data <- logcounts(alt_exp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth noting that these lines may both give the Seurat warning
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
I got this when running a multiplexed library through, so this was coming from the MULTI_X
names. We might want to wrangle the altexps a bit more first if we want to avoid this Seurat warning. I think would be good since it's not very transparent to users that this is where the warning comes from, and we might add our own warning (message?) instead to be clearer what's changing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added conversion for altExps with a warning.
I actually would go the other way, and want unfiltered or at least filtered data to work, but I think merged data are a whole other kettle that I have no particular desire to support, as I have yet to see anyone working with them, let alone converting them.
I think you get a warning from Seurat, which I why I used a warning. It does seem severe enough to warrant a warning. |
fine with me, but fyi, i think this is the spot https://github.com/satijalab/seurat/blob/1549dcb3075eaeac01c925c4b4bb73c73450fc50/R/utilities.R#L2885 |
It's actually from here: https://github.com/satijalab/seurat-object/blob/e840bab6f1a9220104005a8043a087734fa02903/R/assay.R#L643 |
Co-authored-by: Stephanie Spielman <[email protected]>
unfiltered too should work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be ready for another look when you get a chance.
I fixed handling of less processed (filtered and unfiltered objects) as well as better handling of altExps that might have underscores in the feature names. I also updated docs.
I did not do anything about merged data; I wasn't sure what the best method for identifying merged data might be, so I am basically punting on that problem for now.
sobj[[alt_exp_name]] <- create_seurat_assay(counts = counts(alt_exp)) | ||
if ("logcounts" %in% assayNames(alt_exp)) { | ||
sobj[[alt_exp_name]]$data <- logcounts(alt_exp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added conversion for altExps with a warning.
there were a bunch of unused packages in there for some reason, which will slow testing.
} else { | ||
data_name <- NULL | ||
} | ||
sobj <- Seurat::as.Seurat(sce, data = data_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests clearly pass, but I'm unable to run sce_to_seurat()
standalone and the error is showing up at this line, both with a processed and filtered SCE. Attaching Seurat(Object) didn't help. Are you able to run this outside of the test suite?
Error in UseMethod(generic = "DefaultAssay<-", object = object) :
no applicable method for 'DefaultAssay<-' applied to an object of class "call"
> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS 15.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods base
other attached packages:
[1] rOpenScPCA_0.1.0 SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
[4] Biobase_2.64.0 GenomicRanges_1.56.1 GenomeInfoDb_1.40.1
[7] IRanges_2.38.1 S4Vectors_0.42.1 BiocGenerics_0.50.0
[10] MatrixGenerics_1.16.0 matrixStats_1.4.1 Seurat_5.1.0
[13] SeuratObject_5.0.2 sp_2.1-4 testthat_3.2.2
loaded via a namespace (and not attached):
[1] RcppAnnoy_0.0.22 splines_4.4.0 later_1.3.2
[4] tibble_3.2.1 polyclip_1.10-7 fastDummies_1.7.4
[7] lifecycle_1.0.4 rprojroot_2.0.4 globals_0.16.3
[10] lattice_0.22-6 MASS_7.3-60.2 magrittr_2.0.3
[13] plotly_4.10.4 yaml_2.3.10 remotes_2.5.0
[16] httpuv_1.6.15 sctransform_0.4.1 spam_2.10-0
[19] sessioninfo_1.2.2 pkgbuild_1.4.4 spatstat.sparse_3.1-0
[22] reticulate_1.39.0 cowplot_1.1.3 pbapply_1.7-2
[25] RColorBrewer_1.1-3 abind_1.4-5 pkgload_1.4.0
[28] zlibbioc_1.50.0 Rtsne_0.17 purrr_1.0.2
[31] GenomeInfoDbData_1.2.12 ggrepel_0.9.6 irlba_2.3.5.1
[34] listenv_0.9.1 spatstat.utils_3.1-0 goftest_1.2-3
[37] RSpectra_0.16-2 spatstat.random_3.3-1 fitdistrplus_1.2-1
[40] parallelly_1.38.0 leiden_0.4.3.1 codetools_0.2-20
[43] DelayedArray_0.30.1 tidyselect_1.2.1 UCSC.utils_1.0.0
[46] pdfCluster_1.0-4 spatstat.explore_3.3-2 jsonlite_1.8.8
[49] BiocNeighbors_1.22.0 ellipsis_0.3.2 progressr_0.14.0
[52] ggridges_0.5.6 survival_3.7-0 tools_4.4.0
[55] ica_1.0-3 Rcpp_1.0.13 glue_1.7.0
[58] gridExtra_2.3 SparseArray_1.4.8 usethis_3.0.0
[61] dplyr_1.1.4 withr_3.0.2 BiocManager_1.30.25
[64] fastmap_1.2.0 bluster_1.14.0 fansi_1.0.6
[67] digest_0.6.37 R6_2.5.1 mime_0.12
[70] colorspace_2.1-1 scattermore_1.2 tensor_1.5
[73] spatstat.data_3.1-2 utf8_1.2.4 tidyr_1.3.1
[76] generics_0.1.3 renv_1.0.11 data.table_1.16.0
[79] httr_1.4.7 htmlwidgets_1.6.4 S4Arrays_1.4.1
[82] uwot_0.2.2 pkgconfig_2.0.3 gtable_0.3.5
[85] lmtest_0.9-40 XVector_0.44.0 brio_1.1.5
[88] htmltools_0.5.8.1 profvis_0.4.0 dotCall64_1.1-1
[91] scales_1.3.0 png_0.1-8 spatstat.univar_3.0-1
[94] geometry_0.5.0 rstudioapi_0.17.1 reshape2_1.4.4
[97] nlme_3.1-164 magic_1.6-1 cachem_1.1.0
[100] zoo_1.8-12 stringr_1.5.1 KernSmooth_2.23-24
[103] parallel_4.4.0 miniUI_0.1.1.1 desc_1.4.3
[106] pillar_1.9.0 grid_4.4.0 vctrs_0.6.5
[109] RANN_2.6.2 urlchecker_1.0.1 promises_1.3.0
[112] xtable_1.8-4 cluster_2.1.6 waldo_0.6.1
[115] cli_3.6.3 compiler_4.4.0 rlang_1.1.4
[118] crayon_1.5.3 future.apply_1.11.2 plyr_1.8.9
[121] fs_1.6.4 stringi_1.8.4 deldir_2.0-4
[124] viridisLite_0.4.2 BiocParallel_1.38.0 munsell_0.5.1
[127] lazyeval_0.2.2 devtools_2.4.5 spatstat.geom_3.3-2
[130] Matrix_1.7-0 RcppHNSW_0.6.0 patchwork_1.2.0
[133] future_1.34.0 ggplot2_3.5.1 shiny_1.9.1
[136] ROCR_1.0-11 igraph_2.0.3 memoise_2.0.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
914e500 seems to have done it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably related to an renv
bug in the latest version; do you have a variable named object
in your environment? (Seurat must be doing something bad here too, because it shouldn't read a global variable wherever it is doing that, but I can't work around all the Seurat bugs...) It should not affect running the function if you load the package from an external source. I just pushed a temporary fix until renv
update comes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have a variable named object in your environment?
No definitely not.
...But I just went back to that commit hash and re-ran my exact code, and uh.... magic? Clearly, black magic. But black magic from devtools or testthat.
Restarting R session...
- Project '~/ALSF/open-scpca/rOpenScPCA' loaded. [renv 1.0.11]
> devtools::load_all(".")
ℹ Loading rOpenScPCA
> sce <- readRDS("tests/testthat/data/scpca_sce.rds")
> s <- sce_to_seurat(sce)
Error in UseMethod(generic = "DefaultAssay<-", object = object) :
no applicable method for 'DefaultAssay<-' applied to an object of class "call"
In addition: Warning message:
In ensembl_to_symbol(ensembl_ids, sce = sce, unique = unique, seurat_compatible = seurat_compatible) :
Replacing underscores ('_') with dashes ('-') in gene symbols for Seurat compatibility.
>
> object
test_check("rOpenScPCA")
I restarted R again, and...
Restarting R session...
- Project '~/ALSF/open-scpca/rOpenScPCA' loaded. [renv 1.0.11]
> object
test_check("rOpenScPCA")
I have made a short film.
https://github.com/user-attachments/assets/9b96416a-83f2-4091-b8d9-7aab22fa5d05
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this was at the previous commit; this behavior no longer happens after 914e500, so at least what I was experiencing was consistent with the known bug. Wild stuff, though.
Punting seems very fine. Some ideas for checks: the presence of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a go! It seems to be all be working for me locally, and reviews addressed what I had caught before. We may find other things to tweak later, but this seems set for the first implementation 🚀
Closes #8
I added two main functions here:
sce_to_seurat()
andsum_duplicate_genes()
.The first of these does the main conversion, with the second being a kind of helper for creating a matrix where all genes with the same name have their expression values summed.
I let
as.Seurat()
do most of the work, but I added a few extra bits to get things into useful places, such as transferring over the highly variable genes, and adding the spliced counts as a separate assay. The latter could maybe be an option, but it seemed easier to just do it for completeness.I do a fair amount of pre-conversion to avoid some of the Seurat warnings about the names of various things; I feel fine about the PCA/UMAP conversion, as that is pretty clear and simple. Howeber, I'm a bit less sure if I want to keep the gene name conversion where
_
is replaced buy-
without a message. I like having the option that I added to the id conversion function, but I am of two minds about whether that kind of name change should result in a message to the user when converting a full object.The one calculation that I add in is the
scale.data
slot, which lots of downstream analysis needs (even though this is a non-sparse matrix, making the object substantially bigger).Most of the documentation is there, though I did not yet add example code, which is part of why this remains a draft for now.
I also need to add some tests to be sure
altExps
are converted as expected; I think they are, but have not formally tested it.Let me know what you think, and especially if the core functionality is working for you!
One other question is what we thing the default id conversion should be; right now it uses whatever is in the SCE, but maybe we should use one of the built-in tables by default? Curious to hear thoughts on that front as well.