Add functions for Seurat conversion #15

jashapiro · 2024-12-06T20:59:00Z

Closes #8

I added two main functions here: sce_to_seurat() and sum_duplicate_genes().

The first of these does the main conversion, with the second being a kind of helper for creating a matrix where all genes with the same name have their expression values summed.
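To illustrate the summing idea, here is a minimal base R sketch. It is not the implementation in this PR (which works on a SingleCellExperiment); the toy matrix and the use of `rowsum()` on a dense matrix are assumptions made only for the example.

```r
# Toy count matrix with a duplicated gene symbol ("GeneA" appears twice).
counts <- matrix(
  c(1, 0, 2,
    3, 1, 0,
    0, 4, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB", "GeneA"), c("cell1", "cell2", "cell3"))
)

# rowsum() adds together all rows that share a group label;
# reorder = FALSE keeps genes in order of first appearance.
summed <- rowsum(counts, group = rownames(counts), reorder = FALSE)
print(summed)
```

The result has one row per unique gene name, with the two "GeneA" rows collapsed into their per-cell sums.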

I let as.Seurat() do most of the work, but I added a few extra bits to get things into useful places, such as transferring over the highly variable genes, and adding the spliced counts as a separate assay. The latter could maybe be an option, but it seemed easier to just do it for completeness.

I do a fair amount of pre-conversion to avoid some of the Seurat warnings about the names of various things; I feel fine about the PCA/UMAP conversion, as that is pretty clear and simple. However, I'm a bit less sure if I want to keep the gene name conversion where _ is replaced by - without a message. I like having the option that I added to the id conversion function, but I am of two minds about whether that kind of name change should result in a message to the user when converting a full object.
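As a rough sketch of the renaming in question: the helper below is hypothetical (its name `make_seurat_safe` and the `notify` argument are not part of the package) and only shows the underscore-to-dash replacement with an optional notice.

```r
# Hypothetical helper, not the package's function: replace underscores with
# dashes and optionally tell the user how many names were changed.
make_seurat_safe <- function(symbols, notify = TRUE) {
  has_underscore <- grepl("_", symbols, fixed = TRUE)
  if (notify && any(has_underscore)) {
    message(
      "Replacing underscores ('_') with dashes ('-') in ",
      sum(has_underscore), " gene symbol(s)."
    )
  }
  gsub("_", "-", symbols, fixed = TRUE)
}

safe <- make_seurat_safe(c("HLA_A", "CD3E", "MT_CO1"))
print(safe)
```

Setting `notify = FALSE` gives the silent behavior described above, which is the trade-off under discussion.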

The one calculation that I add in is the scale.data slot, which lots of downstream analysis needs (even though this is a non-sparse matrix, making the object substantially bigger).
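To show why a scaled matrix cannot stay sparse, here is a small base R sketch of per-gene centering and scaling. It illustrates the general idea only; Seurat's own scaling has further options that are not reproduced here, and the toy values are made up.

```r
# Toy log-normalized expression matrix (genes x cells) containing zeros.
logcounts <- matrix(
  c(0, 1, 2,
    2, 2, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB"), c("cell1", "cell2", "cell3"))
)

# scale() operates on columns, so transpose to scale each gene,
# then transpose back to genes x cells.
scaled <- t(scale(t(logcounts)))

# After centering, the zero entry for GeneA is no longer zero.
print(scaled["GeneA", ])
```

Because centering shifts every gene by its mean, entries that were zero generally become non-zero, so the result has to be stored densely.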

Most of the documentation is there, though I did not yet add example code, which is part of why this remains a draft for now.
I also need to add some tests to be sure altExps are converted as expected; I think they are, but have not formally tested it.

Let me know what you think, and especially if the core functionality is working for you!

One other question is what we think the default id conversion should be; right now it uses whatever is in the SCE, but maybe we should use one of the built-in tables by default? Curious to hear thoughts on that front as well.

sjspielman

This is a pre-review! All I did was read over the code to get an overall sense of what's here, and left very smol comments along the way. I still need to actually review & run the code next week.

R/sum-duplicate-genes.R

R/make-seurat.R

sjspielman · 2024-12-06T21:20:54Z

R/make-seurat.R

+#' @param dedup_method Method to handle duplicated gene symbols. If `unique`,
+#'   the gene symbols will be made unique following standard Seurat procedures.


Is there a Seurat version to specify here, or do they all (to our knowledge) follow the same approach?
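For context, one common way to make duplicated symbols unique in R is base `make.unique()`. Whether this is exactly what "standard Seurat procedures" refers to in the docs above is an assumption here; the sketch only shows the suffixing behavior.

```r
# Duplicated symbols get numeric suffixes in order of appearance.
symbols <- c("GeneA", "GeneB", "GeneA", "GeneA")
unique_symbols <- make.unique(symbols)
print(unique_symbols)
# "GeneA" "GeneB" "GeneA.1" "GeneA.2"
```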

R/make-seurat.R

sjspielman · 2024-12-06T21:28:07Z

R/make-seurat.R

+  } else {
+    create_seurat_assay <- SeuratObject::CreateAssayObject
+    sobj[["RNA"]] <- as(sobj[["RNA"]], "Assay")
+  }


Ah, I see specifying "v4" (or anything else) would work.

No, it wouldn't as match.arg() will fail.
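A small sketch of why an unlisted value fails: the function name, argument name, and choices below are hypothetical stand-ins, not the actual signature in R/make-seurat.R.

```r
# Hypothetical function: the argument name and its choices are assumptions.
pick_assay_version <- function(seurat_assay_version = c("v5", "v3")) {
  # match.arg() returns the first choice when the argument is left at its
  # default, and raises an error for anything not in the choices vector.
  match.arg(seurat_assay_version)
}

default_choice <- pick_assay_version()

# Capture the error message instead of stopping the script.
bad_choice <- tryCatch(
  pick_assay_version("v4"),
  error = function(e) conditionMessage(e)
)
print(default_choice)
print(bad_choice)
```

So with `match.arg()` in place, the `else` branch is only reached for the other listed choice, never for an arbitrary string.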

R/make-seurat.R

R/sum-duplicate-genes.R

Co-authored-by: Stephanie Spielman <[email protected]>

jashapiro

Thanks for the quick look. Made a few quick updates.

jashapiro · 2024-12-06T22:02:33Z

R/make-seurat.R

+  } else {
+    create_seurat_assay <- SeuratObject::CreateAssayObject
+    sobj[["RNA"]] <- as(sobj[["RNA"]], "Assay")
+  }


No, it wouldn't as match.arg() will fail.

R/make-seurat.R

R/sum-duplicate-genes.R

update with "latest" test data, be better with paths?

sjspielman

I've now gone over the code much more carefully and run it in a variety of conditions! I think you've done a pretty thorough job here; I can't think of much more ground to cover given how much ground already has been covered! have a 🌮.

While running the code, I confirmed that only processed (not (un)filtered) objects work with the code as is now, and similarly merged objects are a no-go. We should probably:

Add to the docs that only processed objects are supported and merged objects are not
Add some checks at the top of the function to bail early with an informative error if the input is not a single processed library

I'm frankly not convinced we ever want to accommodate (un)filtered, but merged I think would be useful to accommodate. If you agree, we should open that issue and come back to it.

Some specific feedback for your opening questions:

adding the spliced counts as a separate assay. The latter could maybe be an option, but it seemed easier to just do it for completeness.

Yeah, just do it.

I do a fair amount of pre-conversion to avoid some of the Seurat warnings about the names of various things; I feel fine about the PCA/UMAP conversion, as that is pretty clear and simple. However, I'm a bit less sure if I want to keep the gene name conversion where _ is replaced by - without a message. I like having the option that I added to the id conversion function, but I am of two minds about whether that kind of name change should result in a message to the user when converting a full object.

I think the message is reasonable since it mirrors the kind of message you'd get from Seurat if we didn't do that conversion ourselves. But, if we keep it, I'd make it indeed a message() (not warning() which it currently is), in part b/c that's also what Seurat does, but mostly because it's an expected change that happens when moving into Seurat-land which suggests message > warning.
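To make the practical difference concrete, here is a base R sketch contrasting the two condition types. The two functions are hypothetical stand-ins for the conversion step, not code from this PR.

```r
# Two hypothetical variants of the same conversion, differing only in how
# they notify the user.
convert_with_message <- function(x) {
  message("Replacing underscores ('_') with dashes ('-').")
  gsub("_", "-", x, fixed = TRUE)
}
convert_with_warning <- function(x) {
  warning("Replacing underscores ('_') with dashes ('-').")
  gsub("_", "-", x, fixed = TRUE)
}

# Each condition class is silenced by its own suppress function.
quiet_msg <- suppressMessages(convert_with_message("MULTI_1"))
quiet_warn <- suppressWarnings(convert_with_warning("MULTI_1"))

# tryCatch() reports which condition class each function signals.
classify <- function(expr) {
  tryCatch(
    {
      expr
      "none"
    },
    message = function(m) "message",
    warning = function(w) "warning"
  )
}
msg_class <- classify(convert_with_message("a_b"))
warn_class <- classify(convert_with_warning("a_b"))
print(c(msg_class, warn_class))
```

One consequence of the choice: users who run with `options(warn = 2)` have warnings promoted to errors, whereas messages never stop execution.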

I also need to add some tests to be sure altExps are converted as expected; I think they are, but have not formally tested it.

I gather this comment is now outdated given there are some tests! I did run the code with ScPCA CITE-seq and multiplexed SCEs successfully, so that's good! I did leave one in-line comment related to some of the behavior here too, but nothing major.

R/sum-duplicate-genes.R

sjspielman · 2024-12-09T18:35:46Z

data-raw/download_test_data.R

+
+# Downloads a test data file from OpenScPCA and places it in the test directory
+
+setwd(here::here())


Generally I prefer to avoid setwd() if we can, but given the role and scope of this script, I don't really mind it here. I see (from the build_gene_reference.R script) you already had some experience which convinced you to use setwd() rather than tossing it all into here::here()

yeah, for some reason test_path wasn't playing well with here(); I tried to do some nesting but it wasn't working. Suggestions welcome.

sjspielman · 2024-12-09T18:39:12Z

R/sum-duplicate-genes.R

+#' @param sce a SingleCellExperiment object with duplicated row names
+#' @param normalize a logical indicating whether to normalize the expression
+#'   values. Default is TRUE
+#' @param recalculate_reduced_dims a logical indicating whether to recalculate
+#'   PCA and UMAP. If FALSE, the input reduced dimensions are copied over. If
+#'   TRUE, the highly variable genes are also recalculated with the new values
+#'   stored in metadata. Default is FALSE


For consistency with other docs...

Suggested change

-#' @param sce a SingleCellExperiment object with duplicated row names
-#' @param normalize a logical indicating whether to normalize the expression
-#'   values. Default is TRUE
-#' @param recalculate_reduced_dims a logical indicating whether to recalculate
-#'   PCA and UMAP. If FALSE, the input reduced dimensions are copied over. If
-#'   TRUE, the highly variable genes are also recalculated with the new values
-#'   stored in metadata. Default is FALSE
+#' @param sce a SingleCellExperiment object with duplicated row names.
+#' @param normalize a logical indicating whether to normalize the expression
+#'   values. Default is TRUE.
+#' @param recalculate_reduced_dims a logical indicating whether to recalculate
+#'   PCA and UMAP. If FALSE, the input reduced dimensions are copied over. If
+#'   TRUE, the highly variable genes are also recalculated with the new values
+#'   stored in metadata. Default is FALSE.

sjspielman · 2024-12-09T18:41:50Z

R/convert-gene-ids.R

@@ -85,10 +89,17 @@ ensembl_to_symbol <- function(
    )
  }
  if (!leave_na && any(missing_symbols)) {
-    warning("Not all input ids have corresponding gene symbols, using input ids for missing values.")
+    message("Not all input ids have corresponding gene symbols, using input ids for missing values.")


R/make-seurat.R

R/sum-duplicate-genes.R

sjspielman · 2024-12-09T19:35:02Z

R/sum-duplicate-genes.R

+#' Genes with the same name are merged by summing their raw expression counts.
+#' If requested, the log-normalized expression values are recalculated, otherwise
+#' this is left blank.


I'd add a smidge more here about why this function is helpful, since it's exported (which I agree with)

R/sum-duplicate-genes.R

sjspielman · 2024-12-09T20:17:25Z

R/make-seurat.R

+    sobj[[alt_exp_name]] <- create_seurat_assay(counts = counts(alt_exp))
+    if ("logcounts" %in% assayNames(alt_exp)) {
+      sobj[[alt_exp_name]]$data <- logcounts(alt_exp)


Worth noting that these lines may both give the Seurat warning

Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')

I got this when running a multiplexed library through, so this was coming from the MULTI_X names. We might want to wrangle the altExps a bit more first if we want to avoid this Seurat warning. I think that would be good, since it's not very transparent to users that this is where the warning comes from, and we might add our own warning (message?) instead to be clearer about what's changing.

Added conversion for altExps with a warning.

jashapiro · 2024-12-09T20:40:47Z

I'm frankly not convinced we ever want to accommodate (un)filtered, but merged I think would be useful to accommodate. If you agree, we should open that issue and come back to it.

I actually would go the other way, and want unfiltered or at least filtered data to work, but I think merged data are a whole other kettle that I have no particular desire to support, as I have yet to see anyone working with them, let alone converting them.

I think the message is reasonable since it mirrors the kind of message you'd get from Seurat if we didn't do that conversion ourselves. But, if we keep it, I'd make it indeed a message() (not warning() which it currently is), in part b/c that's also what Seurat does, but mostly because it's an expected change that happens when moving into Seurat-land which suggests message > warning.

I think you get a warning from Seurat, which is why I used a warning. It does seem severe enough to warrant a warning.

sjspielman · 2024-12-09T20:51:59Z

I think you get a warning from Seurat, which is why I used a warning. It does seem severe enough to warrant a warning.

fine with me, but fyi, i think this is the spot https://github.com/satijalab/seurat/blob/1549dcb3075eaeac01c925c4b4bb73c73450fc50/R/utilities.R#L2885

jashapiro · 2024-12-09T21:03:51Z

I think you get a warning from Seurat, which is why I used a warning. It does seem severe enough to warrant a warning.

fine with me, but fyi, i think this is the spot https://github.com/satijalab/seurat/blob/1549dcb3075eaeac01c925c4b4bb73c73450fc50/R/utilities.R#L2885

It's actually from here: https://github.com/satijalab/seurat-object/blob/e840bab6f1a9220104005a8043a087734fa02903/R/assay.R#L643

Co-authored-by: Stephanie Spielman <[email protected]>

unfiltered too should work

jashapiro

I think this should be ready for another look when you get a chance.

I fixed handling of less processed (filtered and unfiltered objects) as well as better handling of altExps that might have underscores in the feature names. I also updated docs.

I did not do anything about merged data; I wasn't sure what the best method for identifying merged data might be, so I am basically punting on that problem for now.

jashapiro · 2024-12-10T20:04:42Z

R/make-seurat.R

+    sobj[[alt_exp_name]] <- create_seurat_assay(counts = counts(alt_exp))
+    if ("logcounts" %in% assayNames(alt_exp)) {
+      sobj[[alt_exp_name]]$data <- logcounts(alt_exp)


Added conversion for altExps with a warning.

there were a bunch of unused packages in there for some reason, which will slow testing.

sjspielman · 2024-12-13T17:38:07Z

R/make-seurat.R

+  } else {
+    data_name <- NULL
+  }
+  sobj <- Seurat::as.Seurat(sce, data = data_name)


Tests clearly pass, but I'm unable to run sce_to_seurat() standalone and the error is showing up at this line, both with a processed and filtered SCE. Attaching Seurat(Object) didn't help. Are you able to run this outside of the test suite?

Error in UseMethod(generic = "DefaultAssay<-", object = object) : no applicable method for 'DefaultAssay<-' applied to an object of class "call"

> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS 15.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods base

other attached packages:
 [1] rOpenScPCA_0.1.0 SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
 [4] Biobase_2.64.0 GenomicRanges_1.56.1 GenomeInfoDb_1.40.1
 [7] IRanges_2.38.1 S4Vectors_0.42.1 BiocGenerics_0.50.0
[10] MatrixGenerics_1.16.0 matrixStats_1.4.1 Seurat_5.1.0
[13] SeuratObject_5.0.2 sp_2.1-4 testthat_3.2.2

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22 splines_4.4.0 later_1.3.2
  [4] tibble_3.2.1 polyclip_1.10-7 fastDummies_1.7.4
  [7] lifecycle_1.0.4 rprojroot_2.0.4 globals_0.16.3
 [10] lattice_0.22-6 MASS_7.3-60.2 magrittr_2.0.3
 [13] plotly_4.10.4 yaml_2.3.10 remotes_2.5.0
 [16] httpuv_1.6.15 sctransform_0.4.1 spam_2.10-0
 [19] sessioninfo_1.2.2 pkgbuild_1.4.4 spatstat.sparse_3.1-0
 [22] reticulate_1.39.0 cowplot_1.1.3 pbapply_1.7-2
 [25] RColorBrewer_1.1-3 abind_1.4-5 pkgload_1.4.0
 [28] zlibbioc_1.50.0 Rtsne_0.17 purrr_1.0.2
 [31] GenomeInfoDbData_1.2.12 ggrepel_0.9.6 irlba_2.3.5.1
 [34] listenv_0.9.1 spatstat.utils_3.1-0 goftest_1.2-3
 [37] RSpectra_0.16-2 spatstat.random_3.3-1 fitdistrplus_1.2-1
 [40] parallelly_1.38.0 leiden_0.4.3.1 codetools_0.2-20
 [43] DelayedArray_0.30.1 tidyselect_1.2.1 UCSC.utils_1.0.0
 [46] pdfCluster_1.0-4 spatstat.explore_3.3-2 jsonlite_1.8.8
 [49] BiocNeighbors_1.22.0 ellipsis_0.3.2 progressr_0.14.0
 [52] ggridges_0.5.6 survival_3.7-0 tools_4.4.0
 [55] ica_1.0-3 Rcpp_1.0.13 glue_1.7.0
 [58] gridExtra_2.3 SparseArray_1.4.8 usethis_3.0.0
 [61] dplyr_1.1.4 withr_3.0.2 BiocManager_1.30.25
 [64] fastmap_1.2.0 bluster_1.14.0 fansi_1.0.6
 [67] digest_0.6.37 R6_2.5.1 mime_0.12
 [70] colorspace_2.1-1 scattermore_1.2 tensor_1.5
 [73] spatstat.data_3.1-2 utf8_1.2.4 tidyr_1.3.1
 [76] generics_0.1.3 renv_1.0.11 data.table_1.16.0
 [79] httr_1.4.7 htmlwidgets_1.6.4 S4Arrays_1.4.1
 [82] uwot_0.2.2 pkgconfig_2.0.3 gtable_0.3.5
 [85] lmtest_0.9-40 XVector_0.44.0 brio_1.1.5
 [88] htmltools_0.5.8.1 profvis_0.4.0 dotCall64_1.1-1
 [91] scales_1.3.0 png_0.1-8 spatstat.univar_3.0-1
 [94] geometry_0.5.0 rstudioapi_0.17.1 reshape2_1.4.4
 [97] nlme_3.1-164 magic_1.6-1 cachem_1.1.0
[100] zoo_1.8-12 stringr_1.5.1 KernSmooth_2.23-24
[103] parallel_4.4.0 miniUI_0.1.1.1 desc_1.4.3
[106] pillar_1.9.0 grid_4.4.0 vctrs_0.6.5
[109] RANN_2.6.2 urlchecker_1.0.1 promises_1.3.0
[112] xtable_1.8-4 cluster_2.1.6 waldo_0.6.1
[115] cli_3.6.3 compiler_4.4.0 rlang_1.1.4
[118] crayon_1.5.3 future.apply_1.11.2 plyr_1.8.9
[121] fs_1.6.4 stringi_1.8.4 deldir_2.0-4
[124] viridisLite_0.4.2 BiocParallel_1.38.0 munsell_0.5.1
[127] lazyeval_0.2.2 devtools_2.4.5 spatstat.geom_3.3-2
[130] Matrix_1.7-0 RcppHNSW_0.6.0 patchwork_1.2.0
[133] future_1.34.0 ggplot2_3.5.1 shiny_1.9.1
[136] ROCR_1.0-11 igraph_2.0.3 memoise_2.0.1

914e500 seems to have done it

This is probably related to an renv bug in the latest version; do you have a variable named object in your environment? (Seurat must be doing something bad here too, because it shouldn't read a global variable wherever it is doing that, but I can't work around all the Seurat bugs...) It should not affect running the function if you load the package from an external source. I just pushed a temporary fix until the renv update comes.

do you have a variable named object in your environment?

No definitely not.

...But I just went back to that commit hash and re-ran my exact code, and uh.... magic? Clearly, black magic. But black magic from devtools or testthat.

Restarting R session...

- Project '~/ALSF/open-scpca/rOpenScPCA' loaded. [renv 1.0.11]
> devtools::load_all(".")
ℹ Loading rOpenScPCA
> sce <- readRDS("tests/testthat/data/scpca_sce.rds")
> s <- sce_to_seurat(sce)
Error in UseMethod(generic = "DefaultAssay<-", object = object) :
  no applicable method for 'DefaultAssay<-' applied to an object of class "call"
In addition: Warning message:
In ensembl_to_symbol(ensembl_ids, sce = sce, unique = unique, seurat_compatible = seurat_compatible) :
  Replacing underscores ('_') with dashes ('-') in gene symbols for Seurat compatibility.
>
> object
test_check("rOpenScPCA")

I restarted R again, and...

Restarting R session...

- Project '~/ALSF/open-scpca/rOpenScPCA' loaded. [renv 1.0.11]
> object
test_check("rOpenScPCA")

I have made a short film.
https://github.com/user-attachments/assets/9b96416a-83f2-4091-b8d9-7aab22fa5d05

Again, this was at the previous commit; this behavior no longer happens after 914e500, so at least what I was experiencing was consistent with the known bug. Wild stuff, though.

sjspielman · 2024-12-13T17:41:03Z

I did not do anything about merged data; I wasn't sure what the best method for identifying merged data might be, so I am basically punting on that problem for now.

Punting seems very fine. Some ideas for checks: the presence of metadata(sce)$library_metadata and/or number of unique library IDs sce$library_id (particularly the latter) seem like contenders for indicators that users aren't likely to modify.
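A minimal sketch of the second idea, counting unique library IDs. To keep it free of Bioconductor dependencies it takes a plain character vector in place of `sce$library_id`; the function name and the placeholder IDs are hypothetical.

```r
# Hypothetical check: `library_ids` stands in for the per-cell values of
# sce$library_id. More than one unique ID suggests a merged object.
check_single_library <- function(library_ids) {
  n_libraries <- length(unique(library_ids))
  if (n_libraries > 1) {
    stop(
      "Found ", n_libraries,
      " library IDs; merged objects are not supported."
    )
  }
  invisible(TRUE)
}

# A single-library input passes silently.
single_ok <- check_single_library(rep("libraryA", 5))

# A two-library input fails early with an informative error.
merged_msg <- tryCatch(
  check_single_library(c("libraryA", "libraryA", "libraryB")),
  error = function(e) conditionMessage(e)
)
print(merged_msg)
```

This assumes the column is present and populated per cell; an object without it would need a separate fallback.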

sjspielman

I think this is a go! It all seems to be working for me locally, and reviews addressed what I had caught before. We may find other things to tweak later, but this seems set for the first implementation 🚀

jashapiro added 6 commits December 5, 2024 17:12

Add function for merging duplicate gene names

b7c8385

add tests and working functions

07a1e12

rename merge to sum-duplicates

8a74309

Add seurat_compatible option to conversion

39d7b56

Add make-seurat functions and tests

6a316f7

document

868aa6d

jashapiro requested a review from sjspielman December 6, 2024 21:03

sjspielman reviewed Dec 6, 2024


jashapiro and others added 2 commits December 6, 2024 17:03

Apply suggestions from code review

248c1c1

Co-authored-by: Stephanie Spielman <[email protected]>

responses to initial review

70fba8f

jashapiro requested a review from sjspielman December 6, 2024 22:05

jashapiro commented Dec 6, 2024


jashapiro added 4 commits December 6, 2024 17:33

add altexp handling

9ef3821

Warn about gene id conversions,

36296d9

Add Seurat v5 checks

80f0703

add download script for test data

b2ebf8b

update with "latest" test data, be better with paths?

sjspielman reviewed Dec 9, 2024


jashapiro and others added 9 commits December 9, 2024 16:06

Apply suggestions from code review

1d80018

Co-authored-by: Stephanie Spielman <[email protected]>

add filtered test data

2188402

Support filtered data

a90d281

unfiltered too should work

hvg fixes for summing

40c4c96

add handling of altexps with underscores in names

17b86b3

update documentation

927f050

pre-commit update

01d0849

add examples

55442c5

docs update

9fe0918

jashapiro commented Dec 10, 2024


jashapiro marked this pull request as ready for review December 10, 2024 20:10

jashapiro requested a review from sjspielman December 12, 2024 16:21

jashapiro added 2 commits December 12, 2024 13:39

Update licensing to remove build note

0e13cfd

clean up renv

f79f070

there were a bunch of unused packages in there for some reason, which will slow testing.

sjspielman reviewed Dec 13, 2024


work around renv bug

914e500

sjspielman approved these changes Dec 13, 2024


add a quick check for merged data

7c546fd

sjspielman mentioned this pull request Dec 13, 2024

hello-clusters notebook: Perform and evaluate clustering AlexsLemonade/OpenScPCA-analysis#874

Merged

jashapiro merged commit 0424106 into main Dec 13, 2024
2 checks passed

jashapiro deleted the jashapiro/8-seurat-conversion branch December 13, 2024 19:23
