Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functions for Seurat conversion #15

Merged
merged 25 commits into from
Dec 13, 2024
Merged

Conversation

jashapiro
Copy link
Member

Closes #8

I added two main functions here: sce_to_seurat() and sum_duplicate_genes().

The first of these does the main conversion, with the second being a kind of helper for creating a matrix where all genes with the same name have their expression values summed.

I let as.Seurat() do most of the work, but I added a few extra bits to get things into useful places, such as transferring over the highly variable genes, and adding the spliced counts as a separate assay. The latter could maybe be an option, but it seemed easier to just do it for completeness.

I do a fair amount of pre-conversion to avoid some of the Seurat warnings about the names of various things; I feel fine about the PCA/UMAP conversion, as that is pretty clear and simple. Howeber, I'm a bit less sure if I want to keep the gene name conversion where _ is replaced buy - without a message. I like having the option that I added to the id conversion function, but I am of two minds about whether that kind of name change should result in a message to the user when converting a full object.

The one calculation that I add in is the scale.data slot, which lots of downstream analysis needs (even though this is a non-sparse matrix, making the object substantially bigger).

Most of the documentation is there, though I did not yet add example code, which is part of why this remains a draft for now.
I also need to add some tests to be sure altExps are converted as expected; I think they are, but have not formally tested it.

Let me know what you think, and especially if the core functionality is working for you!

One other question is what we thing the default id conversion should be; right now it uses whatever is in the SCE, but maybe we should use one of the built-in tables by default? Curious to hear thoughts on that front as well.

@jashapiro jashapiro requested a review from sjspielman December 6, 2024 21:03
Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pre-review! All I did was read over the code to get an overall sense of what's here, and left very smol comments along the way. I still need to actually review & run the code next week.

R/sum-duplicate-genes.R Outdated Show resolved Hide resolved
R/make-seurat.R Outdated Show resolved Hide resolved
Comment on lines +20 to +21
#' @param dedup_method Method to handle duplicated gene symbols. If `unique`,
#' the gene symbols will be made unique following standard Seurat procedures.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a Seurat version to specify here, or do they all (to our knowledge) follow the same approach?

R/make-seurat.R Show resolved Hide resolved
R/make-seurat.R Show resolved Hide resolved
R/make-seurat.R Outdated Show resolved Hide resolved
R/make-seurat.R Outdated Show resolved Hide resolved
R/make-seurat.R Outdated
Comment on lines 103 to 106
} else {
create_seurat_assay <- SeuratObject::CreateAssayObject
sobj[["RNA"]] <- as(sobj[["RNA"]], "Assay")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see specifying "v4" (or anything else) would work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it wouldn't as match.arg() will fail.

R/make-seurat.R Outdated Show resolved Hide resolved
R/sum-duplicate-genes.R Show resolved Hide resolved
@jashapiro jashapiro requested a review from sjspielman December 6, 2024 22:05
Copy link
Member Author

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick look. Made a few quick updates.

R/make-seurat.R Outdated
Comment on lines 103 to 106
} else {
create_seurat_assay <- SeuratObject::CreateAssayObject
sobj[["RNA"]] <- as(sobj[["RNA"]], "Assay")
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it wouldn't as match.arg() will fail.

R/make-seurat.R Outdated Show resolved Hide resolved
R/sum-duplicate-genes.R Show resolved Hide resolved
Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now gone over the code much more carefully and run it in a variety of conditions! I think you've done a pretty thorough job here; I can't think of much more ground to cover given how much ground already has been covered! have a 🌮.

While running the code, I confirmed that only processed (not (un)filtered) objects work with the code as is now, and similarly merged objects are a no-go. We should probably:

  • Add to docs that processed only & no merged
  • Add some checks for this at the top of the function to bail early with an informative error if it's not a single processed library

I'm frankly not convinced we ever want to accommodate (un)filtered, but merged I think would be useful to accommodate. If you agree, we should open that issue and come back to it.

Some specific feedback for your opening questions:

adding the spliced counts as a separate assay. The latter could maybe be an option, but it seemed easier to just do it for completeness.

Yeah, just do it.

I do a fair amount of pre-conversion to avoid some of the Seurat warnings about the names of various things; I feel fine about the PCA/UMAP conversion, as that is pretty clear and simple. However, I'm a bit less sure if I want to keep the gene name conversion where _ is replaced by - without a message. I like having the option that I added to the id conversion function, but I am of two minds about whether that kind of name change should result in a message to the user when converting a full object.

I think the message is reasonable since it mirrors the kind of message you'd get from Seurat if we didn't do that conversion ourselves. But, if we keep it, I'd make it indeed a message() (not warning() which it currently is), in part b/c that's also what Seurat does, but mostly because it's an expected change that happens when moving into Seurat-land which suggests message > warning.

I also need to add some tests to be sure altExps are converted as expected; I think they are, but have not formally tested it.

I gather this comment is now outdated given there are some tests! I did run the code with ScPCA CITE-seq and multiplexed SCEs successfully, so that's good! I did leave one in-line comment related to some of the behavior here too, but nothing major.

R/sum-duplicate-genes.R Outdated Show resolved Hide resolved

# Downloads a test data file from OpenScPCA and places it in the test directory

setwd(here::here())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I prefer to avoid setwd() if we can, but given the role and scope of this script, I don't really mind it here. I see (from the build_gene_reference.R script) you already had some experience which convinced you to use setwd() rather than tossing it all into here::here()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, for some reason test_path wasn't playing well with here(); I tried to do some nesting but it wasn't working suggestions welcome.

Comment on lines 8 to 14
#' @param sce a SingleCellExperiment object with duplicated row names
#' @param normalize a logical indicating whether to normalize the expression
#' values. Default is TRUE
#' @param recalculate_reduced_dims a logical indicating whether to recalculate
#' PCA and UMAP. If FALSE, the input reduced dimensions are copied over. If
#' TRUE, the highly variable genes are also recalculated with the new values
#' stored in metadata. Default is FALSE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with other docs...

Suggested change
#' @param sce a SingleCellExperiment object with duplicated row names
#' @param normalize a logical indicating whether to normalize the expression
#' values. Default is TRUE
#' @param recalculate_reduced_dims a logical indicating whether to recalculate
#' PCA and UMAP. If FALSE, the input reduced dimensions are copied over. If
#' TRUE, the highly variable genes are also recalculated with the new values
#' stored in metadata. Default is FALSE
#' @param sce a SingleCellExperiment object with duplicated row names.
#' @param normalize a logical indicating whether to normalize the expression
#' values. Default is TRUE.
#' @param recalculate_reduced_dims a logical indicating whether to recalculate
#' PCA and UMAP. If FALSE, the input reduced dimensions are copied over. If
#' TRUE, the highly variable genes are also recalculated with the new values
#' stored in metadata. Default is FALSE.

@@ -85,10 +89,17 @@ ensembl_to_symbol <- function(
)
}
if (!leave_na && any(missing_symbols)) {
warning("Not all input ids have corresponding gene symbols, using input ids for missing values.")
message("Not all input ids have corresponding gene symbols, using input ids for missing values.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

R/make-seurat.R Outdated Show resolved Hide resolved
R/make-seurat.R Outdated Show resolved Hide resolved
R/sum-duplicate-genes.R Outdated Show resolved Hide resolved
Comment on lines 3 to 5
#' Genes with the same name are merged by summing their raw expression counts.
#' If requested, the log-normalized expression values are recalculated, otherwise
#' this is left blank.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a smidge more here about why this function is helpful, since it's exported (which I agree with)

R/sum-duplicate-genes.R Outdated Show resolved Hide resolved
Comment on lines +120 to +122
sobj[[alt_exp_name]] <- create_seurat_assay(counts = counts(alt_exp))
if ("logcounts" %in% assayNames(alt_exp)) {
sobj[[alt_exp_name]]$data <- logcounts(alt_exp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting that these lines may both give the Seurat warning

Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')

I got this when running a multiplexed library through, so this was coming from the MULTI_X names. We might want to wrangle the altexps a bit more first if we want to avoid this Seurat warning. I think would be good since it's not very transparent to users that this is where the warning comes from, and we might add our own warning (message?) instead to be clearer what's changing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added conversion for altExps with a warning.

@jashapiro
Copy link
Member Author

I'm frankly not convinced we ever want to accommodate (un)filtered, but merged I think would be useful to accommodate. If you agree, we should open that issue and come back to it.

I actually would go the other way, and want unfiltered or at least filtered data to work, but I think merged data are a whole other kettle that I have no particular desire to support, as I have yet to see anyone working with them, let alone converting them.

I think the message is reasonable since it mirrors the kind of message you'd get from Seurat if we didn't do that conversion ourselves. But, if we keep it, I'd make it indeed a message() (not warning() which it currently is), in part b/c that's also what Seurat does, but mostly because it's an expected change that happens when moving into Seurat-land which suggests message > warning.

I think you get a warning from Seurat, which I why I used a warning. It does seem severe enough to warrant a warning.

@sjspielman
Copy link
Member

I think you get a warning from Seurat, which I why I used a warning. It does seem severe enough to warrant a warning.

fine with me, but fyi, i think this is the spot https://github.com/satijalab/seurat/blob/1549dcb3075eaeac01c925c4b4bb73c73450fc50/R/utilities.R#L2885

@jashapiro
Copy link
Member Author

I think you get a warning from Seurat, which I why I used a warning. It does seem severe enough to warrant a warning.

fine with me, but fyi, i think this is the spot https://github.com/satijalab/seurat/blob/1549dcb3075eaeac01c925c4b4bb73c73450fc50/R/utilities.R#L2885

It's actually from here: https://github.com/satijalab/seurat-object/blob/e840bab6f1a9220104005a8043a087734fa02903/R/assay.R#L643

Copy link
Member Author

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be ready for another look when you get a chance.

I fixed handling of less processed (filtered and unfiltered objects) as well as better handling of altExps that might have underscores in the feature names. I also updated docs.

I did not do anything about merged data; I wasn't sure what the best method for identifying merged data might be, so I am basically punting on that problem for now.

Comment on lines +120 to +122
sobj[[alt_exp_name]] <- create_seurat_assay(counts = counts(alt_exp))
if ("logcounts" %in% assayNames(alt_exp)) {
sobj[[alt_exp_name]]$data <- logcounts(alt_exp)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added conversion for altExps with a warning.

@jashapiro jashapiro marked this pull request as ready for review December 10, 2024 20:10
@jashapiro jashapiro requested a review from sjspielman December 12, 2024 16:21
there were a bunch of unused packages in there for some reason, which will slow testing.
} else {
data_name <- NULL
}
sobj <- Seurat::as.Seurat(sce, data = data_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests clearly pass, but I'm unable to run sce_to_seurat() standalone and the error is showing up at this line, both with a processed and filtered SCE. Attaching Seurat(Object) didn't help. Are you able to run this outside of the test suite?

Error in UseMethod(generic = "DefaultAssay<-", object = object) : 
  no applicable method for 'DefaultAssay<-' applied to an object of class "call"
> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS 15.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] rOpenScPCA_0.1.0            SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
 [4] Biobase_2.64.0              GenomicRanges_1.56.1        GenomeInfoDb_1.40.1        
 [7] IRanges_2.38.1              S4Vectors_0.42.1            BiocGenerics_0.50.0        
[10] MatrixGenerics_1.16.0       matrixStats_1.4.1           Seurat_5.1.0               
[13] SeuratObject_5.0.2          sp_2.1-4                    testthat_3.2.2             

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22        splines_4.4.0           later_1.3.2            
  [4] tibble_3.2.1            polyclip_1.10-7         fastDummies_1.7.4      
  [7] lifecycle_1.0.4         rprojroot_2.0.4         globals_0.16.3         
 [10] lattice_0.22-6          MASS_7.3-60.2           magrittr_2.0.3         
 [13] plotly_4.10.4           yaml_2.3.10             remotes_2.5.0          
 [16] httpuv_1.6.15           sctransform_0.4.1       spam_2.10-0            
 [19] sessioninfo_1.2.2       pkgbuild_1.4.4          spatstat.sparse_3.1-0  
 [22] reticulate_1.39.0       cowplot_1.1.3           pbapply_1.7-2          
 [25] RColorBrewer_1.1-3      abind_1.4-5             pkgload_1.4.0          
 [28] zlibbioc_1.50.0         Rtsne_0.17              purrr_1.0.2            
 [31] GenomeInfoDbData_1.2.12 ggrepel_0.9.6           irlba_2.3.5.1          
 [34] listenv_0.9.1           spatstat.utils_3.1-0    goftest_1.2-3          
 [37] RSpectra_0.16-2         spatstat.random_3.3-1   fitdistrplus_1.2-1     
 [40] parallelly_1.38.0       leiden_0.4.3.1          codetools_0.2-20       
 [43] DelayedArray_0.30.1     tidyselect_1.2.1        UCSC.utils_1.0.0       
 [46] pdfCluster_1.0-4        spatstat.explore_3.3-2  jsonlite_1.8.8         
 [49] BiocNeighbors_1.22.0    ellipsis_0.3.2          progressr_0.14.0       
 [52] ggridges_0.5.6          survival_3.7-0          tools_4.4.0            
 [55] ica_1.0-3               Rcpp_1.0.13             glue_1.7.0             
 [58] gridExtra_2.3           SparseArray_1.4.8       usethis_3.0.0          
 [61] dplyr_1.1.4             withr_3.0.2             BiocManager_1.30.25    
 [64] fastmap_1.2.0           bluster_1.14.0          fansi_1.0.6            
 [67] digest_0.6.37           R6_2.5.1                mime_0.12              
 [70] colorspace_2.1-1        scattermore_1.2         tensor_1.5             
 [73] spatstat.data_3.1-2     utf8_1.2.4              tidyr_1.3.1            
 [76] generics_0.1.3          renv_1.0.11             data.table_1.16.0      
 [79] httr_1.4.7              htmlwidgets_1.6.4       S4Arrays_1.4.1         
 [82] uwot_0.2.2              pkgconfig_2.0.3         gtable_0.3.5           
 [85] lmtest_0.9-40           XVector_0.44.0          brio_1.1.5             
 [88] htmltools_0.5.8.1       profvis_0.4.0           dotCall64_1.1-1        
 [91] scales_1.3.0            png_0.1-8               spatstat.univar_3.0-1  
 [94] geometry_0.5.0          rstudioapi_0.17.1       reshape2_1.4.4         
 [97] nlme_3.1-164            magic_1.6-1             cachem_1.1.0           
[100] zoo_1.8-12              stringr_1.5.1           KernSmooth_2.23-24     
[103] parallel_4.4.0          miniUI_0.1.1.1          desc_1.4.3             
[106] pillar_1.9.0            grid_4.4.0              vctrs_0.6.5            
[109] RANN_2.6.2              urlchecker_1.0.1        promises_1.3.0         
[112] xtable_1.8-4            cluster_2.1.6           waldo_0.6.1            
[115] cli_3.6.3               compiler_4.4.0          rlang_1.1.4            
[118] crayon_1.5.3            future.apply_1.11.2     plyr_1.8.9             
[121] fs_1.6.4                stringi_1.8.4           deldir_2.0-4           
[124] viridisLite_0.4.2       BiocParallel_1.38.0     munsell_0.5.1          
[127] lazyeval_0.2.2          devtools_2.4.5          spatstat.geom_3.3-2    
[130] Matrix_1.7-0            RcppHNSW_0.6.0          patchwork_1.2.0        
[133] future_1.34.0           ggplot2_3.5.1           shiny_1.9.1            
[136] ROCR_1.0-11             igraph_2.0.3            memoise_2.0.1 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

914e500 seems to have done it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably related to an renv bug in the latest version; do you have a variable named object in your environment? (Seurat must be doing something bad here too, because it shouldn't read a global variable wherever it is doing that, but I can't work around all the Seurat bugs...) It should not affect running the function if you load the package from an external source. I just pushed a temporary fix until renv update comes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have a variable named object in your environment?

No definitely not.

...But I just went back to that commit hash and re-ran my exact code, and uh.... magic? Clearly, black magic. But black magic from devtools or testthat.

Restarting R session...

- Project '~/ALSF/open-scpca/rOpenScPCA' loaded. [renv 1.0.11]
> devtools::load_all(".")
ℹ Loading rOpenScPCA
> sce <- readRDS("tests/testthat/data/scpca_sce.rds")
> s <- sce_to_seurat(sce)
Error in UseMethod(generic = "DefaultAssay<-", object = object) : 
  no applicable method for 'DefaultAssay<-' applied to an object of class "call"
In addition: Warning message:
In ensembl_to_symbol(ensembl_ids, sce = sce, unique = unique, seurat_compatible = seurat_compatible) :
  Replacing underscores ('_') with dashes ('-') in gene symbols for Seurat compatibility.
> 
> object
test_check("rOpenScPCA")

I restarted R again, and...

Restarting R session...

- Project '~/ALSF/open-scpca/rOpenScPCA' loaded. [renv 1.0.11]
> object
test_check("rOpenScPCA")

I have made a short film.
https://github.com/user-attachments/assets/9b96416a-83f2-4091-b8d9-7aab22fa5d05

Copy link
Member

@sjspielman sjspielman Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this was at the previous commit; this behavior no longer happens after 914e500, so at least what I was experiencing was consistent with the known bug. Wild stuff, though.

@sjspielman
Copy link
Member

I did not do anything about merged data; I wasn't sure what the best method for identifying merged data might be, so I am basically punting on that problem for now.

Punting seems very fine. Some ideas for checks: the presence of metadata(sce)$library_metadata and/or number of unique library IDs sce$library_id (particularly the latter) seem like contenders for indicators that users aren't likely to modify.

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a go! It seems to be all be working for me locally, and reviews addressed what I had caught before. We may find other things to tweak later, but this seems set for the first implementation 🚀

@jashapiro jashapiro merged commit 0424106 into main Dec 13, 2024
2 checks passed
@jashapiro jashapiro deleted the jashapiro/8-seurat-conversion branch December 13, 2024 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create function for Seurat conversion
2 participants