Merging SCEs: Use union of features #247

sjspielman · 2023-12-18T15:57:08Z

Towards #150

This PR takes some next steps towards simplifying altexp merging:

A little reorganization in merge_sce_list() so that all checks that might cause us to stop() are at the top of the function, for faster failing.
Add a check for altExps: Each given altexp name must have the same features, or we stop().
Create new function create_sce_with_all_features(), which is currently only used for the main SCE (altExp usage in next PR), which ensures that a given vector of features (here, vector representing union of relevant features) is always reflected in the SCE. I am extra careful to ensure the row order matches, as well.
- There is one TODO here about how to handle rowData columns like gene_ids which may get filled with NAs. I haven't really explored this situation in depth; waiting to do so in the next PR(s) when I start applying this function to the altExps where, at least for our current ScPCA data, this is likely to matter.
- If you have specific thoughts though, happy to get going on this TODO now!
Tests are present via actual test-driven development

… prepare to update code with _union_ of main SCE features

…create a new SCE with all expected features

jashapiro · 2023-12-18T16:50:05Z

Create new function create_sce_with_all_features(), which is currently only used for the main SCE (altExp usage in next PR), which ensures that a given vector of features (here, vector representing union of relevant features) is always reflected in the SCE. I am extra careful to ensure the row order matches, as well.

I actually don't think we want to do this for main Exps. Sorry if I was not clear about that! There are some stats that could break with NA values, and I think it is better for us to leave the main experiments as they are, the intersection of the sampled genes.

We can keep this function around, but I don't think we want it to be used right now. For the AltExps, there may be an argument to use it, but I think what might actually be better to create separate AltExps if there are two with the same name that have different feature complements, e.g., adt-1 and adt-2.

jashapiro · 2023-12-18T16:58:31Z

Create new function create_sce_with_all_features(), which is currently only used for the main SCE (altExp usage in next PR), which ensures that a given vector of features (here, vector representing union of relevant features) is always reflected in the SCE. I am extra careful to ensure the row order matches, as well.

I actually don't think we want to do this for main Exps. Sorry if I was not clear about that! There are some stats that could break with NA values, and I think it is better for us to leave the main experiments as they are, the intersection of the sampled genes.

We can keep this function around, but I don't think we want it to be used right now. For the AltExps, there may be an argument to use it, but I think what might actually be better to create separate AltExps if there are two with the same name that have different feature complements, e.g., adt-1 and adt-2.

Just following up on this a bit... having started to look at the code: I think you can leave most of what you have in the prepare function: we just won't give it the union by default...

jashapiro

I made a number of comments below, but I am sort of undecided about how much of the "expansion" capabilities we want to keep. Part of me thinks we don't need it yet, as per my earlier comment I don't think we want to expand main SCE assays at this point. We may want to do that for AltExps, but for now I think merging only if the AltExps match is sufficient (we could allow feature order changes, if we want to be generous).

But I don't want to abandon the work you have put in here, as it might be useful in the future! So the comments below are about what might happen if we wanted to keep this capability:

For the assays themselves, this looks fine, with some small modifications to keep things sparse.

But I think the rowData expansion needs to be rewritten, as the current codes doesn't handle mixed data types correctly. I would expect a separate function that takes a data frame and a list of features is probably going to be useful here: Basically much like the create_sce_with_features but create_rowData_with_features (I deleted all) to handle the bookkeeping there, as I think it will be a bit complicated and worth testing directly.

I think the best plan might be to branch off this code and link the branch in a new issue: "Allow SCE merge with unions of features" that we can come back to later, then go back to finishing the main feature we want for now, which would leave the main object merge as it is and add the ability to include AltExps if they are all the same type (or missing).

jashapiro · 2023-12-18T16:52:14Z

R/merge_sce_list.R

+    purrr::map(\(sce){
+      all(c("library_id", "sample_id") %in% names(metadata(sce)))
+    }) |>
+    unlist()


Suggested change

purrr::map(\(sce){

all(c("library_id", "sample_id") %in% names(metadata(sce)))

}) |>

unlist()

purrr::map_lgl(\(sce){

all(c("library_id", "sample_id") %in% names(metadata(sce)))

})

jashapiro · 2023-12-18T16:54:46Z

R/merge_sce_list.R

+      # find their union of features
+      altexp_name_features <- altexp_list |>
+        purrr::map(rownames) |>
+        purrr::reduce(union) |>
+        sort()


Given that we are going to test that they are all equal, we don't need to get the union here: we can just test that all are equal to the first.

jashapiro · 2023-12-18T16:57:00Z

R/merge_sce_list.R

  sce_list <- sce_list |>
    purrr::imap(
      prepare_sce_for_merge,
      batch_column = batch_column,
      cell_id_column = cell_id_column,
-      shared_features = shared_features,
+      all_features = sce_full_features,


Let's just call this merge_features? Nothing says we need it to be "all" or "shared", etc.

Also updating the variable to match above changes

Suggested change

all_features = sce_full_features,

merge_features = sce_shared_features,

jashapiro · 2023-12-18T17:25:20Z

R/merge_sce_list.R

+  # First, obtain the union of features for all (main) SCE objects
+  # this will also be the final order of features/rows in the final SCE
+  sce_full_features <- sce_list |>
+    purrr::map(rownames) |>
+    purrr::reduce(union)


Going back to intersect here, because I think that is a better default. We could add this as an option to the function, but there may be unintended consequences to having NA values in the matrix for downstream steps, so I don't want to do this now.

Suggested change

# First, obtain the union of features for all (main) SCE objects

# this will also be the final order of features/rows in the final SCE

sce_full_features <- sce_list |>

purrr::map(rownames) |>

purrr::reduce(union)

# First, obtain the intersection of features for all (main) SCE objects

# this will also be the final order of features/rows in the final SCE

sce_shared_features <- sce_list |>

purrr::map(rownames) |>

purrr::reduce(intersect)

Noting this intersect code is already higher up in the script as a check to ensure there are intersecting features (I moved it there in this PR), so I'll just delete this and use what we've got above.

jashapiro · 2023-12-18T17:27:10Z

R/merge_sce_list.R

@@ -255,12 +298,20 @@ prepare_sce_for_merge <- function(
  sce_name,
  batch_column,
  cell_id_column,
-  shared_features,
+  all_features,


Suggested change

all_features,

merge_features,

jashapiro · 2023-12-18T17:35:11Z

R/merge_sce_list.R

+  if (length(feature_diff) == 0) {
+
+    # Simply reorder and return
+    return(sce[all_features,])


Suggested change

if (length(feature_diff) == 0) {

# Simply reorder and return

return(sce[all_features,])

if (length(missing_features) == 0) {

# Simply reorder and/or filter and return

return(sce[features,])

jashapiro · 2023-12-18T17:48:07Z

R/merge_sce_list.R

+    new_matrix <- matrix(
+      data = NA,
+      nrow = length(all_features),
+      ncol = ncol(sce),
+      dimnames = list(
+        all_features,
+        colnames(sce)
+      )
+    )


Name updates, but also we can save some effort later if we make this a Matrix::Matrix to match the SCE object and then we can make it sparse, as we expect it to be. Note that I also use NA_real_ to get the right type info in...

Suggested change

new_matrix <- matrix(

data = NA,

nrow = length(all_features),

ncol = ncol(sce),

dimnames = list(

all_features,

colnames(sce)

)

)

new_matrix <- Matrix::Matrix(

data = NA_real_,

nrow = length(all_features),

ncol = ncol(sce),

dimnames = list(

all_features,

colnames(sce)

)

sparse = TRUE

)

jashapiro · 2023-12-18T17:52:23Z

R/merge_sce_list.R

+          new_matrix[present_features, colnames(sce)] <- as.matrix(
+            assay(sce, assay_name)[present_features, colnames(sce)]
+          )


If you use Matrix::Matrix above, you should not need to do the conversion:

Suggested change

new_matrix[present_features, colnames(sce)] <- as.matrix(

assay(sce, assay_name)[present_features, colnames(sce)]

)

new_matrix[present_features, colnames(sce)] <- (

assay(sce, assay_name)[present_features, colnames(sce)]

)

I think I would move the new matrix creation into this map function too. I'm not sure how I feel about reusing the existing one in the case of multiple assays. It might be fine, but if it is it means we are making a copy anyway, so there is not going to be a cost to creating it here.

jashapiro · 2023-12-18T17:55:36Z

R/merge_sce_list.R

+    rowdata_colnames <- colnames(rowData(sce))
+    new_rowData <- matrix(


This can't be a matrix, because data in rowData is of mixed types. It is probably fine to start with the matrix here to get the dimensions right, but we want the new rowData to be a data frame of some kind.

jashapiro · 2023-12-18T17:56:36Z

R/merge_sce_list.R

+    new_rowData[present_features, rowdata_colnames] <- as.matrix(
+      rowData(sce)[present_features, rowdata_colnames]
+    )


We don't want to do any matrix conversion here, because that will not preserve types.

Suggested change

new_rowData[present_features, rowdata_colnames] <- as.matrix(

rowData(sce)[present_features, rowdata_colnames]

)

new_rowData[present_features, rowdata_colnames] <- (

rowData(sce)[present_features, rowdata_colnames]

)

sjspielman · 2023-12-18T18:43:51Z

After some additional discussion, we'll circle back to this later with #249.

Note that some of these commits will probably be cherry-picked out into other PRs to address #150.

sjspielman · 2024-01-02T19:29:48Z

We're very unlikely to proceed with this code, so I'm going to close this PR out. But the branch lives on and is linked from #249 juuust in case.

sjspielman added 5 commits December 18, 2023 09:09

Fail when altExps of the same name have a different set of features

7c464ae

some spacing and a TODO for which current test will need updating

13d0c7d

Reorg: Move all checks to the top of the merge_sce_list function, and…

4bc3c2b

… prepare to update code with _union_ of main SCE features

Add create_sce_with_all_features function and unit test - we can now …

f35b7ce

…create a new SCE with all expected features

Update existing test to match code changes

5ced475

sjspielman requested a review from jashapiro December 18, 2023 16:11

woops, need to document

6dc245d

jashapiro reviewed Dec 18, 2023

View reviewed changes

sjspielman mentioned this pull request Dec 18, 2023

Consider union of genes when merging, not intersection #249

Open

sjspielman marked this pull request as draft December 18, 2023 18:43

sjspielman changed the title ~~Merging altexps: Use union of features~~ Merging SCEs: Use union of features Dec 19, 2023

sjspielman closed this Jan 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merging SCEs: Use union of features #247

Merging SCEs: Use union of features #247

sjspielman commented Dec 18, 2023

jashapiro commented Dec 18, 2023

jashapiro commented Dec 18, 2023

jashapiro left a comment

jashapiro Dec 18, 2023

jashapiro Dec 18, 2023

jashapiro Dec 18, 2023

jashapiro Dec 18, 2023

sjspielman Dec 18, 2023

jashapiro Dec 18, 2023

jashapiro Dec 18, 2023

jashapiro Dec 18, 2023

jashapiro Dec 18, 2023

jashapiro Dec 18, 2023

jashapiro Dec 18, 2023

sjspielman commented Dec 18, 2023

sjspielman commented Jan 2, 2024

	all_features = sce_full_features,
	merge_features = sce_shared_features,

		rowdata_colnames <- colnames(rowData(sce))
		new_rowData <- matrix(

Merging SCEs: Use union of features #247

Merging SCEs: Use union of features #247

Conversation

sjspielman commented Dec 18, 2023

jashapiro commented Dec 18, 2023

jashapiro commented Dec 18, 2023

jashapiro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sjspielman commented Dec 18, 2023

sjspielman commented Jan 2, 2024