Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging SCEs: Use union of features #247

Closed
wants to merge 6 commits into from

Conversation

sjspielman
Copy link
Member

Towards #150

This PR takes some next steps towards simplifying altexp merging:

  • A little reorganization in merge_sce_list() so that all checks that might cause us to stop() are at the top of the function, for faster failing.
  • Add a check for altExps: Each given altexp name must have the same features, or we stop().
  • Create new function create_sce_with_all_features(), which is currently only used for the main SCE (altExp usage in next PR), which ensures that a given vector of features (here, vector representing union of relevant features) is always reflected in the SCE. I am extra careful to ensure the row order matches, as well.
    • There is one TODO here about how to handle rowData columns like gene_ids which may get filled with NAs. I haven't really explored this situation in depth; waiting to do so in the next PR(s) when I start applying this function to the altExps where, at least for our current ScPCA data, this is likely to matter.
    • If you have specific thoughts though, happy to get going on this TODO now!
  • Tests are present via actual test-driven development

@sjspielman sjspielman requested a review from jashapiro December 18, 2023 16:11
@jashapiro
Copy link
Member

  • Create new function create_sce_with_all_features(), which is currently only used for the main SCE (altExp usage in next PR), which ensures that a given vector of features (here, vector representing union of relevant features) is always reflected in the SCE. I am extra careful to ensure the row order matches, as well.

I actually don't think we want to do this for main Exps. Sorry if I was not clear about that! There are some stats that could break with NA values, and I think it is better for us to leave the main experiments as they are, the intersection of the sampled genes.

We can keep this function around, but I don't think we want it to be used right now. For the AltExps, there may be an argument to use it, but I think what might actually be better to create separate AltExps if there are two with the same name that have different feature complements, e.g., adt-1 and adt-2.

@jashapiro
Copy link
Member

  • Create new function create_sce_with_all_features(), which is currently only used for the main SCE (altExp usage in next PR), which ensures that a given vector of features (here, vector representing union of relevant features) is always reflected in the SCE. I am extra careful to ensure the row order matches, as well.

I actually don't think we want to do this for main Exps. Sorry if I was not clear about that! There are some stats that could break with NA values, and I think it is better for us to leave the main experiments as they are, the intersection of the sampled genes.

We can keep this function around, but I don't think we want it to be used right now. For the AltExps, there may be an argument to use it, but I think what might actually be better to create separate AltExps if there are two with the same name that have different feature complements, e.g., adt-1 and adt-2.

Just following up on this a bit... having started to look at the code: I think you can leave most of what you have in the prepare function: we just won't give it the union by default...

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a number of comments below, but I am sort of undecided about how much of the "expansion" capabilities we want to keep. Part of me thinks we don't need it yet, as per my earlier comment I don't think we want to expand main SCE assays at this point. We may want to do that for AltExps, but for now I think merging only if the AltExps match is sufficient (we could allow feature order changes, if we want to be generous).

But I don't want to abandon the work you have put in here, as it might be useful in the future! So the comments below are about what might happen if we wanted to keep this capability:

For the assays themselves, this looks fine, with some small modifications to keep things sparse.

But I think the rowData expansion needs to be rewritten, as the current codes doesn't handle mixed data types correctly. I would expect a separate function that takes a data frame and a list of features is probably going to be useful here: Basically much like the create_sce_with_features but create_rowData_with_features (I deleted all) to handle the bookkeeping there, as I think it will be a bit complicated and worth testing directly.


I think the best plan might be to branch off this code and link the branch in a new issue: "Allow SCE merge with unions of features" that we can come back to later, then go back to finishing the main feature we want for now, which would leave the main object merge as it is and add the ability to include AltExps if they are all the same type (or missing).

Comment on lines +100 to +103
purrr::map(\(sce){
all(c("library_id", "sample_id") %in% names(metadata(sce)))
}) |>
unlist()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
purrr::map(\(sce){
all(c("library_id", "sample_id") %in% names(metadata(sce)))
}) |>
unlist()
purrr::map_lgl(\(sce){
all(c("library_id", "sample_id") %in% names(metadata(sce)))
})

Comment on lines +128 to +132
# find their union of features
altexp_name_features <- altexp_list |>
purrr::map(rownames) |>
purrr::reduce(union) |>
sort()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we are going to test that they are all equal, we don't need to get the union here: we can just test that all are equal to the first.

sce_list <- sce_list |>
purrr::imap(
prepare_sce_for_merge,
batch_column = batch_column,
cell_id_column = cell_id_column,
shared_features = shared_features,
all_features = sce_full_features,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just call this merge_features? Nothing says we need it to be "all" or "shared", etc.

Also updating the variable to match above changes

Suggested change
all_features = sce_full_features,
merge_features = sce_shared_features,

Comment on lines +150 to +154
# First, obtain the union of features for all (main) SCE objects
# this will also be the final order of features/rows in the final SCE
sce_full_features <- sce_list |>
purrr::map(rownames) |>
purrr::reduce(union)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going back to intersect here, because I think that is a better default. We could add this as an option to the function, but there may be unintended consequences to having NA values in the matrix for downstream steps, so I don't want to do this now.

Suggested change
# First, obtain the union of features for all (main) SCE objects
# this will also be the final order of features/rows in the final SCE
sce_full_features <- sce_list |>
purrr::map(rownames) |>
purrr::reduce(union)
# First, obtain the intersection of features for all (main) SCE objects
# this will also be the final order of features/rows in the final SCE
sce_shared_features <- sce_list |>
purrr::map(rownames) |>
purrr::reduce(intersect)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting this intersect code is already higher up in the script as a check to ensure there are intersecting features (I moved it there in this PR), so I'll just delete this and use what we've got above.

@@ -255,12 +298,20 @@ prepare_sce_for_merge <- function(
sce_name,
batch_column,
cell_id_column,
shared_features,
all_features,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
all_features,
merge_features,

Comment on lines +402 to +405
if (length(feature_diff) == 0) {

# Simply reorder and return
return(sce[all_features,])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (length(feature_diff) == 0) {
# Simply reorder and return
return(sce[all_features,])
if (length(missing_features) == 0) {
# Simply reorder and/or filter and return
return(sce[features,])

Comment on lines +410 to +418
new_matrix <- matrix(
data = NA,
nrow = length(all_features),
ncol = ncol(sce),
dimnames = list(
all_features,
colnames(sce)
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name updates, but also we can save some effort later if we make this a Matrix::Matrix to match the SCE object and then we can make it sparse, as we expect it to be. Note that I also use NA_real_ to get the right type info in...

Suggested change
new_matrix <- matrix(
data = NA,
nrow = length(all_features),
ncol = ncol(sce),
dimnames = list(
all_features,
colnames(sce)
)
)
new_matrix <- Matrix::Matrix(
data = NA_real_,
nrow = length(all_features),
ncol = ncol(sce),
dimnames = list(
all_features,
colnames(sce)
)
sparse = TRUE
)

Comment on lines +428 to +430
new_matrix[present_features, colnames(sce)] <- as.matrix(
assay(sce, assay_name)[present_features, colnames(sce)]
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use Matrix::Matrix above, you should not need to do the conversion:

Suggested change
new_matrix[present_features, colnames(sce)] <- as.matrix(
assay(sce, assay_name)[present_features, colnames(sce)]
)
new_matrix[present_features, colnames(sce)] <- (
assay(sce, assay_name)[present_features, colnames(sce)]
)

I think I would move the new matrix creation into this map function too. I'm not sure how I feel about reusing the existing one in the case of multiple assays. It might be fine, but if it is it means we are making a copy anyway, so there is not going to be a cost to creating it here.

Comment on lines +441 to +442
rowdata_colnames <- colnames(rowData(sce))
new_rowData <- matrix(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't be a matrix, because data in rowData is of mixed types. It is probably fine to start with the matrix here to get the dimensions right, but we want the new rowData to be a data frame of some kind.

Comment on lines +453 to +455
new_rowData[present_features, rowdata_colnames] <- as.matrix(
rowData(sce)[present_features, rowdata_colnames]
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to do any matrix conversion here, because that will not preserve types.

Suggested change
new_rowData[present_features, rowdata_colnames] <- as.matrix(
rowData(sce)[present_features, rowdata_colnames]
)
new_rowData[present_features, rowdata_colnames] <- (
rowData(sce)[present_features, rowdata_colnames]
)

@sjspielman
Copy link
Member Author

After some additional discussion, we'll circle back to this later with #249.

Note that some of these commits will probably be cherry-picked out into other PRs to address #150.

@sjspielman sjspielman marked this pull request as draft December 18, 2023 18:43
@sjspielman sjspielman changed the title Merging altexps: Use union of features Merging SCEs: Use union of features Dec 19, 2023
@sjspielman
Copy link
Member Author

We're very unlikely to proceed with this code, so I'm going to close this PR out. But the branch lives on and is linked from #249 juuust in case.

@sjspielman sjspielman closed this Jan 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants