-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merging SCEs: Use union of features #247
Conversation
… prepare to update code with _union_ of main SCE features
…create a new SCE with all expected features
I actually don't think we want to do this for main Exps. Sorry if I was not clear about that! There are some stats that could break with NA values, and I think it is better for us to leave the main experiments as they are, the intersection of the sampled genes. We can keep this function around, but I don't think we want it to be used right now. For the AltExps, there may be an argument to use it, but I think what might actually be better to create separate AltExps if there are two with the same name that have different feature complements, e.g., |
Just following up on this a bit... having started to look at the code: I think you can leave most of what you have in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a number of comments below, but I am sort of undecided about how much of the "expansion" capabilities we want to keep. Part of me thinks we don't need it yet, as per my earlier comment I don't think we want to expand main SCE assays at this point. We may want to do that for AltExps, but for now I think merging only if the AltExps match is sufficient (we could allow feature order changes, if we want to be generous).
But I don't want to abandon the work you have put in here, as it might be useful in the future! So the comments below are about what might happen if we wanted to keep this capability:
For the assays themselves, this looks fine, with some small modifications to keep things sparse.
But I think the rowData expansion needs to be rewritten, as the current codes doesn't handle mixed data types correctly. I would expect a separate function that takes a data frame and a list of features is probably going to be useful here: Basically much like the create_sce_with_features
but create_rowData_with_features
(I deleted all
) to handle the bookkeeping there, as I think it will be a bit complicated and worth testing directly.
I think the best plan might be to branch off this code and link the branch in a new issue: "Allow SCE merge with unions of features" that we can come back to later, then go back to finishing the main feature we want for now, which would leave the main object merge as it is and add the ability to include AltExps if they are all the same type (or missing).
purrr::map(\(sce){ | ||
all(c("library_id", "sample_id") %in% names(metadata(sce))) | ||
}) |> | ||
unlist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
purrr::map(\(sce){ | |
all(c("library_id", "sample_id") %in% names(metadata(sce))) | |
}) |> | |
unlist() | |
purrr::map_lgl(\(sce){ | |
all(c("library_id", "sample_id") %in% names(metadata(sce))) | |
}) |
# find their union of features | ||
altexp_name_features <- altexp_list |> | ||
purrr::map(rownames) |> | ||
purrr::reduce(union) |> | ||
sort() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we are going to test that they are all equal, we don't need to get the union here: we can just test that all are equal to the first.
sce_list <- sce_list |> | ||
purrr::imap( | ||
prepare_sce_for_merge, | ||
batch_column = batch_column, | ||
cell_id_column = cell_id_column, | ||
shared_features = shared_features, | ||
all_features = sce_full_features, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just call this merge_features
? Nothing says we need it to be "all" or "shared", etc.
Also updating the variable to match above changes
all_features = sce_full_features, | |
merge_features = sce_shared_features, |
# First, obtain the union of features for all (main) SCE objects | ||
# this will also be the final order of features/rows in the final SCE | ||
sce_full_features <- sce_list |> | ||
purrr::map(rownames) |> | ||
purrr::reduce(union) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going back to intersect here, because I think that is a better default. We could add this as an option to the function, but there may be unintended consequences to having NA values in the matrix for downstream steps, so I don't want to do this now.
# First, obtain the union of features for all (main) SCE objects | |
# this will also be the final order of features/rows in the final SCE | |
sce_full_features <- sce_list |> | |
purrr::map(rownames) |> | |
purrr::reduce(union) | |
# First, obtain the intersection of features for all (main) SCE objects | |
# this will also be the final order of features/rows in the final SCE | |
sce_shared_features <- sce_list |> | |
purrr::map(rownames) |> | |
purrr::reduce(intersect) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting this intersect
code is already higher up in the script as a check to ensure there are intersecting features (I moved it there in this PR), so I'll just delete this and use what we've got above.
@@ -255,12 +298,20 @@ prepare_sce_for_merge <- function( | |||
sce_name, | |||
batch_column, | |||
cell_id_column, | |||
shared_features, | |||
all_features, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all_features, | |
merge_features, |
if (length(feature_diff) == 0) { | ||
|
||
# Simply reorder and return | ||
return(sce[all_features,]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (length(feature_diff) == 0) { | |
# Simply reorder and return | |
return(sce[all_features,]) | |
if (length(missing_features) == 0) { | |
# Simply reorder and/or filter and return | |
return(sce[features,]) |
new_matrix <- matrix( | ||
data = NA, | ||
nrow = length(all_features), | ||
ncol = ncol(sce), | ||
dimnames = list( | ||
all_features, | ||
colnames(sce) | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name updates, but also we can save some effort later if we make this a Matrix::Matrix to match the SCE object and then we can make it sparse, as we expect it to be. Note that I also use NA_real_
to get the right type info in...
new_matrix <- matrix( | |
data = NA, | |
nrow = length(all_features), | |
ncol = ncol(sce), | |
dimnames = list( | |
all_features, | |
colnames(sce) | |
) | |
) | |
new_matrix <- Matrix::Matrix( | |
data = NA_real_, | |
nrow = length(all_features), | |
ncol = ncol(sce), | |
dimnames = list( | |
all_features, | |
colnames(sce) | |
) | |
sparse = TRUE | |
) |
new_matrix[present_features, colnames(sce)] <- as.matrix( | ||
assay(sce, assay_name)[present_features, colnames(sce)] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use Matrix::Matrix above, you should not need to do the conversion:
new_matrix[present_features, colnames(sce)] <- as.matrix( | |
assay(sce, assay_name)[present_features, colnames(sce)] | |
) | |
new_matrix[present_features, colnames(sce)] <- ( | |
assay(sce, assay_name)[present_features, colnames(sce)] | |
) |
I think I would move the new matrix creation into this map function too. I'm not sure how I feel about reusing the existing one in the case of multiple assays. It might be fine, but if it is it means we are making a copy anyway, so there is not going to be a cost to creating it here.
rowdata_colnames <- colnames(rowData(sce)) | ||
new_rowData <- matrix( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can't be a matrix, because data in rowData is of mixed types. It is probably fine to start with the matrix here to get the dimensions right, but we want the new rowData to be a data frame of some kind.
new_rowData[present_features, rowdata_colnames] <- as.matrix( | ||
rowData(sce)[present_features, rowdata_colnames] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to do any matrix conversion here, because that will not preserve types.
new_rowData[present_features, rowdata_colnames] <- as.matrix( | |
rowData(sce)[present_features, rowdata_colnames] | |
) | |
new_rowData[present_features, rowdata_colnames] <- ( | |
rowData(sce)[present_features, rowdata_colnames] | |
) |
We're very unlikely to proceed with this code, so I'm going to close this PR out. But the branch lives on and is linked from #249 juuust in case. |
Towards #150
This PR takes some next steps towards simplifying altexp merging:
merge_sce_list()
so that all checks that might cause us tostop()
are at the top of the function, for faster failing.stop()
.create_sce_with_all_features()
, which is currently only used for the main SCE (altExp usage in next PR), which ensures that a given vector of features (here, vector representing union of relevant features) is always reflected in the SCE. I am extra careful to ensure the row order matches, as well.TODO
here about how to handle rowData columns likegene_ids
which may get filled withNA
s. I haven't really explored this situation in depth; waiting to do so in the next PR(s) when I start applying this function to the altExps where, at least for our current ScPCA data, this is likely to matter.