-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add calculate_cell_cluster_metrics() function #23
base: main
Are you sure you want to change the base?
Conversation
for more information, see https://pre-commit.ci
R/evaluate-clusters.R
Outdated
#' The cell id column's values should match either the PC matrix row names, or the | ||
#' SingleCellExperiment/Seurat object cell ids. Typically this data frame will be | ||
#' output from the `rOpenScPCA::calculate_clusters()` function. | ||
#' @param ... Additional argument are passed on to the respective `calculate_purity()` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recognize that I didn't look to closely into these arguments and if someone wanted to specify a different argument for purity versus silhouette they would not be able to do so this way because all arguments get pass to both functions. If we think this will be a common use case I can go back and adjust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these arguments are passed to different functions with different expectations, these are likely to conflict. I think it is probably better for this convenience function not to use ...
. But I don't know that we need to pass further options; if something more complex is needed, the user can run purrr::map
on their own.
R/evaluate-clusters.R
Outdated
#' | ||
#' set.seed(2024) | ||
#' | ||
#' sce_object <- splatter::simpleSimulate(nGenes = 1000, verbose = FALSE) |> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added these steps just because I wanted to illustrating how I was testing this but if this is too much detail for this example we can trim this down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's probably too much detail just in the sense that a novice might look at this and say, "oh no, do i need splatter?"
I would simplify to by assuming an sce_object
variable is already known/exists. Consistent with other evaluation function examples, you don't need to pull out the PCA either; just pass in the object directly. Let's have the example therefore just "run" (i.e., keep the \dontrun{}
construct!) sweep_clusters()
and calculate_cell_cluster_metrics()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an example sce_object that already exists I can pull from? If so how do I call it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these examples are not run, it is fine to just "assume" an sce_object
exists for this section, and you can start from the sweep_clusters()
step (skipping the prep).
Still left to do here:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this contribution! My main comment here (aside from my earlier misread) is that I still wonder if we want this function to work on a list, rather than just on a single clustering data frame. The convenience of calculating all the metrics at once makes sense to me, but if we have a function that evaluates a single data frame, then turning that into evaluating the list is a very simple addition of a purrr
wrapper, and I think that seems more transparent. But others may disagree!
R/evaluate-clusters.R
Outdated
#' The cell id column's values should match either the PC matrix row names, or the | ||
#' SingleCellExperiment/Seurat object cell ids. Typically this data frame will be | ||
#' output from the `rOpenScPCA::calculate_clusters()` function. | ||
#' @param ... Additional argument are passed on to the respective `calculate_purity()` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these arguments are passed to different functions with different expectations, these are likely to conflict. I think it is probably better for this convenience function not to use ...
. But I don't know that we need to pass further options; if something more complex is needed, the user can run purrr::map
on their own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for starting this!! I left some initial comments here, but before I look more, I have a thought about the use cases for this function - currently it's written to only run on a sweep list, but it makes sense to me to also make this flexible enough to run on single data frame (e.g., not a list of data frames). This means updates related to the input argument sweep_list
:
- First, give it a more flexible name... Maybe something like
cluster_results
? I don't love it, but I'm not sure of how else to communicate that it might be either a list of dfs or df, so really I don't hate it either! - Second, add a check if it's a data frame and if so, make it a list of length 1 with the given data frame in it. So the code might look like:
if (data frame) { listify it}
else { run the existing stopifnot checks}
- I suppose we'd need another check to determine whether to return the list or the first index from the eval'd df, too, so you might actually structure this code by first defining an
is_df
variable or so, and using that for both checks (the opening sanity check to transform it into a list to play nicely with purrr, and the final check for what to return.
R/evaluate-clusters.R
Outdated
#' | ||
#' This wrapper function can be used to evaluate clusters calculated using `sweep_clusters()` function. | ||
#' Input should be be a list of data frames with the resulting clusters from all parameter combinations provided to | ||
#' the `sweep_clusters()` function. Output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Output"...? (missing rest of sentence)
R/evaluate-clusters.R
Outdated
#' | ||
calculate_cell_cluster_metrics <- function(x, | ||
sweep_list, | ||
evals = c("purity", "silhouette"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To match the function title more closely (and you'll want to replace this variable name when used in the function, too):
evals = c("purity", "silhouette"), | |
metric = c("purity", "silhouette"), |
R/evaluate-clusters.R
Outdated
#' | ||
#' set.seed(2024) | ||
#' | ||
#' sce_object <- splatter::simpleSimulate(nGenes = 1000, verbose = FALSE) |> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's probably too much detail just in the sense that a novice might look at this and say, "oh no, do i need splatter?"
I would simplify to by assuming an sce_object
variable is already known/exists. Consistent with other evaluation function examples, you don't need to pull out the PCA either; just pass in the object directly. Let's have the example therefore just "run" (i.e., keep the \dontrun{}
construct!) sweep_clusters()
and calculate_cell_cluster_metrics()
Co-authored-by: Joshua Shapiro <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
for more information, see https://pre-commit.ci
…savvy/multi_sweep
for more information, see https://pre-commit.ci
…savvy/multi_sweep
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
Co-authored-by: Stephanie Spielman <[email protected]>
for more information, see https://pre-commit.ci
Background
This PR is for #10. Tried to follow the context there. But since this is my first PR on this project I might be missing some insights so Im just posting this as a draft first so that someone can check that I'm in the generally right direction.
Summary
This is a function that takes output from
sweep_clusters()
and runs evals on it. It can runcalculate_silhouette()
and/orcalculate_purity()
on all elements of the list of data frames that are outputed fromsweep_clusters()
.An additional very nit picky thing I have in here but I was very minorly thrown off by the examples in sweep_clusters() being named
cluster_df
when they are actually lists of data frames and not data frames out right. If you don't like this change no worries. The documentation itself is very clear but I'm just a person who kinda goes straight for the examples first.Requested feedback
Am I understanding this right? It doesn't seem like this function will be that useful but I trust ya'll have more context and knowledge about the needs of the project than I who just started looking at this stuff last week lol.
Side side question that is also very minor, can we name the data frames in the list according to the combo of their parameters or do we find that would be too clunky? I like names in my lists but this is maybe a cansavvy quirk we don't need to subject everyone else to.