Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add calculate_cell_cluster_metrics() function #23

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

cansavvy
Copy link
Collaborator

Background

This PR is for #10. Tried to follow the context there. But since this is my first PR on this project I might be missing some insights so Im just posting this as a draft first so that someone can check that I'm in the generally right direction.

Summary

This is a function that takes output from sweep_clusters() and runs evals on it. It can run calculate_silhouette() and/or calculate_purity() on all elements of the list of data frames that are outputed from sweep_clusters().

An additional very nit picky thing I have in here but I was very minorly thrown off by the examples in sweep_clusters() being named cluster_df when they are actually lists of data frames and not data frames out right. If you don't like this change no worries. The documentation itself is very clear but I'm just a person who kinda goes straight for the examples first.

Requested feedback

Am I understanding this right? It doesn't seem like this function will be that useful but I trust ya'll have more context and knowledge about the needs of the project than I who just started looking at this stuff last week lol.

Side side question that is also very minor, can we name the data frames in the list according to the combo of their parameters or do we find that would be too clunky? I like names in my lists but this is maybe a cansavvy quirk we don't need to subject everyone else to.

#' The cell id column's values should match either the PC matrix row names, or the
#' SingleCellExperiment/Seurat object cell ids. Typically this data frame will be
#' output from the `rOpenScPCA::calculate_clusters()` function.
#' @param ... Additional argument are passed on to the respective `calculate_purity()` and
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recognize that I didn't look to closely into these arguments and if someone wanted to specify a different argument for purity versus silhouette they would not be able to do so this way because all arguments get pass to both functions. If we think this will be a common use case I can go back and adjust.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these arguments are passed to different functions with different expectations, these are likely to conflict. I think it is probably better for this convenience function not to use .... But I don't know that we need to pass further options; if something more complex is needed, the user can run purrr::map on their own.

#'
#' set.seed(2024)
#'
#' sce_object <- splatter::simpleSimulate(nGenes = 1000, verbose = FALSE) |>
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added these steps just because I wanted to illustrating how I was testing this but if this is too much detail for this example we can trim this down.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's probably too much detail just in the sense that a novice might look at this and say, "oh no, do i need splatter?"

I would simplify to by assuming an sce_object variable is already known/exists. Consistent with other evaluation function examples, you don't need to pull out the PCA either; just pass in the object directly. Let's have the example therefore just "run" (i.e., keep the \dontrun{} construct!) sweep_clusters() and calculate_cell_cluster_metrics()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an example sce_object that already exists I can pull from? If so how do I call it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these examples are not run, it is fine to just "assume" an sce_object exists for this section, and you can start from the sweep_clusters() step (skipping the prep).

@cansavvy cansavvy requested a review from sjspielman December 18, 2024 15:00
@cansavvy
Copy link
Collaborator Author

cansavvy commented Dec 18, 2024

Still left to do here:

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution! My main comment here (aside from my earlier misread) is that I still wonder if we want this function to work on a list, rather than just on a single clustering data frame. The convenience of calculating all the metrics at once makes sense to me, but if we have a function that evaluates a single data frame, then turning that into evaluating the list is a very simple addition of a purrr wrapper, and I think that seems more transparent. But others may disagree!

R/evaluate-clusters.R Outdated Show resolved Hide resolved
#' The cell id column's values should match either the PC matrix row names, or the
#' SingleCellExperiment/Seurat object cell ids. Typically this data frame will be
#' output from the `rOpenScPCA::calculate_clusters()` function.
#' @param ... Additional argument are passed on to the respective `calculate_purity()` and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these arguments are passed to different functions with different expectations, these are likely to conflict. I think it is probably better for this convenience function not to use .... But I don't know that we need to pass further options; if something more complex is needed, the user can run purrr::map on their own.

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting this!! I left some initial comments here, but before I look more, I have a thought about the use cases for this function - currently it's written to only run on a sweep list, but it makes sense to me to also make this flexible enough to run on single data frame (e.g., not a list of data frames). This means updates related to the input argument sweep_list:

  • First, give it a more flexible name... Maybe something like cluster_results? I don't love it, but I'm not sure of how else to communicate that it might be either a list of dfs or df, so really I don't hate it either!
  • Second, add a check if it's a data frame and if so, make it a list of length 1 with the given data frame in it. So the code might look like:
if (data frame) { listify it}
else { run the existing stopifnot checks}
  • I suppose we'd need another check to determine whether to return the list or the first index from the eval'd df, too, so you might actually structure this code by first defining an is_df variable or so, and using that for both checks (the opening sanity check to transform it into a list to play nicely with purrr, and the final check for what to return.

#'
#' This wrapper function can be used to evaluate clusters calculated using `sweep_clusters()` function.
#' Input should be be a list of data frames with the resulting clusters from all parameter combinations provided to
#' the `sweep_clusters()` function. Output
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Output"...? (missing rest of sentence)

R/evaluate-clusters.R Outdated Show resolved Hide resolved
#'
calculate_cell_cluster_metrics <- function(x,
sweep_list,
evals = c("purity", "silhouette"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To match the function title more closely (and you'll want to replace this variable name when used in the function, too):

Suggested change
evals = c("purity", "silhouette"),
metric = c("purity", "silhouette"),

R/evaluate-clusters.R Outdated Show resolved Hide resolved
R/evaluate-clusters.R Outdated Show resolved Hide resolved
#'
#' set.seed(2024)
#'
#' sce_object <- splatter::simpleSimulate(nGenes = 1000, verbose = FALSE) |>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's probably too much detail just in the sense that a novice might look at this and say, "oh no, do i need splatter?"

I would simplify to by assuming an sce_object variable is already known/exists. Consistent with other evaluation function examples, you don't need to pull out the PCA either; just pass in the object directly. Let's have the example therefore just "run" (i.e., keep the \dontrun{} construct!) sweep_clusters() and calculate_cell_cluster_metrics()

@cansavvy cansavvy marked this pull request as ready for review January 24, 2025 18:58
@cansavvy cansavvy changed the title DRAFT: calculate_cell_cluster_metrics() function Add calculate_cell_cluster_metrics() function Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants