Add jaccard calculation code #521

sjspielman · 2023-10-19T17:38:03Z

Towards #516
Stacked on #520

This PR starts the process of updating heatmaps. Here, I made two broad changes

Moved heatmaps above diagnostic section
Wrote code for calculating jaccard similarity matrices

I did not yet remove any of the current heatmap code. Everything in the heatmap section that comes after the sentence I added "Note that the remaining heatmap code is going to be updated next!" is the same as it was before. The two chunks in the heatmap section that come before that sentence are focus to look during review.

I wrote two functions which, together, are calculating jaccard similarity between two categorical columns, e.g. cluster & singler annotations. The function calculate_jaccard_similarity() does the actual calculation for a given jaccard similarity value, and the function calculate_jaccard_matrix() wraps it to run over all values in the two columns of interest and ultimately return a matrix for use in a heatmap.

Using these functions, I calculate a matrix for each celltype annotation compared to clusters, and I rbind() them together as I go. During this, I also track the size of each matrix so we can split a ComplexHeatmap accordingly.

The code is all working as expected, with seemingly correct calculations coming out, so I expect most discussion in this PR will be about the specific strategies I've used and how those might be modified to lay some solid ground for next steps!

…d similarity matrix

…-heatmaps-calc-jaccard

allyhawkins · 2023-10-19T22:56:21Z

Okay, I've been looking at the function you have for calculating Jaccard here and doing some reading to figure out exactly how we should apply the Jaccard Index to our data. My overall understanding is that we are calculating the Jaccard Index between the barcodes shared for each group (e.g., barcodes in cluster A and barcodes in cell type A). And then we are doing that across all combinations, obtaining a value for each combination.

I think a much simpler way of doing this would involve first creating a binary matrix with the rows as the group names (like cluster 1, cluster 2, T cell, B cell) and the columns as the barcodes. Then you want to calculate the Jaccard index between every possible combination of cluster1/T cell, cluster 1/ cluster 2, cluster 1/ B cell, etc. Once you have the binary matrix, you can use the dist() function to calculate the distance between the rows of the matrix. By using the method = "binary" option, you get the Jaccard Distance which is just 1 - Jaccard Similarity Index.

So all that to say, I think you can condense the two functions you have into one function that takes in the cell metadata and the two categories to compare and then returns the matrix to plot as a heatmap.

I spent probably too much time playing around with this, but I think you could condense a lot of what you have to something like this:

groups_to_compare <- c(colname1, colname2)

binary_mtx <- groups_to_compare |> 
  purrr::map(\(colname){
    
    # create binary matrix with group options(like clusters) as rows and barcodes as columns 
    test_2 <- celltype_df |> 
      dplyr::select(c("barcodes", {{colname}})) |> 
      # remove any NAs? This is just here to get it to work
      # but I think we want to keep NA and just relabel as Unknown
      tidyr::drop_na() |> 
      tidyr::pivot_wider(
        id_cols = colname,
        names_from = "barcodes",
        values_from = "barcodes",
        values_fn = length,
        values_fill = 0
      ) |>
      # remove the barcode column 
      tibble::column_to_rownames(colname)
    
  }) |> 
  dplyr::bind_rows()

# get the distance matrix using binary
# binary calcualtes the Jaccard Distance which is the inverse of Jaccard Similarity Index
dist_obj <- dist(binary_mtx, method = "binary")
# 1 - Jaccard Distance = Jaccard Similarity 
jaccard_mtx <- 1 - as.matrix(dist_obj)

# get the col/ row names to keep  
group_1_values <- celltype_df |>
  dplyr::pull({{colname1}}) |> 
  unique()

group_2_values <- celltype_df |> 
  dplyr::pull({{colname2}}) |> 
  unique() |> 
  na.omit()
  
# get a matrix with group 1 in the rows and group 2 in the columns 
jaccard_subset_mtx <- jaccard_mtx[group_1_values, group_2_values]

jashapiro · 2023-10-20T01:17:37Z

I'm sure you don't want a third way, but here we are...

I think I kind of like Stephanie's a bit better, but I have what might be some simplifications to reduce passing around big objects. I know this isn't the full thing, keeping column names and such, but I think it gets most of the way there. The main innovation here is to use split(), which is really handy for making the lists of barcodes we need for later analysis.

# tiny jaccard function to save later typing
jaccard <- function(vec1, vec2){
  length(intersect(vec1, vec2)) / length(union(vec1, vec2))
}

make_jaccard_df <- function(celltype_df, colname1, colname2){
  # make lists of barcodes for each classifier's assignment, named by the classifier values
  id1_list <- split(celltype_df$barcodes, celltype_df[[colname1]])
  id2_list <- split(celltype_df$barcodes, celltype_df[[colname2]])
  
  # create the grid of comparisons
  cross_df <- tidyr::expand_grid(id1 = names(id1_list), id2 = names(id2_list))
  
  # calculate scores using split lists & ids
  jaccard_scores <- cross_df |> 
    purrr::pmap_dbl(\(id1, id2){
      jaccard(id1_list[[id1]], id2_list[[id2]])
    })
  
  # add scores to the comparison grid
  jaccard_df <- cross_df |> 
    dplyr::mutate(jaccard = jaccard_scores)
  return(jaccard_df)
}

I didn't turn the resulting data frame back into a matrix, and didn't rename any columns, you already have the code for that.

So... options!

allyhawkins · 2023-10-20T13:30:39Z

Just chiming in to say I like the option that Josh posted. I think it's a bit easier to follow and simpler than what is currently there.

sjspielman · 2023-10-27T17:44:58Z

I've updated the calculations to use Josh's version, and added rxoygen comments all around, and ensured the function returns a matrix for subsequent rbind'ing. If there are any further simplifications we can thing of for combining the matrices, I'm all ears!

templates/qc_report/celltypes_supplemental_report.rmd

jashapiro · 2023-10-27T17:59:34Z

templates/qc_report/celltypes_supplemental_report.rmd

+    dplyr::mutate(jaccard = jaccard_scores) |>
+    # convert to matrix
+    tidyr::pivot_wider(
+      names_from = id1, 


Separately, I think we want to do the opposite matrix transformation here, so we get coordinates that are [id1, id2]

Co-authored-by: Joshua Shapiro <[email protected]>

jashapiro · 2023-10-27T18:08:04Z

templates/qc_report/celltypes_supplemental_report.rmd

+
+if (has_submitter) {
+  cluster_submitter_jaccard_matrix <- make_jaccard_matrix(celltype_df, "cluster", "submitter_celltype_annotation")
+  jaccard_mat <- rbind(jaccard_mat, cluster_submitter_jaccard_matrix)


I think we want to get the cluster names onto the rownames of this matrix somehow? I think split(celltype_df$cluster) (which is effectively called in the make_jaccard_matrix) will always return the same order, which I would expect to be the same as unique(celltype_df$cluster), but it might be worth using join to combine these matrices?

They already are in the matrices from the jaccard calculation itself, with the tibble::column_to_rownames(var = "id1") code, unless I've misunderstood what you're asking here?

I think they would be lost in the rbind, but if we are not going to rbind and will use the ComplexHeatmap function instead, this will be irrelevant.

jashapiro · 2023-10-27T18:10:09Z

I wonder if we don't want to actually bother joining the matrices, but leave them as a list, so we can let ComplexHeatMap handle it? https://jokergoo.github.io/ComplexHeatmap-reference/book/a-list-of-heatmaps.html

sjspielman · 2023-10-27T18:15:12Z

I wonder if we don't want to actually bother joining the matrices, but leave them as a list, so we can let ComplexHeatMap handle it?

👏 great find, concatenating looks much better than splitting!

…d'ing them all. no need for split indices now either

jashapiro · 2023-10-27T18:38:04Z

I think the code you added here looks fine, but I was thrown off by the diff, which makes it look like you updated more than you did... Moving a big chunk of code (as we wanted to) makes for a big diff!

My main note, I think, would be to move the jaccard functions up to probably just after the function imports?

sjspielman · 2023-10-27T18:45:43Z

but I was thrown off by the diff, which makes it look like you updated more than you did?

At some point I had accidentally committed my lines of local dev code (define/read in a real SCE) so it was also removed there, so maybe this is why it looks that way... I don't see any specific weirdness lingering, though!

My main note, I think, would be to move the jaccard functions up to probably just after the function imports?

👍

And circling back...

I wonder if we don't want to actually bother joining the matrices, but leave them as a list, so we can let ComplexHeatMap handle it?

👏 great find, concatenating looks much better than splitting!

I've started to think about next steps, and thinking about how this approach will work with the legend, which we may not have discussed specifically yet!
Will we want shared 1 legend for a group of stacked heatmaps, or a legend for each heatmap in the stack? If we really want 1 shared legend, I think (?) we'll need to revert back to a splitting approach, but if 3 legends is ok, then concatenation should be fine. If every matrix had a full range of values 0-1, this would probably not matter, but of course that's not how the data will look.

jashapiro · 2023-10-27T18:51:39Z

Will we want shared 1 legend for a group of stacked heatmaps, or a legend for each heatmap in the stack? If we really want 1 shared legend, I think (?) we'll need to revert back to a splitting approach, but if 3 legends is ok, then concatenation should be fine. If every matrix had a full range of values 0-1, this would probably not matter, but of course that's not how the data will look.

All the heatmaps should be on the same scale of 0-1, no matter what the values, so we should be setting that manually anyway.

sjspielman · 2023-10-27T19:58:32Z

Testing things out, we can certainly fix the range and colors, but legends don't match up between heatmaps due to different distributions within each matrix's (matrices'? grammar feels weird. whatever.) values.

Setting these values led to the following - mostly unstyled, but see legend color mapping differences.

#param settings - 
heatmap_legend_param(
      # custom legend with color splits at 0, 0.5, 1
      col_fun = colorRamp2(c(0, 0.5, 1), c("blue", "white", "red")),
      # set range with breaks
      at = seq(0, 1, 0.25)
)

jashapiro · 2023-10-27T20:43:58Z

Something must be missing, because your heatmaps are definitely not using that col_fun value.

jashapiro · 2023-10-27T20:47:09Z

Following up: you have to set the col argument to the same color function within the call to Heatmap() itself, maybe?

sjspielman · 2023-10-30T13:58:28Z

Noting @jashapiro identified the good code solution to this, so we're good to keep matrices as a list!

allyhawkins

This looks really good! I just had some comments on comments, but you can feel free to ignore those or use them.

templates/qc_report/celltypes_supplemental_report.rmd

Co-authored-by: Ally Hawkins <[email protected]>

sjspielman added 3 commits October 19, 2023 12:33

move heatmaps above diagnostics and add functions to calculate jaccar…

37117c6

…d similarity matrix

Merge branch 'sjspielman/517-delta-median-update' into sjspielman/516…

bd3961f

…-heatmaps-calc-jaccard

code to combine matrices and track how big each split is

bc18369

sjspielman requested a review from allyhawkins October 19, 2023 17:38

Base automatically changed from sjspielman/517-delta-median-update to development October 19, 2023 20:49

Merge branch 'development' into sjspielman/516-heatmaps-calc-jaccard

0a77c45

sjspielman added 2 commits October 27, 2023 08:13

Merge branch 'development' into sjspielman/516-heatmaps-calc-jaccard

753076e

function joshification, and make it a matrix

4d6c004

sjspielman requested review from allyhawkins and removed request for allyhawkins October 27, 2023 17:45

jashapiro reviewed Oct 27, 2023

View reviewed changes

templates/qc_report/celltypes_supplemental_report.rmd Outdated Show resolved Hide resolved

jashapiro reviewed Oct 27, 2023

View reviewed changes

sjspielman and others added 2 commits October 27, 2023 14:04

Update templates/qc_report/celltypes_supplemental_report.rmd

3740012

Co-authored-by: Joshua Shapiro <[email protected]>

swap matrix dimensions and update docs accordingly

c093a61

jashapiro reviewed Oct 27, 2023

View reviewed changes

remove dev code, and update to use a list of matrices instead of cbin…

5d764ea

…d'ing them all. no need for split indices now either

move jaccard functions up

789db43

allyhawkins approved these changes Oct 30, 2023

View reviewed changes

templates/qc_report/celltypes_supplemental_report.rmd Outdated Show resolved Hide resolved

templates/qc_report/celltypes_supplemental_report.rmd Outdated Show resolved Hide resolved

Apply suggestions from code review

e2c6897

Co-authored-by: Ally Hawkins <[email protected]>

sjspielman merged commit 7e0cb90 into development Oct 30, 2023
3 checks passed

sjspielman deleted the sjspielman/516-heatmaps-calc-jaccard branch October 30, 2023 16:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add jaccard calculation code #521

Add jaccard calculation code #521

sjspielman commented Oct 19, 2023

allyhawkins commented Oct 19, 2023

jashapiro commented Oct 20, 2023

allyhawkins commented Oct 20, 2023

sjspielman commented Oct 27, 2023

jashapiro Oct 27, 2023

jashapiro Oct 27, 2023

sjspielman Oct 27, 2023

jashapiro Oct 27, 2023

jashapiro commented Oct 27, 2023

sjspielman commented Oct 27, 2023

jashapiro commented Oct 27, 2023 •

edited

Loading

sjspielman commented Oct 27, 2023

jashapiro commented Oct 27, 2023

sjspielman commented Oct 27, 2023

jashapiro commented Oct 27, 2023 •

edited

Loading

jashapiro commented Oct 27, 2023

sjspielman commented Oct 30, 2023

allyhawkins left a comment

Add jaccard calculation code #521

Add jaccard calculation code #521

Conversation

sjspielman commented Oct 19, 2023

allyhawkins commented Oct 19, 2023

jashapiro commented Oct 20, 2023

allyhawkins commented Oct 20, 2023

sjspielman commented Oct 27, 2023

jashapiro Oct 27, 2023

Choose a reason for hiding this comment

jashapiro Oct 27, 2023

Choose a reason for hiding this comment

sjspielman Oct 27, 2023

Choose a reason for hiding this comment

jashapiro Oct 27, 2023

Choose a reason for hiding this comment

jashapiro commented Oct 27, 2023

sjspielman commented Oct 27, 2023

jashapiro commented Oct 27, 2023 • edited Loading

sjspielman commented Oct 27, 2023

jashapiro commented Oct 27, 2023

sjspielman commented Oct 27, 2023

jashapiro commented Oct 27, 2023 • edited Loading

jashapiro commented Oct 27, 2023

sjspielman commented Oct 30, 2023

allyhawkins left a comment

Choose a reason for hiding this comment

jashapiro commented Oct 27, 2023 •

edited

Loading

jashapiro commented Oct 27, 2023 •

edited

Loading