Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add jaccard calculation code #521

Merged
merged 11 commits into from
Oct 30, 2023

Conversation

sjspielman
Copy link
Member

Towards #516
Stacked on #520

This PR starts the process of updating heatmaps. Here, I made two broad changes

  • Moved heatmaps above diagnostic section
  • Wrote code for calculating jaccard similarity matrices

I did not yet remove any of the current heatmap code. Everything in the heatmap section that comes after the sentence I added "Note that the remaining heatmap code is going to be updated next!" is the same as it was before. The two chunks in the heatmap section that come before that sentence are focus to look during review.

I wrote two functions which, together, are calculating jaccard similarity between two categorical columns, e.g. cluster & singler annotations. The function calculate_jaccard_similarity() does the actual calculation for a given jaccard similarity value, and the function calculate_jaccard_matrix() wraps it to run over all values in the two columns of interest and ultimately return a matrix for use in a heatmap.

Using these functions, I calculate a matrix for each celltype annotation compared to clusters, and I rbind() them together as I go. During this, I also track the size of each matrix so we can split a ComplexHeatmap accordingly.

The code is all working as expected, with seemingly correct calculations coming out, so I expect most discussion in this PR will be about the specific strategies I've used and how those might be modified to lay some solid ground for next steps!

Base automatically changed from sjspielman/517-delta-median-update to development October 19, 2023 20:49
@allyhawkins
Copy link
Member

Okay, I've been looking at the function you have for calculating Jaccard here and doing some reading to figure out exactly how we should apply the Jaccard Index to our data. My overall understanding is that we are calculating the Jaccard Index between the barcodes shared for each group (e.g., barcodes in cluster A and barcodes in cell type A). And then we are doing that across all combinations, obtaining a value for each combination.

I think a much simpler way of doing this would involve first creating a binary matrix with the rows as the group names (like cluster 1, cluster 2, T cell, B cell) and the columns as the barcodes. Then you want to calculate the Jaccard index between every possible combination of cluster1/T cell, cluster 1/ cluster 2, cluster 1/ B cell, etc. Once you have the binary matrix, you can use the dist() function to calculate the distance between the rows of the matrix. By using the method = "binary" option, you get the Jaccard Distance which is just 1 - Jaccard Similarity Index.

So all that to say, I think you can condense the two functions you have into one function that takes in the cell metadata and the two categories to compare and then returns the matrix to plot as a heatmap.

I spent probably too much time playing around with this, but I think you could condense a lot of what you have to something like this:

groups_to_compare <- c(colname1, colname2)

binary_mtx <- groups_to_compare |> 
  purrr::map(\(colname){
    
    # create binary matrix with group options(like clusters) as rows and barcodes as columns 
    test_2 <- celltype_df |> 
      dplyr::select(c("barcodes", {{colname}})) |> 
      # remove any NAs? This is just here to get it to work
      # but I think we want to keep NA and just relabel as Unknown
      tidyr::drop_na() |> 
      tidyr::pivot_wider(
        id_cols = colname,
        names_from = "barcodes",
        values_from = "barcodes",
        values_fn = length,
        values_fill = 0
      ) |>
      # remove the barcode column 
      tibble::column_to_rownames(colname)
    
  }) |> 
  dplyr::bind_rows()

# get the distance matrix using binary
# binary calcualtes the Jaccard Distance which is the inverse of Jaccard Similarity Index
dist_obj <- dist(binary_mtx, method = "binary")
# 1 - Jaccard Distance = Jaccard Similarity 
jaccard_mtx <- 1 - as.matrix(dist_obj)

# get the col/ row names to keep  
group_1_values <- celltype_df |>
  dplyr::pull({{colname1}}) |> 
  unique()

group_2_values <- celltype_df |> 
  dplyr::pull({{colname2}}) |> 
  unique() |> 
  na.omit()
  
# get a matrix with group 1 in the rows and group 2 in the columns 
jaccard_subset_mtx <- jaccard_mtx[group_1_values, group_2_values]

@jashapiro
Copy link
Member

I'm sure you don't want a third way, but here we are...

I think I kind of like Stephanie's a bit better, but I have what might be some simplifications to reduce passing around big objects. I know this isn't the full thing, keeping column names and such, but I think it gets most of the way there. The main innovation here is to use split(), which is really handy for making the lists of barcodes we need for later analysis.

# tiny jaccard function to save later typing
jaccard <- function(vec1, vec2){
  length(intersect(vec1, vec2)) / length(union(vec1, vec2))
}

make_jaccard_df <- function(celltype_df, colname1, colname2){
  # make lists of barcodes for each classifier's assignment, named by the classifier values
  id1_list <- split(celltype_df$barcodes, celltype_df[[colname1]])
  id2_list <- split(celltype_df$barcodes, celltype_df[[colname2]])
  
  # create the grid of comparisons
  cross_df <- tidyr::expand_grid(id1 = names(id1_list), id2 = names(id2_list))
  
  # calculate scores using split lists & ids
  jaccard_scores <- cross_df |> 
    purrr::pmap_dbl(\(id1, id2){
      jaccard(id1_list[[id1]], id2_list[[id2]])
    })
  
  # add scores to the comparison grid
  jaccard_df <- cross_df |> 
    dplyr::mutate(jaccard = jaccard_scores)
  return(jaccard_df)
}

I didn't turn the resulting data frame back into a matrix, and didn't rename any columns, you already have the code for that.

So... options!

@allyhawkins
Copy link
Member

Just chiming in to say I like the option that Josh posted. I think it's a bit easier to follow and simpler than what is currently there.

@sjspielman
Copy link
Member Author

I've updated the calculations to use Josh's version, and added rxoygen comments all around, and ensured the function returns a matrix for subsequent rbind'ing. If there are any further simplifications we can thing of for combining the matrices, I'm all ears!

@sjspielman sjspielman requested review from allyhawkins and removed request for allyhawkins October 27, 2023 17:45
dplyr::mutate(jaccard = jaccard_scores) |>
# convert to matrix
tidyr::pivot_wider(
names_from = id1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately, I think we want to do the opposite matrix transformation here, so we get coordinates that are [id1, id2]


if (has_submitter) {
cluster_submitter_jaccard_matrix <- make_jaccard_matrix(celltype_df, "cluster", "submitter_celltype_annotation")
jaccard_mat <- rbind(jaccard_mat, cluster_submitter_jaccard_matrix)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to get the cluster names onto the rownames of this matrix somehow? I think split(celltype_df$cluster) (which is effectively called in the make_jaccard_matrix) will always return the same order, which I would expect to be the same as unique(celltype_df$cluster), but it might be worth using join to combine these matrices?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They already are in the matrices from the jaccard calculation itself, with the tibble::column_to_rownames(var = "id1") code, unless I've misunderstood what you're asking here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they would be lost in the rbind, but if we are not going to rbind and will use the ComplexHeatmap function instead, this will be irrelevant.

@jashapiro
Copy link
Member

I wonder if we don't want to actually bother joining the matrices, but leave them as a list, so we can let ComplexHeatMap handle it? https://jokergoo.github.io/ComplexHeatmap-reference/book/a-list-of-heatmaps.html

@sjspielman
Copy link
Member Author

I wonder if we don't want to actually bother joining the matrices, but leave them as a list, so we can let ComplexHeatMap handle it?

👏 great find, concatenating looks much better than splitting!

…d'ing them all. no need for split indices now either
@jashapiro
Copy link
Member

jashapiro commented Oct 27, 2023

I think the code you added here looks fine, but I was thrown off by the diff, which makes it look like you updated more than you did... Moving a big chunk of code (as we wanted to) makes for a big diff!

My main note, I think, would be to move the jaccard functions up to probably just after the function imports?

@sjspielman
Copy link
Member Author

but I was thrown off by the diff, which makes it look like you updated more than you did?

At some point I had accidentally committed my lines of local dev code (define/read in a real SCE) so it was also removed there, so maybe this is why it looks that way... I don't see any specific weirdness lingering, though!

My main note, I think, would be to move the jaccard functions up to probably just after the function imports?

👍

And circling back...

I wonder if we don't want to actually bother joining the matrices, but leave them as a list, so we can let ComplexHeatMap handle it?

👏 great find, concatenating looks much better than splitting!

I've started to think about next steps, and thinking about how this approach will work with the legend, which we may not have discussed specifically yet!
Will we want shared 1 legend for a group of stacked heatmaps, or a legend for each heatmap in the stack? If we really want 1 shared legend, I think (?) we'll need to revert back to a splitting approach, but if 3 legends is ok, then concatenation should be fine. If every matrix had a full range of values 0-1, this would probably not matter, but of course that's not how the data will look.

@jashapiro
Copy link
Member

Will we want shared 1 legend for a group of stacked heatmaps, or a legend for each heatmap in the stack? If we really want 1 shared legend, I think (?) we'll need to revert back to a splitting approach, but if 3 legends is ok, then concatenation should be fine. If every matrix had a full range of values 0-1, this would probably not matter, but of course that's not how the data will look.

All the heatmaps should be on the same scale of 0-1, no matter what the values, so we should be setting that manually anyway.

@sjspielman
Copy link
Member Author

Testing things out, we can certainly fix the range and colors, but legends don't match up between heatmaps due to different distributions within each matrix's (matrices'? grammar feels weird. whatever.) values.

Setting these values led to the following - mostly unstyled, but see legend color mapping differences.

#param settings - 
heatmap_legend_param(
      # custom legend with color splits at 0, 0.5, 1
      col_fun = colorRamp2(c(0, 0.5, 1), c("blue", "white", "red")),
      # set range with breaks
      at = seq(0, 1, 0.25)
)
Screenshot 2023-10-27 at 3 49 43 PM Screenshot 2023-10-27 at 3 49 48 PM

@jashapiro
Copy link
Member

jashapiro commented Oct 27, 2023

Something must be missing, because your heatmaps are definitely not using that col_fun value.

@jashapiro
Copy link
Member

Following up: you have to set the col argument to the same color function within the call to Heatmap() itself, maybe?

@sjspielman
Copy link
Member Author

Noting @jashapiro identified the good code solution to this, so we're good to keep matrices as a list!

Copy link
Member

@allyhawkins allyhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! I just had some comments on comments, but you can feel free to ignore those or use them.

templates/qc_report/celltypes_supplemental_report.rmd Outdated Show resolved Hide resolved
templates/qc_report/celltypes_supplemental_report.rmd Outdated Show resolved Hide resolved
@sjspielman sjspielman merged commit 7e0cb90 into development Oct 30, 2023
3 checks passed
@sjspielman sjspielman deleted the sjspielman/516-heatmaps-calc-jaccard branch October 30, 2023 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants