-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add jaccard calculation code #521
Add jaccard calculation code #521
Conversation
Okay, I've been looking at the function you have for calculating Jaccard here and doing some reading to figure out exactly how we should apply the Jaccard Index to our data. My overall understanding is that we are calculating the Jaccard Index between the barcodes shared for each group (e.g., barcodes in cluster A and barcodes in cell type A). And then we are doing that across all combinations, obtaining a value for each combination. I think a much simpler way of doing this would involve first creating a binary matrix with the rows as the group names (like cluster 1, cluster 2, T cell, B cell) and the columns as the barcodes. Then you want to calculate the Jaccard index between every possible combination of cluster1/T cell, cluster 1/ cluster 2, cluster 1/ B cell, etc. Once you have the binary matrix, you can use the So all that to say, I think you can condense the two functions you have into one function that takes in the cell metadata and the two categories to compare and then returns the matrix to plot as a heatmap. I spent probably too much time playing around with this, but I think you could condense a lot of what you have to something like this:
|
I'm sure you don't want a third way, but here we are... I think I kind of like Stephanie's a bit better, but I have what might be some simplifications to reduce passing around big objects. I know this isn't the full thing, keeping column names and such, but I think it gets most of the way there. The main innovation here is to use
I didn't turn the resulting data frame back into a matrix, and didn't rename any columns, you already have the code for that. So... options! |
Just chiming in to say I like the option that Josh posted. I think it's a bit easier to follow and simpler than what is currently there. |
I've updated the calculations to use Josh's version, and added rxoygen comments all around, and ensured the function returns a matrix for subsequent |
dplyr::mutate(jaccard = jaccard_scores) |> | ||
# convert to matrix | ||
tidyr::pivot_wider( | ||
names_from = id1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separately, I think we want to do the opposite matrix transformation here, so we get coordinates that are [id1, id2]
Co-authored-by: Joshua Shapiro <[email protected]>
|
||
if (has_submitter) { | ||
cluster_submitter_jaccard_matrix <- make_jaccard_matrix(celltype_df, "cluster", "submitter_celltype_annotation") | ||
jaccard_mat <- rbind(jaccard_mat, cluster_submitter_jaccard_matrix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to get the cluster names onto the rownames of this matrix somehow? I think split(celltype_df$cluster)
(which is effectively called in the make_jaccard_matrix) will always return the same order, which I would expect to be the same as unique(celltype_df$cluster)
, but it might be worth using join
to combine these matrices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They already are in the matrices from the jaccard calculation itself, with the tibble::column_to_rownames(var = "id1")
code, unless I've misunderstood what you're asking here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they would be lost in the rbind
, but if we are not going to rbind
and will use the ComplexHeatmap function instead, this will be irrelevant.
I wonder if we don't want to actually bother joining the matrices, but leave them as a list, so we can let ComplexHeatMap handle it? https://jokergoo.github.io/ComplexHeatmap-reference/book/a-list-of-heatmaps.html |
👏 great find, concatenating looks much better than splitting! |
…d'ing them all. no need for split indices now either
I think the code you added here looks fine, but I was thrown off by the diff, which makes it look like you updated more than you did... Moving a big chunk of code (as we wanted to) makes for a big diff! My main note, I think, would be to move the jaccard functions up to probably just after the function imports? |
At some point I had accidentally committed my lines of local dev code (define/read in a real SCE) so it was also removed there, so maybe this is why it looks that way... I don't see any specific weirdness lingering, though!
👍 And circling back...
I've started to think about next steps, and thinking about how this approach will work with the legend, which we may not have discussed specifically yet! |
All the heatmaps should be on the same scale of 0-1, no matter what the values, so we should be setting that manually anyway. |
Something must be missing, because your heatmaps are definitely not using that |
Following up: you have to set the |
Noting @jashapiro identified the good code solution to this, so we're good to keep matrices as a list! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good! I just had some comments on comments, but you can feel free to ignore those or use them.
Co-authored-by: Ally Hawkins <[email protected]>
Towards #516
Stacked on #520
This PR starts the process of updating heatmaps. Here, I made two broad changes
I did not yet remove any of the current heatmap code. Everything in the heatmap section that comes after the sentence I added
"Note that the remaining heatmap code is going to be updated next!"
is the same as it was before. The two chunks in the heatmap section that come before that sentence are focus to look during review.I wrote two functions which, together, are calculating jaccard similarity between two categorical columns, e.g. cluster & singler annotations. The function
calculate_jaccard_similarity()
does the actual calculation for a given jaccard similarity value, and the functioncalculate_jaccard_matrix()
wraps it to run over all values in the two columns of interest and ultimately return a matrix for use in a heatmap.Using these functions, I calculate a matrix for each celltype annotation compared to clusters, and I
rbind()
them together as I go. During this, I also track the size of each matrix so we can split a ComplexHeatmap accordingly.The code is all working as expected, with seemingly correct calculations coming out, so I expect most discussion in this PR will be about the specific strategies I've used and how those might be modified to lay some solid ground for next steps!