.ComputeClusterStats update; add pattern method to outer functions. #6

MatveevDaniil · 2024-03-06T19:46:36Z

1. Rewrote the code of .computeClusterStats
1.1. Speed improvement: roughly 300% on small toy data with 200 rows, and roughly 300-700% on real CMV data with 10^5 strings.
1.2. Open question: how should we handle NA values in the count column? I don't think using na.rm=TRUE is the best idea. Maybe we should warn the user that the count column has NA values.
2. Removed duplicated code: .computeClusterStatsDualChain
2.1. This function duplicated .computeClusterStats in everything except that it computed string statistics for two columns.
3. Added network building method to outer functions.
3.1. We should rewrite the documentation for them; I need help with that.
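
To make the open question in 1.2 concrete, here is a minimal sketch of the "warn the user" option. The helper name check_count_na and the toy data frame are hypothetical and not part of NAIR; this only illustrates the idea, not how the package should necessarily implement it.

```r
# Hypothetical helper (not part of NAIR): warn when the count column has NAs
# instead of silently dropping them with na.rm = TRUE.
check_count_na <- function(data, count_col) {
  n_na <- sum(is.na(data[[count_col]]))
  if (n_na > 0) {
    warning(
      sprintf("Column '%s' contains %d NA value(s).", count_col, n_na),
      call. = FALSE
    )
  }
  # Return the number of NAs invisibly so callers can decide what to do next
  invisible(n_na)
}

# Toy input, made up for illustration
toy <- data.frame(
  cdr3 = c("CASSL", "CASSF", "CASSQ"),
  count = c(3, NA, 5)
)
n_missing <- suppressWarnings(check_count_na(toy, "count"))
```

Returning the NA count keeps the decision about dropping rows or using na.rm = TRUE with the caller, so the warning does not change any computed statistic by itself.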

In addition to the automated tests, I checked on the toy data that the output of the new .computeClusterStats matches the output of the old version:
(the script should be run from the project root directory)

library(NAIR)
library(igraph)

gen_computeClusterStats_input <- function( # nolint
  seq_list = NULL,
  drop_isolated_nodes = FALSE
) {
  if (is.null(seq_list)) {
    seq_list <- data.frame(
      cdr3 = NAIR::simulateToyData()$CloneSeq
    )
  }
  adjacency_matrix <- NAIR::generateAdjacencyMatrix(
    seq_list$cdr3,
    dist_type = "hamming",
    dist_cutoff = 1,
    drop_isolated_nodes = drop_isolated_nodes,
    method = "pattern"
  )
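  if (!drop_isolated_nodes) {
    # All nodes are kept, so label rows and columns with the input sequences
    adjacency_matrix@Dimnames[1] <- adjacency_matrix@Dimnames[2] <- list(
      seq_list$cdr3
    )
  } else {
    # Isolated nodes were dropped: reuse the column names of the matrix
    # (the retained sequences) as row names and as the new sequence list
    seq_list <- adjacency_matrix@Dimnames[1] <- adjacency_matrix@Dimnames[2]
  }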
  net <- igraph::graph_from_adjacency_matrix(
    adjacency_matrix,
    mode = "undirected"
  )
  seq_col <- "cdr3"
  count_col <- "count"
  cluster_id_col <- "cluster_id"
  degree_col <- "degree"
  if (drop_isolated_nodes) {
    data <- data.frame(
      cdr3 = unlist(seq_list)
    )
  } else {
    data <- data.frame(
      cdr3 = seq_list$cdr3
    )
  }
  data$count <- 1
  data$degree <- igraph::degree(
    net
  )
  data$cluster_id <- igraph::components(net)$membership
  return(list(
    data = data,
    adjacency_matrix = adjacency_matrix,
    seq_col = seq_col,
    count_col = count_col,
    cluster_id_col = cluster_id_col,
    degree_col = degree_col
  ))
}

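# Load the development version so that internal functions are available;
# the previous implementation is assumed to be kept as .computeClusterStatsOLD
devtools::load_all()
for (i in seq_len(10)) {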
  inp_dat_ex <- gen_computeClusterStats_input(drop_isolated_nodes = FALSE)

  data <- inp_dat_ex$data
  adjacency_matrix <- inp_dat_ex$adjacency_matrix
  seq_col <- inp_dat_ex$seq_col
  count_col <- inp_dat_ex$count_col
  cluster_id_col <- inp_dat_ex$cluster_id_col
  degree_col <- inp_dat_ex$degree_col


  old_clust_time <- system.time({
    old_clust_stats <- .computeClusterStatsOLD(
      inp_dat_ex$data,
      inp_dat_ex$adjacency_matrix,
      inp_dat_ex$seq_col,
      inp_dat_ex$count_col,
      inp_dat_ex$cluster_id_col,
      inp_dat_ex$degree_col
    )
  })
  print(paste("Old time:", old_clust_time[["user.self"]]))

  # devtools::load_all()
  new_clust_time <- system.time({
    new_clust_stats <- .computeClusterStats(
      data = data,
      adjacency_matrix = adjacency_matrix,
      cluster_id_col = cluster_id_col,
      degree_col = degree_col,
      seq_col = seq_col,
      count_col = count_col
    )
  })
  print(paste("New time:", new_clust_time[["user.self"]]))

  # Print the name of every column whose values differ between old and new
  for (col in names(old_clust_stats)) {
    if (!isTRUE(all.equal(old_clust_stats[[col]], new_clust_stats[[col]]))) {
      print(col)
    }
  }
}
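
The column-by-column check above could also be wrapped in a small helper that returns the mismatching column names instead of printing them, which makes it easy to assert on. This is only a sketch: differing_columns and the two toy data frames are hypothetical and are not taken from the package or its tests.

```r
# Hypothetical helper: names of columns that differ between two data frames.
# A column missing from either input is reported as differing.
differing_columns <- function(old, new) {
  cols <- union(names(old), names(new))
  is_diff <- vapply(cols, function(col) {
    !isTRUE(all.equal(old[[col]], new[[col]]))
  }, logical(1))
  cols[is_diff]
}

# Toy cluster-level tables, made up for illustration
old_stats <- data.frame(
  cluster_id = 1:2,
  node_count = c(3, 2),
  max_count = c(5, 1)
)
new_stats <- data.frame(
  cluster_id = 1:2,
  node_count = c(3, 2),
  max_count = c(5, 2)
)
mismatches <- differing_columns(old_stats, new_stats)
```

In the benchmark loop, a call such as stopifnot(length(differing_columns(old_clust_stats, new_clust_stats)) == 0) would stop at the first run in which the two implementations disagree.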

MatveevDaniil · 2024-03-06T19:49:32Z

Hello @brianpatrickneal, let's discuss the bolded points I made in the PR description

brianpatrickneal · 2024-03-09T01:19:08Z

Hi Daniil,

I agree with your suggestion to issue a warning when encountering NAs in the count column. I will be happy to take care of updating the documentation (it will be a good way for me to review and familiarize myself with the changes).

As a heads up, it is almost certain that I will be slow to do so as the next two weeks are going to be especially hellish for me. It is unlikely that I will get it done before the weekend of 3/22. If this delay interferes with your ongoing work, let me know and I'll do what I can to expedite, though I can't make any credible promises, much as I would like to.

Thanks for your patience with me!

Best,
Brian


MatveevDaniil · 2024-03-09T01:39:51Z

Hello Brian,

No worries, I don't think we are in a rush; let's discuss it when you have more time! By then I should also have more analysis of the different adjacency matrix building methods, and we can add that information to the docs too (and/or decide how to automate the choice).

Sincerely,
Daniil Matveev


brianpatrickneal · 2024-03-28T03:46:37Z

Documentation has been updated.

Further thoughts on 1.2 (handling NAs in count data):
The approach in the current release (compute max/total count without NA values) is motivated by findPublicClusters(), which selects clusters based on total clone count:

We want to select clusters with sufficient total clone count.
Leaving NAs intact makes the total count evaluate to NA, so we may overlook clusters whose total count is already sufficient among the nodes that do have count data.
Dropping observations without count data from the initial input data is also undesirable, since clusters are also selected by node count.
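
The trade-off described in these three points can be seen on a toy example. The data frame below is made up for illustration and uses only base R; it does not call findPublicClusters() or any other NAIR function.

```r
# Toy data: cluster 1 has one node without count data
toy <- data.frame(
  cluster_id = c(1, 1, 2, 2),
  count = c(10, NA, 4, 6)
)

# Leaving NAs intact: the total for cluster 1 evaluates to NA
total_keep_na <- tapply(toy$count, toy$cluster_id, sum)

# Computing the total without NA values: cluster 1 keeps its known total
total_drop_na <- tapply(toy$count, toy$cluster_id, sum, na.rm = TRUE)

# Node count per cluster on the full data
node_count <- tapply(toy$count, toy$cluster_id, length)

# Dropping rows without count data changes the node count of cluster 1
filtered <- toy[!is.na(toy$count), ]
node_count_filtered <- tapply(filtered$count, filtered$cluster_id, length)
```

Here cluster 1 has a known total of 10 when NAs are ignored but an NA total when they are kept, and removing the NA row reduces its node count from 2 to 1, which is the effect on node-count-based selection mentioned above.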

I'll bring the question up for discussion during tomorrow's meeting.

NEW: added network building method to outer functions; rewrited the c…

8abba7a

…ode of computeclusterstats; removed duplicated code - computeclusterstatsdualchain

MatveevDaniil self-assigned this Mar 6, 2024

MatveevDaniil requested a review from brianpatrickneal March 6, 2024 19:48

brianpatrickneal added 3 commits March 25, 2024 16:34

Add method argument to documentation

c14cd10

Add method argument to documentation

b7e4f1a

Merge branch 'optimization' of https://github.com/mlizhangx/Network-A…

a054d1b

…nalysis-for-Repertoire-Sequencing- into optimization

brianpatrickneal added 3 commits April 5, 2024 11:05

Remove data rows with NA count values when count data is provided

4e09470

Update change log

2b4997a

Increment development version

7843bae

brianpatrickneal marked this pull request as ready for review April 5, 2024 18:43

brianpatrickneal merged commit ee01f2b into main Apr 5, 2024
7 checks passed
