Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.ComputeClusterStats update; add pattern method to outer functions. #6

Merged
merged 7 commits into from
Apr 5, 2024

Conversation

MatveevDaniil
Copy link
Collaborator

  1. Rewrote the code of .computeClusterStats
    1.1. Speed improvement - ~300% for small toy data with 200 rows. ~300-700% for real CMV data with 10^5 strings.
    1.2. Open question - how should we handle NA values in the count column? I don't think it's the best idea to use na.rm=TRUE. Maybe we should put some warning to the user, that the count column has NA values.
  2. Removed duplicated code - .computeClusterStatsDualChain
    2.1. This function was duplicating .computeClusterStats in everything except it was computing string statistics for two columns.
  3. Added network building method to outer functions.
    3.1. We should rewrite the documentation for them, I need help with that.

Besides automatic tests, I tested that output for the new .computeClusterStats is the same as it was before on the toy data:
(code should be in the project root directory)

library(NAIR)
library(igraph)

gen_computeClusterStats_input <- function( # nolint
  seq_list = NULL,
  drop_isolated_nodes = FALSE
) {
  if (is.null(seq_list)) {
    seq_list <- data.frame(
      cdr3 = NAIR::simulateToyData()$CloneSeq
    )
  }
  adjacency_matrix <- NAIR::generateAdjacencyMatrix(
    seq_list$cdr3,
    dist_type = "hamming",
    dist_cutoff = 1,
    drop_isolated_nodes = drop_isolated_nodes,
    method = "pattern"
  )
  if (drop_isolated_nodes == FALSE) {
    adjacency_matrix@Dimnames[1] <- adjacency_matrix@Dimnames[2] <- list(
      seq_list$cdr3
    )
  } else {
    seq_list <- adjacency_matrix@Dimnames[1] <- adjacency_matrix@Dimnames[2]
  }
  net <- igraph::graph_from_adjacency_matrix(
    adjacency_matrix,
    mode = "undirected"
  )
  seq_col <- "cdr3"
  count_col <- "count"
  cluster_id_col <- "cluster_id"
  degree_col <- "degree"
  if (drop_isolated_nodes) {
    data <- data.frame(
      cdr3 = unlist(seq_list)
    )
  } else {
    data <- data.frame(
      cdr3 = seq_list$cdr3
    )
  }
  data$count <- 1
  data$degree <- igraph::degree(
    net
  )
  data$cluster_id <- igraph::components(net)$membership
  return(list(
    data = data,
    adjacency_matrix = adjacency_matrix,
    seq_col = seq_col,
    count_col = count_col,
    cluster_id_col = cluster_id_col,
    degree_col = degree_col
  ))
}

devtools::load_all()
for (i in 1:10) {
  inp_dat_ex <- gen_computeClusterStats_input(drop_isolated_nodes = FALSE)

  data <- inp_dat_ex$data
  adjacency_matrix <- inp_dat_ex$adjacency_matrix
  seq_col <- inp_dat_ex$seq_col
  count_col <- inp_dat_ex$count_col
  cluster_id_col <- inp_dat_ex$cluster_id_col
  degree_col <- inp_dat_ex$degree_col


  old_clust_time <- system.time({
    old_clust_stats <- .computeClusterStatsOLD(
      inp_dat_ex$data,
      inp_dat_ex$adjacency_matrix,
      inp_dat_ex$seq_col,
      inp_dat_ex$count_col,
      inp_dat_ex$cluster_id_col,
      inp_dat_ex$degree_col
    )
  })
  print(paste("Old time:", old_clust_time[["user.self"]]))

  # devtools::load_all()
  new_clust_time <- system.time({
    new_clust_stats <- .computeClusterStats(
      data = data,
      adjacency_matrix = adjacency_matrix,
      cluster_id_col = cluster_id_col,
      degree_col = degree_col,
      seq_col = seq_col,
      count_col = count_col    
    )
  })
  print(paste("New time:", new_clust_time[["user.self"]]))

  for (col in names(old_clust_stats)) {
    if (!isTRUE(all.equal(old_clust_stats[[col]], new_clust_stats[[col]])))
      print(col)
  }
}

…ode of computeclusterstats; removed duplicated code - computeclusterstatsdualchain
@MatveevDaniil
Copy link
Collaborator Author

MatveevDaniil commented Mar 6, 2024

Hello @brianpatrickneal, let's discuss the bolded points I made in the PR description

@brianpatrickneal
Copy link
Collaborator

brianpatrickneal commented Mar 9, 2024 via email

@MatveevDaniil
Copy link
Collaborator Author

MatveevDaniil commented Mar 9, 2024 via email

@brianpatrickneal
Copy link
Collaborator

Documentation has been updated.

Further thoughts on 1.2 (handling NAs in count data):
The approach in the current release (compute max/total count without NA values) is motivated by findPublicClusters(), which selects clusters based on total clone count:

  • We want to select clusters with sufficient total clone count
  • Leaving NAs intact results in the total count evaluating to NA; may overlook clusters with sufficient total count among nodes with existing count data
  • Dropping observations without count data from initial input data is also undesirable since clusters are also selected by node count

I'll bring the question up for discussion during tomorrow's meeting.

@brianpatrickneal brianpatrickneal marked this pull request as ready for review April 5, 2024 18:43
@brianpatrickneal brianpatrickneal merged commit ee01f2b into main Apr 5, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants