NAIR 1.0.1

@brianpatrickneal brianpatrickneal released this 14 Sep 19:44

Breaking Changes

  • getClusterStats() now requires the cluster ID column to be specified and present in the provided node metadata; it will no longer compute cluster membership since it does not return the node metadata (so any membership values computed are lost).
  • addClusterMembership() now accepts and returns the list of network objects, instead of accepting the node metadata (with the igraph as an additional input) and returning the node metadata. The parameter net is now the first parameter and accepts the list of network objects rather than just the igraph; the former first parameter data is deprecated and has moved to a later position. The old usage is still supported for now, provided that net and data are specified by name (or passed in their updated positions). See section "Unified Primary Argument Across Functions" for context.
  • Functions no longer save output to file by default. The user must provide a directory/file path to the appropriate parameter for output to be saved.
  • All instances of "individual" as a default value for output_type have been changed to "rds". "rds" is the preferred default since it reduces file size/clutter and the list of network objects can be restored intact (the list is the primary input/output of core NAIR functions) under any name desired. "rda" should be used if the file will be transferred across machines (the list will be restored under the name net), and "individual" should be used when the output is to be accessed from outside of R.
  • output_type = "individual" now writes the row names of the node metadata to the first column of the csv file. These contain the original row IDs from the input data.
  • Default value of output_type in findAssociatedClones() and input_type in buildAssociatedClusterNetwork() changed from "csv" to "rds", since these files are intermediate outputs and typically there should be no need to access them from outside of R or from another machine.
  • buildPublicClusterNetworkByRepresentative() default value of output_type changed from "rda" to "rds".

New Features

This section covers general new features. Other new features are grouped by subject in the following few sections.

  • buildRepSeqNetwork() now has the convenient alias buildNet().
  • The list returned by buildRepSeqNetwork() now contains an element details with network metadata such as the argument values used in the function call.
  • Plots with nodes colored according to a continuous variable will now have their legends displayed using a color bar instead of discrete legend values, unless that variable is also used to size the nodes.
  • In most cases where an invalid value is supplied to a function argument for which a meaningful default exists, instead of raising an error, the argument's value is replaced by the default value and a warning is raised.

Unified Primary Argument Across Functions

Several changes and additions have been made in favor of using the list of network objects returned by buildRepSeqNetwork() as a unified primary input and output across the core NAIR functions. Adopting this convention offers several benefits: It greatly simplifies usage, since users no longer need to know which components of the list to input to which function (or what each function returns); it eliminates the task of manually updating the list of network objects; it results in the core functions working with the pipe operator; and most importantly, it improves functionality within and between functions, since functions can read and modify anything in the network list. For instance, addPlots() can use the coordinate layout of any existing plots to ensure a consistent layout across plots (which is no longer guaranteed otherwise), while addClusterStats() can add cluster membership values to the node metadata and record in details that the cluster properties correspond to these membership values (and not the values from a different instance of clustering using a different algorithm).

The following changes encompass the move toward using the network list as a primary input/output:

  • addClusterMembership() parameters and return value have changed. See the Breaking Changes section for details.
  • addPlots() added as the preferred alternative to generateNetworkGraphPlots() and plotNetworkGraph()
  • addClusterStats() added as the preferred alternative to getClusterStats()
  • addNodeStats() added as the preferred alternative to addNodeNetworkStats()
  • labelClusters() added as the preferred alternative to addClusterLabels()
  • labelNodes() added as the preferred alternative to addGraphLabels()

See the new "Supplementary Functions" vignette for examples.
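
As a rough illustration of the unified convention, the sketch below passes the network list from one core function to the next. It assumes R 4.1 or later (for the native pipe), the toy data from simulateToyData() with its sequence column named "CloneSeq", and that the node metadata is stored in the list element node_data; check these against your installed version before relying on them.

```r
library(NAIR)

# Simulated data for illustration; the column name "CloneSeq" is assumed
# from the package's toy data.
set.seed(42)
toy_data <- simulateToyData()

# Each function takes the network list as its first argument and returns
# the updated list, so the calls can be chained with the pipe operator.
net <- toy_data |>
  buildNet("CloneSeq") |>
  addNodeStats() |>
  addClusterStats()

# Inspect what the list now contains.
names(net)
```

The same steps can be written as separate assignments (net <- addNodeStats(net), and so on) if the pipe is not available.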

Multiple Instances of Clustering

The following changes and additions have been made to facilitate multiple instances of clustering on the same network using different clustering algorithms. See the new "Cluster Analysis" vignette for examples.

  • All functions that can perform clustering now have a parameter cluster_id_name that can be used to specify a custom name for the cluster membership variable added to the node metadata.
  • Each time a new cluster membership variable is added to the node metadata, information is added to details recording the clustering algorithm used and the name of the corresponding cluster membership variable.
  • When cluster properties are computed with addClusterStats(), information is added to details recording the cluster membership variable corresponding to the cluster properties.
  • labelClusters() and addClusterLabels() now check details to confirm that the cluster properties match the specified cluster membership variable before using the node counts in the cluster properties.
  • labelClusters() and addClusterLabels() can now be used without cluster properties; node count is computed from the cluster membership values.
  • labelClusters() can be used to label multiple plots at once.
  • addClusterMembership(), addClusterStats() and addNodeStats() now allow custom argument values for optional parameters of the clustering algorithm through the ellipses (...) argument.

It may also be of interest in the future to add functionality allowing the network list to contain multiple sets of cluster properties corresponding to different instances of clustering.
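
The following sketch shows one way the cluster_id_name parameter might be used to keep two clustering results side by side. The variable names "cluster_greedy" and "cluster_louvain" are arbitrary choices for this example, and the element name node_data is an assumption about the structure of the network list.

```r
library(NAIR)

set.seed(42)
toy_data <- simulateToyData()

# Build the network without plots to keep the example small.
net <- buildNet(toy_data, "CloneSeq", plots = FALSE)

# Two instances of clustering, each stored under its own variable name.
net <- addClusterMembership(net,
  cluster_fun = "fast_greedy",
  cluster_id_name = "cluster_greedy"
)
net <- addClusterMembership(net,
  cluster_fun = "louvain",
  cluster_id_name = "cluster_louvain"
)

# Both membership variables should now appear in the node metadata, and
# details should record which algorithm produced which variable.
head(net$node_data[, c("cluster_greedy", "cluster_louvain")])
str(net$details)
```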

Plots and Graph Layout

Plotting functions no longer fix the random seed when generating the coordinate layout for a plot. In order to facilitate a consistent layout across multiple plots of the same network graph, the following changes have been made.

  • Multiple plots produced in the same call to buildRepSeqNetwork(), addPlots() and generateNetworkGraphPlots() will all use a common layout.
  • Plot lists created by buildRepSeqNetwork(), addPlots() and generateNetworkGraphPlots() now include a matrix graph_layout containing the layout used in the plots.
  • addPlots() will automatically use the graph_layout mentioned above to ensure that new plots use the same layout as existing plots.
  • If the network list already contains plots but graph_layout is absent, addPlots() will extract the layout from the first plot and use it for the new plots.
  • generateNetworkGraphPlots() has a new parameter layout that can be used to specify the layout. This can be used to generate new plots with the same layout as existing plots (though addPlots() is easier), or to generate plots with custom layout types other than the default layout created using igraph::layout_components().
  • saveNetworkPlots() has a new parameter outfile_layout that can be used to save the graph layout.
  • saveNetwork() automatically saves the graph layout when output_type = "individual".

Essentially, generating new plots with addPlots() will ensure a consistent layout with the initial plots. Fixing a random seed before calling buildRepSeqNetwork() (or before the first call to addPlots(), if buildRepSeqNetwork() is called with plots = FALSE) allows the same layout to be reproduced across multiple executions of the same code in which the initial plots are generated.
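
A minimal sketch of this workflow is given below. It assumes the toy data columns "SampleID" and "CloneCount" and that the plot list is stored in the element plots; the seed value is arbitrary.

```r
library(NAIR)

toy_data <- simulateToyData()

# Fix the seed immediately before the initial plots are generated so that
# the layout can be reproduced when the script is run again.
set.seed(42)
net <- buildNet(toy_data, "CloneSeq", color_nodes_by = "SampleID")

# New plots added later reuse the stored layout of the existing plots.
net <- addPlots(net, color_nodes_by = "CloneCount")

# The shared layout is kept alongside the plots as a matrix.
dim(net$plots$graph_layout)
```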

Improved File Input Functionality

  • Most instances of the file_list argument now accept a list containing connections and file paths instead of only a character vector of file paths. This allows a greater variety of data sources to be used.
  • A greater variety of input data formats are now supported. Instances of the input_type parameter that accept text formats have a new parameter read.args that accepts a named list of optional arguments to read.table() and its variants read.csv(), etc. Dedicated arguments for header and sep still exist apart from read.args for backwards compatibility, but their defaults now match input_type (e.g., sep defaults to "," for input_type = "csv" and to "" for input_type = "table").
  • input_type = "tsv" now reads files using read.delim() instead of read.table().
  • Most instances of the input_type argument now also support the value "csv2" for reading files using read.csv2().
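
For readers less familiar with the base R readers named above, the sketch below uses only base R to show how read.csv2() and read.delim() differ, and how a named list of extra arguments can be forwarded to a reader. The use of do.call() here is an illustration of the idea behind read.args, not a description of how NAIR forwards the list internally; the file contents and the name read_args are made up for the example.

```r
# A semicolon-delimited file with decimal commas (the "csv2" convention).
f_csv2 <- tempfile(fileext = ".csv")
writeLines(
  c("CloneSeq;CloneFrequency", "CASSLGTEAFF;0,25", "CASSPGQGYEQYF;0,75"),
  f_csv2
)

# A tab-delimited file (the "tsv" convention).
f_tsv <- tempfile(fileext = ".tsv")
writeLines(
  c("CloneSeq\tCloneCount", "CASSLGTEAFF\t10", "CASSPGQGYEQYF\t30"),
  f_tsv
)

# read.csv2() defaults to sep = ";" and dec = ",";
# read.delim() defaults to sep = "\t".
dat_csv2 <- read.csv2(f_csv2)
dat_tsv <- read.delim(f_tsv)

# A named list of optional reader arguments, similar in spirit to read.args.
read_args <- list(stringsAsFactors = FALSE, nrows = 1)
dat_one <- do.call(read.csv2, c(list(file = f_csv2), read_args))
```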

Lifecycle Changes

  • plotNetworkGraph() deprecated in favor of addPlots().
  • filterInputData() argument count_col deprecated. Rows with NA counts are no longer dropped.
  • getClusterFun() argument cluster_fun deprecated (see Breaking Changes)
  • addNodeNetworkStats() deprecated in favor of addNodeStats() (see section "Unified Primary Argument Across Functions")
  • addClusterMembership() argument data deprecated (see section "Unified Primary Argument Across Functions")
  • addClusterMembership() argument fun deprecated in favor of cluster_fun for consistency with other functions.
  • sparseAdjacencyMatFromSeqs() argument max_dist deprecated in favor of dist_cutoff for consistency with other functions.
  • saveNetwork() argument output_filename deprecated in favor of output_name for consistency with other functions.
  • sparseAdjacencyMatFromSeqs() deprecated in favor of its better-named twin generateAdjacencyMatrix().
  • generateNetworkFromAdjacencyMat() deprecated in favor of its better-named twin generateNetworkGraph().

Minor Changes and Bug Fixes

  • output_type = "individual" now also saves the list of plots (if present) to an RDS file. This prevents the ggraph objects containing the plots from being lost, in case the user wishes to modify these plots in the future.
  • All instances of the output_name parameter now automatically replace potentially unsafe characters with underscores and remove any leading or trailing non-alphanumeric characters. Safe characters include alphanumeric characters, underscores and hyphens.
  • Package functions no longer print messages to the console by default. Functions now have a verbose argument which can be set to TRUE to enable printing of console messages. For logging purposes, these messages are now generated using message() rather than cat(), and so send their output to stderr() rather than stdout().
  • buildRepSeqNetwork(), addPlots() and generateNetworkGraphPlots() now have print_plots set to FALSE by default (plots are no longer printed to the R plotting window unless manually specified).
  • buildAssociatedClusterNetwork() now removes duplicate observations after loading the data from all neighborhoods. When multiple associated sequences are similar, the same clone from a given sample can belong to multiple neighborhoods. Previously, this occurrence resulted in the same clone appearing multiple times in the global network.
  • simulateToyData() argument seed_value removed. Users can set a seed prior to calling the function if desired.
  • generateNetworkGraphPlots() now handles the case where color_nodes_by contains duplicate values by removing the duplicate values with a warning. If color_scheme is a vector, the corresponding entries of color_scheme are also removed. Previously, this case resulted in a list of plots containing two elements with the same name.
  • When generateNetworkGraphPlots() is called with a non-numeric variable specified for size_nodes_by, the function now defaults to fixed node sizes with a warning.
  • addClusterStats() and buildRepSeqNetwork(cluster_stats = TRUE) now call sum() and max() with na.rm = TRUE when computing abundance-based properties. This change reflects the fact that buildRepSeqNetwork() no longer drops input data rows with NA and NaN values in the count column.
  • combineSamples() and loadDataFromFileList() now preserve the original row IDs of each input file. In the combined data, each row ID is prefixed with the sample ID (if available) or with the file number based on the order in file_list.
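
The switch from cat() to message() noted above affects how console output can be captured or silenced. The base R sketch below illustrates the difference using a hypothetical function chatty(), which stands in for any function with a verbose argument and is not part of NAIR.

```r
# Hypothetical stand-in for a function with a verbose argument.
chatty <- function(verbose = FALSE) {
  if (verbose) message("building network...")
  invisible(42)
}

# Text written with cat() goes to stdout and is picked up by capture.output().
out_cat <- capture.output(cat("to stdout\n"))

# Messages are conditions, so they can be collected with a handler...
msgs <- character()
withCallingHandlers(
  chatty(verbose = TRUE),
  message = function(m) {
    msgs <<- c(msgs, conditionMessage(m))
    invokeRestart("muffleMessage")
  }
)

# ...or silenced entirely without changing the function's return value.
res <- suppressMessages(chatty(verbose = TRUE))
```

Because messages are sent to stderr, they can also be redirected separately from regular output when a script is run from the command line.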