nsmblR infers a consensus gene regulatory network from a gene expression data compendium. It runs 7 different inference algorithms and tallies the edges discovered by each one to reach a consensus on the relevant edges. The algorithms currently used by nsmblR are:
- CLR: Mutual information based algorithm with distribution correction
- ARACNE: Mutual information based algorithm with edge pruning
- Spearman correlation
- PCIT: Partial correlation based algorithm
- MRNET: Maximum relevance/minimum redundancy network inference based on mutual information
- MRNETB
- MutRank: Rank correlation inference
Please follow the links above for the appropriate references and algorithm descriptions. After all of the algorithms have run, the inferred results are filtered by the quantile of the edge scores (default threshold: 0.97) and the remaining edges are tallied. Multiple voting schemes are provided; the final result keeps edges that are present in more than 51% of the cases (4 algorithms out of 7).
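The filter-then-vote idea described above can be sketched in base R. This is only an illustration on simulated scores, not the package's implementation: `score_list`, `filter_edges`, and `votes` are hypothetical names, and the per-method quantile filter is an assumption (see the `edge_filtering` and `edge_voting` documentation for the actual behavior).

```r
# Toy illustration of quantile filtering followed by edge voting.
# All object names here are hypothetical, not part of nsmblR.
set.seed(1)
n_genes <- 10
n_methods <- 7
genes <- paste0("g", seq_len(n_genes))

# Plant one strong edge (g1-g2) that every simulated method should recover
signal <- matrix(0, n_genes, n_genes, dimnames = list(genes, genes))
signal["g1", "g2"] <- signal["g2", "g1"] <- 1

# One symmetric score matrix per simulated inference method
score_list <- lapply(seq_len(n_methods), function(i) {
  noise <- matrix(runif(n_genes^2, 0, 0.5), n_genes, n_genes)
  noise <- (noise + t(noise)) / 2
  m <- signal + noise
  diag(m) <- 0
  m
})

# Keep edges scoring above the q-th quantile of that method's scores
# (upper triangle only, so each undirected edge is counted once)
filter_edges <- function(m, q = 0.97) {
  scores <- m[upper.tri(m)]
  (m > quantile(scores, probs = q, names = FALSE)) & upper.tri(m)
}

kept <- lapply(score_list, filter_edges, q = 0.97)

# Tally: number of methods that retained each edge
votes <- Reduce(`+`, kept)

# Edges retained by at least 4 of the 7 methods
majority <- which(votes >= 4, arr.ind = TRUE)
```

Here `votes` holds, for each gene pair, how many of the simulated methods kept that edge, and `majority` lists the row and column indices of the edges that reach 4 votes.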
Install from GitHub using devtools:
devtools::install_github("diogocamacho/nsmblR")
The easiest way to run nsmblR is to use its wrapper, ensemble_model:
library(nsmblR)
res <- ensemble_model(data, gene_names)
where data is a gene expression compendium (genes in rows, samples in columns) and gene_names are the gene symbols for the rows. Internally, nsmblR keeps as the final set of edges those that score above the 0.97 quantile of the edge scores and are consistent across multiple inference methods. For greater flexibility, use the individual functions of the package to tailor the results to your specific problem.
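As a rough sketch of the expected input layout, the snippet below builds a simulated matrix with genes in rows and samples in columns and checks it before inference. The values and gene names are placeholders, and whether the wrapper needs a plain numeric matrix (rather than a data frame) is an assumption here.

```r
# Simulated stand-in for a real compendium: 50 genes x 12 samples.
# The values and gene names are placeholders, not real data.
set.seed(42)
n_genes <- 50
n_samples <- 12
gene_names <- paste0("gene", seq_len(n_genes))

data <- matrix(rnorm(n_genes * n_samples), nrow = n_genes, ncol = n_samples)

# Sanity checks on orientation and labels before running inference
stopifnot(
  is.matrix(data), is.numeric(data),
  nrow(data) == length(gene_names),  # one symbol per row
  !anyDuplicated(gene_names),        # symbols should be unique
  !anyNA(data)                       # no missing expression values
)

# With nsmblR installed, the wrapper would then be called as in the text:
# res <- ensemble_model(data, gene_names)
```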
The following example shows how to run the nsmblR package on data provided with the package. The data come from E. coli and cover 200 genes in 20 different conditions. First, load the package:
library(nsmblR)
Now we will run the package wrapper on the example data.
net <- ensemble_model(data = data_matrix, gene_names = genes)
The net variable is a list that contains all inferred networks, the inferred networks filtered by a quantile threshold (see the documentation for the edge_filtering function), and the results of the consensus voting. The consensus vote is the final result of the nsmblR package and can be accessed as:
consensus_net <- net$consensus_network
which is a data frame with N edges and 6 columns: the gene names for each edge (x and y) and the presence of the edge under each voting scheme (majority, super_majority, absolute_majority, and quorum; see the documentation of the edge_voting function for a detailed explanation of the voting schemes). For this particular example, the consensus network contains:
Method | Number of edges |
---|---|
Majority (> 51% of votes) | 53 |
Super majority (> 66% of votes) | 31 |
Absolute majority (100% of votes) | 2 |
Quorum vote (N/2 + 1 votes) | 31 |
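Counts like the ones in this table can be obtained by summing the voting columns of the consensus data frame. The sketch below uses a small made-up data frame (`consensus_toy`, hypothetical) with the six columns described above, and assumes that edge presence is coded as 0/1.

```r
# Hypothetical consensus table; the values are invented for illustration
consensus_toy <- data.frame(
  x = c("a", "a", "b", "c"),
  y = c("b", "c", "c", "d"),
  majority = c(1, 1, 1, 0),
  super_majority = c(1, 1, 0, 0),
  absolute_majority = c(1, 0, 0, 0),
  quorum = c(1, 1, 0, 0)
)

# Number of edges passing each voting scheme (assumes 0/1 coding)
schemes <- c("majority", "super_majority", "absolute_majority", "quorum")
edge_counts <- colSums(consensus_toy[, schemes])
edge_counts
```

With real results, the same `colSums` call would be applied to the voting columns of `net$consensus_network`.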
You can then use a package like igraph to display the inferred edges. In this example, we look at the edges that pass the super-majority condition:
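library(igraph)
library(dplyr)  # provides %>% and filter()
# Keep only the edges that pass the super-majority vote and build an undirected graph
G <- graph_from_data_frame(net$consensus_network %>% filter(super_majority == 1), directed = FALSE)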
plot(G)
which shows the 31 edges that pass the super-majority vote.
NOTE: this is only an example. Even so, with a set of genes and samples selected at random, the inference recovers a relationship between the lsrR regulator and the lsrG gene. This is expected and lends some confidence to the inference approaches and voting methods used here (see more at EcoCyc).