Skip to content

Commit

Permalink
Rename MDL as BIC
Browse files Browse the repository at this point in the history
  • Loading branch information
franck-simon committed Sep 11, 2024
1 parent 9c06b93 commit 1b54798
Show file tree
Hide file tree
Showing 14 changed files with 49 additions and 49 deletions.
12 changes: 6 additions & 6 deletions R/computeInformation.R
Original file line number Diff line number Diff line change
Expand Up @@ -48,8 +48,8 @@
#' @param cplx [a string]
#' The complexity model:
#' \itemize{
#' \item["mdl"] Minimum description Length
#' \item["nml"] Normalized Maximum Likelihood, less costly compared to "mdl" in
#' \item["bic"] Bayesian Information Criterion
#' \item["nml"] Normalized Maximum Likelihood, less costly compared to "bic" in
#' the finite sample case and will allow for more bins.
#' }
#' @param n_eff [an integer]
Expand Down Expand Up @@ -126,7 +126,7 @@
computeMutualInfo <- function(x, y,
df_conditioning = NULL,
maxbins = NULL,
cplx = c("nml", "mdl"),
cplx = c("nml", "bic"),
n_eff = -1,
sample_weights = NULL,
is_continuous = NULL,
Expand Down Expand Up @@ -352,8 +352,8 @@ computeMutualInfo <- function(x, y,
#' @param cplx [a string]
#' The complexity model:
#' \itemize{
#' \item["mdl"] Minimum description Length
#' \item["nml"] Normalized Maximum Likelihood, less costly compared to "mdl" in
#' \item["bic"] Bayesian Information Criterion
#' \item["nml"] Normalized Maximum Likelihood, less costly compared to "bic" in
#' the finite sample case and will allow for more bins.
#' }
#' @param n_eff [an integer]
Expand Down Expand Up @@ -408,7 +408,7 @@ computeMutualInfo <- function(x, y,
computeThreePointInfo <- function(x, y, z,
df_conditioning = NULL,
maxbins = NULL,
cplx = c("nml", "mdl"),
cplx = c("nml", "bic"),
n_eff = -1,
sample_weights = NULL,
is_continuous = NULL) {
Expand Down
8 changes: 4 additions & 4 deletions R/discretizeMutual.R
Original file line number Diff line number Diff line change
Expand Up @@ -33,9 +33,9 @@
#' The maximum number of bins desired in the discretization. A lower number makes the computation faster, a higher
#' number allows finer discretization (by default : 5 * cubic root of N).
#' @param cplx [a string]
#' The complexity used in the dynamic programming. Either "mdl" for Minimum description Length or
#' The complexity used in the dynamic programming. Either "bic" for Bayesian Information Criterion or
#' "nml" for Normalized Maximum Likelihood, which is less costly in the finite sample case and
#' will allow more bins than mdl.
#' will allow more bins than bic.
#' @param n_eff [an int]
#' The number of effective samples. When there is significant autocorrelation in the samples you may
#' want to specify a number of effective samples that is lower than the number of points in the distribution.
Expand Down Expand Up @@ -258,15 +258,15 @@ discretizeMutual <- function(x,
is_continuous <- !is_discrete

# Pass complexity parameter as int
if (cplx == "mdl") {
if (cplx == "bic") {
intcplx <- 0
} else if (cplx == "nml") {
intcplx <- 1
} else {
warning(
paste0(
"cplx parameter not understood, please specify either ",
"\'mdl\' or \'nml\'. Running with the default option ",
"\'bic\' or \'nml\'. Running with the default option ",
"(nml)."
)
)
Expand Down
6 changes: 3 additions & 3 deletions R/miic.R
Original file line number Diff line number Diff line change
Expand Up @@ -159,13 +159,13 @@
#' your compiler is compatible with openmp if you wish to use multithreading.
#'
#' @param cplx [a string, optional, "nml" by default, possible values:
#' "nml", "mdl"]
#' "nml", "bic"]
#'
#' In practice, the finite size of the input dataset requires that
#' the 2-point and 3-point information measures should be \emph{shifted}
#' by a \emph{complexity} term. The finite size corrections can be based on
#' the Minimal Description Length (MDL) criterion.
#' However, the MDL complexity criterion tends to underestimate the
#' the Bayesian Information Criterion (BIC).
#' However, the BIC complexity term tends to underestimate the
#' relevance of edges connecting variables with many different categories,
#' leading to the removal of false negative edges. To avoid such biases
#' with finite datasets, the (universal) Normalized Maximum Likelihood (NML)
Expand Down
2 changes: 1 addition & 1 deletion R/miic.utils.R
Original file line number Diff line number Diff line change
Expand Up @@ -1047,7 +1047,7 @@ check_parameters <- function (input_data, n_threads, cplx,
mode, negative_info, verbose) {
list_ret = list ("mode" = mode)
list_ret$n_threads = check_param_int (n_threads, "n_threads", 1, min=1, max=NA)
list_ret$cplx = check_param_string (cplx, "complexity", c("nml", "mdl"))
list_ret$cplx = check_param_string (cplx, "complexity", c("nml", "bic"))
list_ret$orientation = check_param_logical (orientation, "orientation", TRUE)

if ( test_param_wrong_float (ort_proba_ratio, min=0, max=1) )
Expand Down
6 changes: 3 additions & 3 deletions man/computeMutualInfo.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 3 additions & 3 deletions man/computeThreePointInfo.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions man/discretizeMutual.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 3 additions & 3 deletions man/miic.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

14 changes: 7 additions & 7 deletions src/computation_continuous.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -267,7 +267,7 @@ void reconstructCutCoarse (const TempVector<int>& cuts_idx,
// This speeds up significantly the dynamic programming at the cost of finding
// approximated solutions.
// - cache: shared pointer, cache area storing previous calculations
// - cplx: int, is the choice of complexity : 0 for simple MDL (product of
// - cplx: int, is the choice of complexity : 0 for simple BIC (product of
// all observed values) and 1 for NML with stochastic complexity and a
// combinatorial term with the previous number of levels.
//
Expand Down Expand Up @@ -374,12 +374,12 @@ void optimizeCutPoints (const TempGrid2d<int>::ConstRow& data_ranked_target,
TempVector<int> cum_counts_target (nb_factors_target);
TempVector<int> cum_counts_joint (nb_factors_joint);
//
// Stochastic complexity used for the simple MDL cost.
// Stochastic complexity used for the simple BIC cost.
// Its value is 0.5 * number_of_other_levels where number of other levels is
// the product of number of unique observed levels of all variables except
// the variable being optimized.
//
double k_mdl = 0.5 * (nb_factors_joint_for_cplx - 1) * nb_factors_target;
double k_bic = 0.5 * (nb_factors_joint_for_cplx - 1) * nb_factors_target;

for (int cum_bin_idx = 0; cum_bin_idx < n_cuts_max; ++cum_bin_idx) {
// Compute info from the first bin to the "cum_bin_idx" bin
Expand Down Expand Up @@ -421,7 +421,7 @@ void optimizeCutPoints (const TempGrid2d<int>::ConstRow& data_ranked_target,
H_cum_bin_target -= cache->getH(weighted_count);

if (cplx == 0 && cum_counts_target[level] > 0)
Hk_cum_bin_target -= k_mdl * cache->getLog(n_samples);
Hk_cum_bin_target -= k_bic * cache->getLog(n_samples);
else if (cplx == 1)
Hk_cum_bin_target -= cache->getLogC(weighted_count, nb_factors_joint_for_cplx);
}
Expand Down Expand Up @@ -498,7 +498,7 @@ void optimizeCutPoints (const TempGrid2d<int>::ConstRow& data_ranked_target,
H_cut_to_cum_target -= cache->getH(weighted_count);

if (cplx == 0 && cut_counts_target[level] > 0)
Hk_cut_to_cum_target -= k_mdl * cache->getLog(n_samples);
Hk_cut_to_cum_target -= k_bic * cache->getLog(n_samples);
else if (cplx == 1)
Hk_cut_to_cum_target -= cache->getLogC(weighted_count, nb_factors_joint_for_cplx);
}
Expand Down Expand Up @@ -545,8 +545,8 @@ void optimizeCutPoints (const TempGrid2d<int>::ConstRow& data_ranked_target,
//------------------------------------------------------------------------------
// Initialize Ik(x,y) with equal bin discretization
// Repeat
// optimize on x Ik(x,y): Hx - Hxy - kmdl
// optimize on y Ik(x,y): Hy - Hxy - kmdl
// optimize on x Ik(x,y): Hx - Hxy - k_bic
// optimize on y Ik(x,y): Hy - Hxy - k_bic
// Until convergence
//------------------------------------------------------------------------------
// Inputs:
Expand Down
24 changes: 12 additions & 12 deletions src/computation_discrete.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@
#include "mutual_information.h"
#include "structure.h"

constexpr int MDL = 0;
constexpr int BIC = 0;
namespace miic {
namespace computation {

Expand Down Expand Up @@ -85,7 +85,7 @@ InfoBlock computeCondMutualInfoDiscrete(const TempGrid2d<int>& data,
// Conclude on current count
if (Nuy > 0) {
Huy -= Nuy * log(Nuy);
if (cplx != MDL) {
if (cplx != BIC) {
logC_uy_x += cache->getLogC(lround(Nuy), rx);
}
Nuy = 0;
Expand All @@ -99,15 +99,15 @@ InfoBlock computeCondMutualInfoDiscrete(const TempGrid2d<int>& data,
for (auto& Nxu : Nux_list) {
if (Nxu > 0) {
Hux -= Nxu * log(Nxu);
if (cplx != MDL) {
if (cplx != BIC) {
logC_ux_y += cache->getLogC(lround(Nxu), ry);
}
Nxu = 0; // reset counter
}
}
if (Nu > 0) {
Hu -= Nu * log(Nu);
if (cplx != MDL) {
if (cplx != BIC) {
auto Nu_long = lround(Nu);
logC_u_x += cache->getLogC(Nu_long, rx);
logC_u_y += cache->getLogC(Nu_long, ry);
Expand All @@ -116,7 +116,7 @@ InfoBlock computeCondMutualInfoDiscrete(const TempGrid2d<int>& data,
}
}

if (cplx == MDL) {
if (cplx == BIC) {
double logN = log(N_total);
logC_ux_y = 0.5 * (ry - 1) * (rx * ru - 1) * logN;
logC_uy_x = 0.5 * (rx - 1) * (ry * ru - 1) * logN;
Expand Down Expand Up @@ -214,15 +214,15 @@ Info3PointBlock computeInfo3PointAndScoreDiscrete(const TempGrid2d<int>& data,
// Conclude on current count
if (Nuy > 0) {
Huy -= Nuy * log(Nuy);
if (cplx != MDL) {
if (cplx != BIC) {
auto Nuy_long = lround(Nuy);
logC_uy_x += cache->getLogC(Nuy_long, rx);
logC_uy_z += cache->getLogC(Nuy_long, rz);
}
for (auto& Nzuy : Nzuy_list) {
if (Nzuy > 0) {
Hzuy -= Nzuy * log(Nzuy);
if (cplx != MDL) {
if (cplx != BIC) {
logC_zuy_x += cache->getLogC(lround(Nzuy), rx);
}
Nzuy = 0;
Expand All @@ -238,7 +238,7 @@ Info3PointBlock computeInfo3PointAndScoreDiscrete(const TempGrid2d<int>& data,
if (Nu == 0) continue;

Hu -= Nu * log(Nu);
if (cplx != MDL) {
if (cplx != BIC) {
auto Nu_long = lround(Nu);
logC_u_x += cache->getLogC(Nu_long, rx);
logC_u_y += cache->getLogC(Nu_long, ry);
Expand All @@ -249,7 +249,7 @@ Info3PointBlock computeInfo3PointAndScoreDiscrete(const TempGrid2d<int>& data,
for (auto& Nzu : Nzu_list) {
if (Nzu > 0) {
Hzu -= Nzu * log(Nzu);
if (cplx != MDL) {
if (cplx != BIC) {
auto Nzu_long = lround(Nzu);
logC_zu_x += cache->getLogC(Nzu_long, rx);
logC_zu_y += cache->getLogC(Nzu_long, ry);
Expand All @@ -263,7 +263,7 @@ Info3PointBlock computeInfo3PointAndScoreDiscrete(const TempGrid2d<int>& data,
if (Nux == 0) continue;

Hux -= Nux * log(Nux);
if (cplx != MDL) {
if (cplx != BIC) {
auto Nux_long = lround(Nux);
logC_ux_y += cache->getLogC(Nux_long, ry);
logC_ux_z += cache->getLogC(Nux_long, rz);
Expand All @@ -275,7 +275,7 @@ Info3PointBlock computeInfo3PointAndScoreDiscrete(const TempGrid2d<int>& data,
if (Nzux == 0) continue;

Hzux -= Nzux * log(Nzux);
if (cplx != MDL) {
if (cplx != BIC) {
logC_zux_y += cache->getLogC(lround(Nzux), ry);
}
Nzux_list(j, l) = 0;
Expand All @@ -284,7 +284,7 @@ Info3PointBlock computeInfo3PointAndScoreDiscrete(const TempGrid2d<int>& data,
}

// check maximum mutual infos - cplx terms
if (cplx == MDL) {
if (cplx == BIC) {
double logN = log(N_total);
logC_ux_y = 0.5 * (ry - 1) * (rx * ru - 1) * logN;
logC_uy_x = 0.5 * (rx - 1) * (ry * ru - 1) * logN;
Expand Down
2 changes: 1 addition & 1 deletion src/environment.h
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,7 @@ struct Environment {
bool latent_orientation = false;
// Whether or not do MAR (Missing at random) test using KL-divergence
bool test_mar = false;
// Complexity mode. 0: mdl 1: nml
// Complexity mode. 0: bic (formerly mdl) 1: nml
int cplx = 1;
// List of ids of edge whose status is not yet determined
vector<EdgeID> unsettled_list;
Expand Down
4 changes: 2 additions & 2 deletions src/get_information.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -82,9 +82,9 @@ Info3PointBlock getInfo3Point(
environment.data_double, X, Y, ui_list, environment.levels,
environment.is_continuous, n_samples_non_na_z, levels_red,
sample_is_not_NA, environment.noise_vec);
double cplxMdl = environment.cache.cterm->getLog(n_samples_non_na_z);
double cplxBic = environment.cache.cterm->getLog(n_samples_non_na_z);

if ((kldiv - cplxMdl) > 0) {
if ((kldiv - cplxBic) > 0) {
// The sample is not representative of the population, hence for 3-point
// information, we cannot draw conclusion the unshielded triple (X, Z,
// Y), return 0; For contributing score, Z is not a good candidate.
Expand Down
2 changes: 1 addition & 1 deletion src/mutual_information.h
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ using std::lround;
constexpr double kPrecision = 1.e-10;

// rux: number of levels of each (joint) variable [x, u, ux]
// cplx 0: MDL, 1: NML
// cplx 0: BIC, 1: NML
// flag (for cplx == 1 only) 0: mutual info, 1: conditional mutual info
// When flag == 1 && cplx == 1, x and u are not symmetrical, x represents single
// variable, whereas u represents joint variable (see def of cond mutual info)
Expand Down
2 changes: 1 addition & 1 deletion src/r_cpp_interface.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -79,7 +79,7 @@ void setEnvironmentFromR(const Rcpp::List& input_data,
environment.test_mar = as<bool>(arg_list["test_mar"]);

if (arg_list.containsElementNamed("cplx")) {
if (as<std::string>(arg_list["cplx"]).compare("mdl") == 0)
if (as<std::string>(arg_list["cplx"]).compare("bic") == 0)
environment.cplx = 0;
}

Expand Down

0 comments on commit 1b54798

Please sign in to comment.