diff --git a/.Rbuildignore b/.Rbuildignore index c63a3c5..2534116 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -1,7 +1,11 @@ ^.*\.Rproj$ ^\.Rproj\.user$ -resemble_logo.png code_ex.R .travis.yml ^packrat/ ^\.Rprofile$ +^cran-comments\.md$ +^my-comments\.md$ +logo.R +^Rscript* +logo_large.png \ No newline at end of file diff --git a/.travis.yml b/.travis.yml index feb47ca..474258b 100644 --- a/.travis.yml +++ b/.travis.yml @@ -22,7 +22,6 @@ r_packages: - covr - data.table - magrittr - - dplyr - testthat after_success: diff --git a/DESCRIPTION b/DESCRIPTION index 5055167..a7dfba0 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,46 +1,23 @@ Package: resemble Type: Package -Title: - Regression and Similarity Evaluation for Memory-Based Learning in - Spectral Chemometrics +Title: Memory-Based Learning in Spectral Chemometrics Version: 2.0.0 -Date: 2020-10-12 -Authors@R: - c(person(given = "Leonardo", - family = "Ramirez-Lopez", - role = c("aut", "cre"), - email = "ramirez.lopez.leo@gmail.com", - comment = c(ORCID = "0000-0002-5369-5120")), - person(given = "Antoine", - family = "Stevens", - role = "ctb", - comment = c(ORCID = "0000-0002-1588-7519")), - person(given = "Raphael", - family = "Viscarra Rossel", - role = "ctb", - comment = c(ORCID = "0000-0003-1540-4748")), - person(given = "Craig", - family = "Lobsey", - role = "ctb", - comment = c(ORCID = "0000-0001-5416-8640")), - person(given = "Alex", - family = "Wadoux", - role = "ctb", - comment = c(ORCID = "0000-0001-7325-9716")), - person(given = "Timo", - family = "Breure", - role = "ctb", - comment = c(ORCID = "0000-0001-5695-8064")) - ) +Date: 2020-10-29 +Author: Leonardo Ramirez-Lopez [aut, cre], + Antoine Stevens [aut, ctb], + Raphael Viscarra Rossel [ctb], + Craig Lobsey [ctb], + Alex Wadoux [ctb], + Timo Breure [ctb] Maintainer: Leonardo Ramirez-Lopez <ramirez.lopez.leo@gmail.com> BugReports: https://github.com/l-ramirez-lopez/resemble/issues Description: - Functions for spectral similarity/dissimilarity - analysis and memory-based learning (MBL) for non-linear modeling - in complex spectral data sets. In chemometrics MBL is also known - as local modeling. + Functions for dissimilarity analysis and memory-based learning + (MBL, a.k.a. local modeling) in complex spectral data sets. + Most of these functions are based on the methods presented in + Ramirez-Lopez et al. (2013) . License: MIT + file LICENSE URL: http://l-ramirez-lopez.github.io/resemble/ Depends: @@ -49,26 +26,23 @@ Imports: foreach, iterators, Rcpp (>= 1.0.3), - dplyr (>= 0.7.0), mathjaxr (>= 1.0), magrittr (>= 1.5.0), - lifecycle (>= 0.2.0), + lifecycle (>= 0.2.0), data.table (>= 1.9.8) Suggests: prospectr, parallel, - doParallel, - testthat, + doParallel, + testthat, formatR, - bookdown, rmarkdown, + bookdown, knitr LinkingTo: - Rcpp, + Rcpp, RcppArmadillo -RdMacros: - mathjaxr, - lifecycle +RdMacros: mathjaxr VignetteBuilder: knitr NeedsCompilation: yes LazyData: true diff --git a/NAMESPACE b/NAMESPACE index d1550e3..8660bc7 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -31,8 +31,7 @@ import(grDevices) import(graphics) import(iterators) import(lifecycle) -importFrom(dplyr,if_else) -importFrom(dplyr,select) +import(mathjaxr) importFrom(graphics,barplot) importFrom(lifecycle,deprecate_soft) importFrom(magrittr,"%>%") diff --git a/NEWS b/NEWS index 7e57755..a00c032 100644 --- a/NEWS +++ b/NEWS @@ -1,20 +1,130 @@ # News for the resemble package -## Version 2.0 (gordillo) - -* 11.10.2020 -New vignette!
+## resemble 2.0 (gordillo) * 02.07.2020 During the recent lockdown we had the chance to invest enough time in the development of a new version of the package resemble. This new version comes with significant improvements as well as new functionality. For example, it now matches the tidyverse R style guide, it includes unit tests, includes new -functionality for modeling, mbl is bit faster and a bit less memory-hungry, etc. -See changes.md file for further details. - - -## Version 1.3 (never released) +functionality for modeling, mbl is faster and less memory-hungry. + +### New features +- search_neighbors() function. +- dissimilarity() function. + +### Improvements and fixes +- mbl is faster and less memory-hungry. +- New vignette. +- Unit tests have been introduced using the testthat package. + +### Breaking changes +The renames below are illustrated in a short migration sketch further down this file. + +#### orthoProjection, pcProjection, plsProjection (renamed to ortho_projection, +pc_projection, pls_projection respectively): +- X2 argument renamed to Xu (for consistency throughout all the functions) +- Argument scaled renamed to scale +- Argument max.iter renamed to max_iter +- Bug fix: when the "pca.nipals" method was used and the method to select the pcs was "opc", + the function was returning one component fewer than the maximum requested. +- "pca.nipals" is now implemented in C++ via Rcpp +- Bug fix in plsProjection: when "cumvar" was used as the pcSelection method, an + error about data allocation in a matrix was thrown +- Argument pcSelection renamed to pc_selection +- ... is deprecated in both pcProjection and plsProjection +- Argument cores is deprecated; it was used to set the number of cores in some +C++ internal functions via OpenMP in Rcpp. +- The names of the following outputs have been changed: X.loadings, Y.loadings, sc.sdv +and n.components. Their new names are: X_loadings, Y_loadings, sc_sdv and +n_components. + +#### corDiss (renamed to cor_diss): +- X2 argument renamed to Xu (for consistency throughout all the functions) +- Argument scaled renamed to scale +- Default for scale has changed from TRUE to FALSE +- The dimnames of the resulting matrix are now Xr_1... Xr_n (previously Xr.1... Xr.n) + +#### fDiss (renamed to f_diss): +- X2 argument renamed to Xu (for consistency throughout all the functions) +- Argument scaled renamed to scale +- Default for scale has changed from TRUE to FALSE +- The dimnames of the resulting matrix are now Xr_1... Xr_n (previously Xr.1... Xr.n) +- Argument method renamed to diss_method + +#### sid: +- X2 argument renamed to Xu (for consistency throughout all the functions) +- Argument scaled renamed to scale +- Default for scale has changed from TRUE to FALSE +- The dimnames of the resulting matrix are now Xr_1... Xr_n (previously Xr.1... Xr.n) + + +#### orthoDiss (renamed to ortho_diss): +- X2 argument renamed to Xu (for consistency throughout all the functions) +- Argument scaled renamed to scale +- Argument local renamed to .local +- Argument pcSelection renamed to pc_selection +- Argument return.all renamed to compute_all +- Argument cores is deprecated; it was used to set the number of cores in some +C++ internal functions via OpenMP in Rcpp. +- When \code{.local = TRUE} a new output is produced: 'neighborhood_info', which +is a data.frame containing the relevant information about the neighborhood of +each sample (e.g.
neighborhood indices, number of components used at each +neighborhood, etc) +- Output global.variance.info has been renamed to global_variance_info + + +#### simEval (renamed to sim_eval): +- Argument sideInf renamed to side_info +- Argument lower.tri renamed to lower_triangle +- Argument cores renamed to omp_threads +- lower_triangle is deprecated. Now if a vector is passed to d, the function assumes + that it is a lower triangle of a distance matrix +- Argument cores is deprecated; it was used to set the number of cores in some +C++ internal functions via OpenMP in Rcpp. + + +#### mbl +- pls.max.iter, pls.tol and noise.v were moved from mbl_control() to mbl +- Argument scaled (from mbl_control()) renamed to scale and moved to mbl +- New arguments: gh and spike +- Argument pcSelection renamed to pc_selection +- Argument mblCtrl renamed to control +- Argument dissUsage renamed to diss_usage +- The order of the Yr, Xr, Yu and Xu arguments has changed to Xr, Yr, Xu and Yu +- The input type for the argument method has changed. Previously it received a +character string indicating the type of local regression (i.e. "pls", +"wapls1" or "gpr"). Now it receives an object of class local_fit, which is output +by the new local_fit functions. +- dissimilarityM has been deprecated. It was used to pass a dissimilarity matrix +computed outside the mbl function. This can now be done with the new argument +diss_method of mbl, which was previously named "sm" and was part of mblControl() + + +#### neigCleaning (now search_neighbors) +- Function renamed to search_neighbors +- Argument cores is deprecated; it was used to set the number of cores in some +C++ internal functions via OpenMP in Rcpp. + + +#### mblControl (renamed to mbl_control): +- sm argument is deprecated. Now the dissimilarity metric is an argument of the +mbl function +- scale and center arguments have been moved to the mbl function +- Argument range.pred.lim renamed to range_prediction_limits +- Argument cores is deprecated; it was used to set the number of cores in some +C++ internal functions via OpenMP in Rcpp. +- k0, pcMethod, ghMethod are deprecated +- localOptimization has been renamed to tune_locally +- valMethod has been renamed to validation_type +- Option "loc_crossval" in validation_type has been renamed to "local_cv" + +#### plot.mbl +- Option "pca" was replaced by option "gh", which plots the pls projection used +for computing the gh distance in mbl + + + +## resemble 1.3 (never released) * 11.04.2020 The option 'movcor' for the argument sm of mblControl() is deprecated. The @@ -37,7 +147,7 @@ differences, dividing it by the number of variables and then computing the square root, i.e. sqrt(mean((Xi-Xj)^2)/ncol(Xi)). This bug had no effect on the computations of the nearest neighbors. -## Version 1.2.0001 (alma de coco) +## resemble 1.2.0001 (alma de coco) * 13.09.2016 A bug in the computation of the Mahalanobis distance in the PLS space was fixed. @@ -60,7 +170,7 @@ corrected in the documentation. the projection matrix (projectionM) returned by plsProjection now contains only the columns corresponding to the number of components retrieved -## Version 1.2 (alma de coco) +## resemble 1.2 (alma de coco) * 04.03.2016 A patch was released for a strange bug that prevented mbl from running in parallel when the gpr method was used. @@ -96,5 +206,5 @@ The function movcorDist was removed, since it was included by mistake in the previous version of the package. The corDiss function can be used as a replacement for movcorDist.
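As referenced in the breaking changes above, the following minimal migration sketch (not part of the original changelog) illustrates some of the 2.0 renames; `my_spectra`, `new_spectra` and `yr` are hypothetical placeholder objects, and the pre-2.0 calls are shown as comments:

```r
library(resemble)

# resemble < 2.0 (old API):
# d <- fDiss(Xr = my_spectra, X2 = new_spectra, method = "euclid", scaled = FALSE)
# p <- orthoProjection(Xr = my_spectra, X2 = new_spectra, Yr = yr,
#                      pcSelection = list("opc", 30))

# resemble >= 2.0 (new API): X2 -> Xu, method -> diss_method, scaled -> scale,
# pcSelection -> pc_selection
d <- f_diss(Xr = my_spectra, Xu = new_spectra,
            diss_method = "euclid", scale = FALSE)
p <- ortho_projection(Xr = my_spectra, Xu = new_spectra, Yr = yr,
                      pc_selection = list("opc", 30))
```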
-## Versions 1.0 and 1.1.1 +## resemble 1.1.1 * 19.03.2013 Hello world! Initial release of the package diff --git a/R/AAA.R b/R/AAA.R index 616e439..d2f226e 100644 --- a/R/AAA.R +++ b/R/AAA.R @@ -2,9 +2,22 @@ .onAttach <- function(lib, pkg) { # assign("gpclib", FALSE, envir=.RESEMBLE_CACHE) - resemble_v <- read.dcf(file = system.file("DESCRIPTION", package = pkg), fields = "Version") - packageStartupMessage(paste0("\033[34m", pkg, " version ", resemble_v, " -- 'gordillo'\033[39m")) - packageStartupMessage("\033[34mcheck the package repository at https://github.com/l-ramirez-lopez/resemble/\033[39m") + resemble_v <- read.dcf( + file = system.file("DESCRIPTION", package = pkg), + fields = "Version" + ) + mss <- paste0( + "\033[34m", + pkg, " version ", + resemble_v, + " -- 'gordillo'\033[39m" + ) + mss2 <- paste0( + "\033[34mcheck the package repository at: ", + "https://github.com/l-ramirez-lopez/resemble/\033[39m" + ) + packageStartupMessage(mss) + packageStartupMessage(mss2) } # .onUnload <- function(libpath) { diff --git a/R/RcppExports.R b/R/RcppExports.R index fb3ff76..d28e22f 100644 --- a/R/RcppExports.R +++ b/R/RcppExports.R @@ -17,26 +17,18 @@ fast_diss <- function(X, Y, method) { .Call('_resemble_fast_diss', PACKAGE = 'resemble', X, Y, method) } -fast_diss_vector <- function(X) { - .Call('_resemble_fast_diss_vector', PACKAGE = 'resemble', X) } - -#' @title A fast (serial) algorithm of Euclidean (non-squared) cross-distance for vectors written in C++ -#' @description A fast algorithm of (squared) Euclidean cross-distance for vectors written in C++ +#' @title A fast algorithm of (squared) Euclidean cross-distance for vectors written in C++ +#' @description A fast (parallel for Linux) algorithm of (squared) Euclidean cross-distance for vectors written in C++ #' @usage -#' fastDistVVL(X) -#' @param X a vector +#' fast_diss_vector(X) +#' @param X a vector. #' @return a vector of distances (lower triangle of the distance matrix, stored by column) #' @details used internally in ortho_projection -#' @author Leo Ramirez-Lopez +#' @author Antoine Stevens #' @keywords internal #' @useDynLib resemble -fastDistVVL <- function(X) { - .Call('_resemble_fastDistVVL', PACKAGE = 'resemble', X) -} - -minDissV <- function(X) { - .Call('_resemble_minDissV', PACKAGE = 'resemble', X) +fast_diss_vector <- function(X) { + .Call('_resemble_fast_diss_vector', PACKAGE = 'resemble', X) } #' @title Moving/rolling correlation distance of two matrices @@ -54,6 +46,38 @@ moving_cor_diss <- function(X, Y, w) { .Call('_resemble_moving_cor_diss', PACKAGE = 'resemble', X, Y, w) } +#' @title A function to compute row-wise index of minimum values of a square distance matrix +#' @description For internal use only +#' @usage +#' which_min(X) +#' @param X a square matrix of distances +#' @return a vector of the indices of the minimum value in each row of the input matrix +#' @details Used internally to find the nearest neighbors +#' @keywords internal +#' @useDynLib resemble +#' @author Antoine Stevens +which_min <- function(X) { + .Call('_resemble_which_min', PACKAGE = 'resemble', X) +} + +#' @title A function to compute indices of minimum values of a distance vector +#' @description For internal use only +#' @usage +#' which_min_vector(X) +#' @param X a vector of distances +#' @return a vector of the indices of the nearest neighbors +#' @details +#' Used internally to find the nearest neighbors. +#' It searches in a lower (or upper) triangular matrix. Therefore, this must be the format of the +#' input data.
The piece of code int \code{len = (sqrt(X.size()*8+1)+1)/2} generated an error on CRAN, +#' since \code{sqrt} cannot be applied to integers. +#' @keywords internal +#' @useDynLib resemble +#' @author Antoine Stevens +which_min_vector <- function(X) { + .Call('_resemble_which_min_vector', PACKAGE = 'resemble', X) +} + #' @title Function for identifying the column in a matrix with the largest standard deviation #' @description Identifies the column with the largest standard deviation. For internal use only! #' @usage get_col_largest_sd(X) @@ -311,8 +335,8 @@ project_opls <- function(projection_mat, ncomp, newdata, scale, Xcenter, Xscale) #' @author Leonardo Ramirez-Lopez #' @keywords internal #' @useDynLib resemble -reconstruction_error <- function(x, projection_mat, xloadings) { - .Call('_resemble_reconstruction_error', PACKAGE = 'resemble', x, projection_mat, xloadings) +reconstruction_error <- function(x, projection_mat, xloadings, scale, Xcenter, Xscale) { + .Call('_resemble_reconstruction_error', PACKAGE = 'resemble', x, projection_mat, xloadings, scale, Xcenter, Xscale) } #' @title Internal Cpp function for computing the weights of the PLS components @@ -493,35 +517,3 @@ pca_nipals <- function(X, ncomp, center, scale, maxiter, tol, pcSelmethod = "var .Call('_resemble_pca_nipals', PACKAGE = 'resemble', X, ncomp, center, scale, maxiter, tol, pcSelmethod, pcSelvalue) } -#' @title A function to compute row-wise index of minimum values of a square distance matrix -#' @description For internal use only -#' @usage -#' which_min(X) -#' @param X a square matrix of distances -#' @return a vector of the indices of the minimum value in each row of the input matrix -#' @details Used internally to find the nearest neighbors -#' @keywords internal -#' @useDynLib resemble -#' @author Antoine Stevens -which_min <- function(X) { - .Call('_resemble_which_min', PACKAGE = 'resemble', X) -} - -#' @title A function to compute indices of minimum values of a distance vector -#' @description For internal use only -#' @usage -#' which_min_vector(X) -#' @param X a vector of distances -#' @return a vector of the indices of the nearest neighbors -#' @details -#' Used internally to find the nearest neighbors. -#' It searches in lower (or upper) triangular matrix. Therefore this must be the format of the -#' input data. The piece of code int \code{len = (sqrt(X.size()*8+1)+1)/2} generated an error in CRAN -#' since \code{sqrt} cannot be applied to integers. -#' @keywords internal -#' @useDynLib resemble -#' @author Antoine Stevens -which_min_vector <- function(X) { - .Call('_resemble_which_min_vector', PACKAGE = 'resemble', X) -} - diff --git a/R/cor_diss.R b/R/cor_diss.R index 214ecf8..2a19359 100644 --- a/R/cor_diss.R +++ b/R/cor_diss.R @@ -1,8 +1,7 @@ #' @title Correlation and moving correlation dissimilarity measurements (cor_diss) #' @description #' \loadmathjax -#' -#' \lifecycle{stable} +#' \ifelse{html}{\out{Stable lifecycle}}{\strong{Stable}} #' #' Computes correlation and moving correlation dissimilarity matrices. #' @usage @@ -22,11 +21,11 @@ #' of \mjeqn{Xr \cup Xu}{Xr U Xu}.
#' @details #' The correlation dissimilarity \mjeqn{d}{d} between two observations -#' \mjeqn{x_i}{x_i} and \mjeqn{x_j}{x_j} is based on the Perason's -#' correlation coefficient (\mjeqn{\rho}{\rho}) and it can be computed as +#' \mjeqn{x_i}{x_i} and \mjeqn{x_j}{x_j} is based on Pearson's +#' correlation coefficient (\mjeqn{\rho}{\rho}) and it can be computed as #' follows: #' -#' \mjdeqn{d(x_i, x_j) = \frac{1}{2}(1 - \rho(x_i, x_j))}{d(x_i, x_j) = 1/2 (1 - \rho(x_i, x_j))} +#' \mjdeqn{d(x_i, x_j) = \frac{1}{2}(1 - \rho(x_i, x_j))}{d(x_i, x_j) = 1/2 (1 - \rho(x_i, x_j))} #' #' The above formula is used when \code{ws = NULL}. #' On the other hand (when \code{ws != NULL}) the moving correlation @@ -44,7 +43,7 @@ #' a matrix of the computed dissimilarities. #' @author Antoine Stevens and \href{https://orcid.org/0000-0002-5369-5120}{Leonardo Ramirez-Lopez} #' @examples -#' \dontrun{ +#' \donttest{ #' library(prospectr) #' data(NIRsoil) #' diff --git a/R/dissimilarity.R b/R/dissimilarity.R index 668a84d..0d576b9 100644 --- a/R/dissimilarity.R +++ b/R/dissimilarity.R @@ -2,8 +2,6 @@ #' @description #' \loadmathjax #' -#' \lifecycle{maturing} -#' #' This is a wrapper that integrates the different dissimilarity functions #' offered by the package. It computes the dissimilarities between observations in #' numerical matrices by using a specified dissimilarity measure. @@ -78,30 +76,34 @@ #' (of set of observations) is the one for which its distance matrix #' minimizes the differences between the \code{Yr} value of each #' observation and the \code{Yr} value of its closest observation. In this -#' case \code{value} must be a value (larger than 0 and -#' below \code{min(nrow(Xr), nrow(Xu), ncol(Xr))}) indicating the maximum +#' case \code{value} must be a value (larger than 0 and +#' below the minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} +#' combined) indicating the maximum #' number of principal components to be tested. See the #' \code{\link{ortho_projection}} function for more details.} #' #' \item{\code{"cumvar"}:}{ selection of the principal components based #' on a given cumulative amount of explained variance. In this case, #' \code{value} must be a value (larger than 0 and below or equal to 1) -#' indicating the minimum amount of cumulative variance that the +#' indicating the minimum amount of cumulative variance that the #' combination of retained components should explain.} #' #' \item{\code{"var"}:}{ selection of the principal components based #' on a given amount of explained variance. In this case, #' \code{value} must be a value (larger than 0 and below or equal to 1) -#' indicating the minimum amount of variance that a single component +#' indicating the minimum amount of variance that a single component #' should explain in order to be retained.} #' #' \item{\code{"manual"}:}{ for manually specifying a fixed number of #' principal components. In this case, \code{value} must be a value -#' (larger than 0 and \code{min(nrow(Xr), nrow(Xu), ncol(Xr))}). +#' (larger than 0 and +#' below the minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} +#' combined) #' indicating the number of principal components to be retained.} #' } -#' The default list passed is \code{list(method = "var", value = 0.01)}. +#' The default is \code{list(method = "var", value = 0.01)}. +#' #' Optionally, the \code{pc_selection} argument admits \code{"opc"} or #' \code{"cumvar"} or \code{"var"} or \code{"manual"} as a single character #' string.
In such a case the default \code{"value"} when either \code{"opc"} or @@ -156,6 +158,7 @@ #' \item{\code{projection}:}{ an \code{ortho_projection} object. Only output #' if \code{return_projection = TRUE} and if \code{diss_method = "pca"}, #' \code{diss_method = "pca.nipals"} or \code{diss_method = "pls"}. +#' #' This object contains the projection used to compute #' the dissimilarity matrix. In case of local dissimilarity matrices, #' the projection corresponds to the global projection used to select the @@ -163,16 +166,16 @@ #' details).} #' #' \item{\code{gh}:}{ a list containing the GH distances as well as the -#' pls projection used (\code{ortho_projection} object) to compute the GH.} +#' pls projection used to compute the GH.} #' } #' @author \href{https://orcid.org/0000-0002-5369-5120}{Leonardo Ramirez-Lopez} #' @examples -#' \dontrun{ #' library(prospectr) #' data(NIRsoil) #' -#' # Filter the data using the first derivative with Savitzky and Golay smoothing -#' filter and a window size of 11 spectral variables and a polynomial order of 4 +#' # Filter the data using the first derivative with Savitzky and Golay +#' # smoothing filter and a window size of 15 spectral variables and a +#' # polynomial order of 4 #' sg <- savitzkyGolay(NIRsoil$spc, m = 1, p = 4, w = 15) #' #' # Replace the original spectra with the filtered ones @@ -197,7 +200,6 @@ #' pc_selection = list("opc", 30), #' return_projection = TRUE #' ) -#' } #' @export #' ## History: diff --git a/R/f_diss.R b/R/f_diss.R index 7ff80dd..acf2db2 100644 --- a/R/f_diss.R +++ b/R/f_diss.R @@ -1,8 +1,7 @@ #' @title Euclidean, Mahalanobis and cosine dissimilarity measurements #' @description #' \loadmathjax -#' -#' \lifecycle{stable} +#' \ifelse{html}{\out{Stable lifecycle}}{\strong{Stable}} #' #' This function is used to compute the dissimilarity between observations #' based on Euclidean or Mahalanobis distance measures or on cosine @@ -13,7 +12,8 @@ #' @param Xr a matrix containing the (reference) data. #' @param Xu an optional matrix containing data of a second set of observations #' (samples). -#' @param diss_method the method for computing the dissimilarity matrix. +#' @param diss_method the method for computing the dissimilarity between +#' observations. #' Options are \code{"euclid"} (Euclidean distance), \code{"mahalanobis"} #' (Mahalanobis distance) and \code{"cosine"} (cosine distance, a.k.a. spectral #' angle mapper). See details. @@ -24,7 +24,6 @@ #' must be scaled. If \code{Xu} is provided the data is scaled on the basis #' of \mjeqn{Xr \cup Xu}{Xr U Xu}. #' @details -#' #' The results obtained for Euclidean dissimilarity are equivalent to those #' returned by the [stats::dist()] function, but are scaled #' differently. However, \code{f_diss} is considerably faster (which can be @@ -74,7 +73,7 @@ #' a matrix of the computed dissimilarities.
#' @author \href{https://orcid.org/0000-0002-5369-5120}{Leonardo Ramirez-Lopez} and Antoine Stevens #' @examples -#' \dontrun{ +#' \donttest{ #' library(prospectr) #' data(NIRsoil) #' @@ -161,37 +160,37 @@ f_diss <- function(Xr, Xu = NULL, diss_method = "euclid", if (sum(is.na(Xr)) > 0) { stop("Matrices with missing values are not accepted") } - + n_method <- diss_method[[1]] if (length(diss_method) > 1) { message(paste("More than one diss_method was specified, only", n_method, "was used.")) } if (!n_method %in% c("euclid", "mahalanobis", "cosine")) { stop("'diss_method' must be one of: 'euclid', 'mahalanobis' or 'cosine'") } - + if (!is.logical(center)) { stop("'center' must be logical") } - + if (!is.logical(scale)) { stop("'scale' must be logical") } - + if (center | scale | n_method %in% c("mahalanobis", "euclid")) { X <- rbind(Xr, Xu) - + if (center) { X <- sweep(x = X, MARGIN = 2, FUN = "-", STATS = colMeans(X)) } - + if (scale) { X <- sweep(x = X, MARGIN = 2, FUN = "/", STATS = get_col_sds(X)) } - + if (n_method == "mahalanobis") { if (nrow(X) < ncol(X)) { stop("For computing the Mahalanobis distance, the total number of observations (rows) \n must be larger than the number of variables (columns).") } @@ -202,7 +201,7 @@ f_diss <- function(Xr, Xu = NULL, diss_method = "euclid", } n_method <- "euclid" } - + if (!is.null(Xu)) { Xu <- X[(nrow(X) - nrow(Xu) + 1):nrow(X), , drop = FALSE] Xr <- X[1:(nrow(X) - nrow(Xu)), , drop = FALSE] @@ -211,10 +210,10 @@ f_diss <- function(Xr, Xu = NULL, diss_method = "euclid", } rm(X) } - + if (!is.null(Xu)) { ## FIXME check numerical precision in Rcpp - ## in some cases it returns 0s as -1e-14 + ## in some cases it returns 0s as -1e-14 ## perhaps due to reuse memory? rslt <- abs(fast_diss(Xu, Xr, n_method)) if (n_method == "euclid") { @@ -224,7 +223,7 @@ rownames(rslt) <- paste("Xr", 1:nrow(Xr), sep = "_") } else { ## FIXME check numerical precision in Rcpp - ## in some cases it returns 0s as -1e-14 + ## in some cases it returns 0s as -1e-14 ## perhaps due to reuse memory?
rslt <- abs(fast_diss(Xr, Xr, n_method)) if (n_method == "euclid") { @@ -236,7 +235,7 @@ f_diss <- function(Xr, Xu = NULL, diss_method = "euclid", if (diss_method == "cosine") { rslt[is.nan(rslt)] <- 0 } - + rslt } @@ -248,25 +247,25 @@ f_diss <- function(Xr, Xu = NULL, diss_method = "euclid", #' @importFrom stats cov euclid_to_mahal <- function(X, sm_method = c("svd", "eigen")) { nms <- dimnames(X) - + if (ncol(X) > nrow(X)) { stop("In order to project the matrix to a Mahalanobis space, the number of observations of the input matrix must be larger than its number of variables") } - + if (length(sm_method) > 1) { sm_method <- sm_method[1] } if (!(sm_method %in% c("svd", "eigen"))) { stop("sm_method must be one of 'svd', 'eigen'") } - + X <- as.matrix(X) vcv <- cov(X) sq_vcv <- sqrt_sm(vcv, method = sm_method) sq_S <- solve(sq_vcv) ms_x <- X %*% sq_S dimnames(ms_x) <- nms - + ms_x } @@ -284,7 +283,7 @@ sqrt_sm <- function(X, method = c("svd", "eigen")) { if (!(method %in% c("svd", "eigen"))) { stop("method must be one of 'svd', 'eigen'") } - + if (method == "svd") { ## REPLACE BY arma::svd(U, S, V, X, "dc") out <- svd(X) @@ -292,7 +291,7 @@ sqrt_sm <- function(X, method = c("svd", "eigen")) { U <- out$v return(U %*% (D^0.5) %*% t(U)) } - + if (method == "eigen") { out <- eigen(X) D <- diag(out$values) diff --git a/R/get_predictions.R b/R/get_predictions.R index 489b2ff..89d48b4 100644 --- a/R/get_predictions.R +++ b/R/get_predictions.R @@ -1,7 +1,7 @@ #' @title Extract predictions from an object of class \code{mbl} #' @description #' -#' \lifecycle{stable} +#' \ifelse{html}{\out{Stable lifecycle}}{\strong{Stable}} #' #' Extract predictions from an object of class \code{mbl} #' @usage diff --git a/R/local_fit.R b/R/local_fit.R index a293523..9dd4aed 100644 --- a/R/local_fit.R +++ b/R/local_fit.R @@ -5,9 +5,6 @@ #' @aliases local_fit_wapls #' @aliases local_fit_gpr #' @description -#' -#' \lifecycle{maturing} -#' #' \loadmathjax #' These functions define the way in which each local fit/prediction is done #' within each iteration in the \code{\link{mbl}} function. @@ -92,15 +89,13 @@ #' #' @seealso \code{\link{mbl}} #' @examples -#' \dontrun{ #' local_fit_wapls(min_pls_c = 3, max_pls_c = 12) -#' } #' @export ## History: ## 28.05.2020 Leo Hello world! -## 19.07.2020 Leo arguments pls_max_iter and pls_tol were removed fas they -## are only required when modleing for more than one response +## 19.07.2020 Leo arguments pls_max_iter and pls_tol were removed as they +## are only required when modeling for more than one response ## variable, i.e. pls2 (which is not implemented for mbl) @@ -154,7 +149,8 @@ local_fit_wapls <- function(min_pls_c, max_pls_c) { fit_type <- list( method = "wapls", - pls_c = c(min_pls_c = min_pls_c, max_pls_c = max_pls_c)) + pls_c = c(min_pls_c = min_pls_c, max_pls_c = max_pls_c) + ) class(fit_type) <- c("local_fit", "list") fit_type diff --git a/R/local_helpers.R b/R/local_helpers.R index ed2c6ee..fa18830 100644 --- a/R/local_helpers.R +++ b/R/local_helpers.R @@ -16,7 +16,6 @@ get_neighbor_info <- function(Xr, Xu, diss_method, Yr = NULL, diss_usage, allow_parallel = FALSE, ...)
{ ortho_diss_methods <- c("pca", "pca.nipals", "pls") - k_max <- NULL if (!is.null(k)) { k <- sort(k) @@ -156,7 +155,7 @@ pls_cv <- function(x, y, ncomp, y = y, p = p, number = number, - group = group, + group = group, replacement = FALSE ) @@ -193,7 +192,7 @@ pls_cv <- function(x, y, ncomp, val <- NULL val$resamples <- cv_samples$hold_out - + if (method == "pls") { val$cv_results <- data.table( npls = 1:ncomp, @@ -430,7 +429,7 @@ ith_mbl_neighbor <- function(Xr, Xu, Yr, Yu = NULL, ith_xr_neighbors <- cbind(ith_local_xr_xr_diss, ith_xr_neighbors) ith_xu <- cbind(ith_neigh_diss, ith_xu) } - + if (!is.null(group)) { ith_group <- factor(group[ith_neighborhood]) } else { diff --git a/R/mbl.R b/R/mbl.R index 2c4e5c4..973bf06 100644 --- a/R/mbl.R +++ b/R/mbl.R @@ -1,8 +1,5 @@ #' @title A function for memory-based learning (mbl) #' @description -#' -#' \lifecycle{maturing} -#' #' \loadmathjax #' This function is implemented for memory-based learning (a.k.a. #' instance-based learning or local regression) which is a non-linear lazy @@ -15,13 +12,12 @@ #' regression model. #' @usage #' mbl(Xr, Yr, Xu, Yu = NULL, k, k_diss, k_range, spike = NULL, -#' method = local_fit_wapls(min_pls_c = 3, -#' max_pls_c = min(dim(Xr), 15)), +#' method = local_fit_wapls(min_pls_c = 3, max_pls_c = min(dim(Xr), 15)), #' diss_method = "pca", diss_usage = "predictors", -#' gh = TRUE, pc_selection = list(method = "opc", -#' value = min(dim(Xr), 40)), +#' gh = TRUE, pc_selection = list(method = "opc", value = min(dim(Xr), 40)), #' control = mbl_control(), group = NULL, -#' center = TRUE, scale = FALSE, documentation = character(), ...) +#' center = TRUE, scale = FALSE, verbose = TRUE, +#' documentation = character(), ...) #' #' @param Xr a matrix of predictor variables of the reference data #' (observations in rows and variables in columns). @@ -47,8 +43,11 @@ #' retained when the \code{k_diss} is given. #' @param spike an integer vector indicating the indices of observations in #' \code{Xr} that must be forced into the neighborhoods of every \code{Xu} -#' observation. Default is \code{NULL} (i.e. no observations are forced). -#' See details. +#' observation. Default is \code{NULL} (i.e. no observations are forced). Note +#' that this argument is not intended for increasing the neighborhood size which +#' is only controlled by \code{k} or \code{k_diss} and \code{k_range}. By +#' forcing observations into the neighborhood, some observations will be forced +#' out of the neighborhood. See details. #' @param method an object of class \code{\link{local_fit}} which indicates the #' type of regression to conduct at each local segment as well as additional #' parameters affecting this regression. See \code{\link{local_fit}} function. @@ -70,8 +69,8 @@ #' #' \item{\code{"pls"}}{ Mahalanobis distance #' computed on the matrix of scores of a partial least squares projection -#' of \code{Xr} and \code{Xu}. In this case, \code{Yr} is always required. -#' See \code{\link{ortho_diss}} function.} +#' of \code{Xr} and \code{Xu}. In this case, \code{Yr} is always +#' required. See \code{\link{ortho_diss}} function.} #' #' \item{\code{"cor"}}{ correlation coefficient #' between observations. See \code{\link{cor_diss}} function.} @@ -123,35 +122,38 @@ #' components) and \code{value} (a numerical value that complements the selected #' method). The methods available are: #' \itemize{ -#' \item{\code{"opc"}:} { optimized principal component selection based on -#' Ramirez-Lopez et al. (2013a, 2013b). 
The optimal number of components -#' (of set of observations) is the one for which its distance matrix -#' minimizes the differences between the \code{Yr} value of each -#' observation and the \code{Yr} value of its closest observation. In this -#' case \code{value} must be a value (larger than 0 and -#' below \code{min(nrow(Xr), nrow(X2), ncol(Xr))}) indicating the maximum +#' \item{\code{"opc"}:} { optimized principal component selection based +#' on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of +#' components (of set of observations) is the one for which its distance +#' matrix minimizes the differences between the \code{Yr} value of each +#' observation and the \code{Yr} value of its closest observation. In +#' this case \code{value} must be a value (larger than 0 and +#' below the minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} +#' combined) indicating the maximum #' number of principal components to be tested. See the #' \code{\link{ortho_projection}} function for more details.} #' #' \item{\code{"cumvar"}:}{ selection of the principal components based #' on a given cumulative amount of explained variance. In this case, #' \code{value} must be a value (larger than 0 and below or equal to 1) -#' indicating the minimum amount of cumulative variance that the +#' indicating the minimum amount of cumulative variance that the #' combination of retained components should explain.} #' #' \item{\code{"var"}:}{ selection of the principal components based #' on a given amount of explained variance. In this case, #' \code{value} must be a value (larger than 0 and below or equal to 1) -#' indicating the minimum amount of variance that a single component +#' indicating the minimum amount of variance that a single component #' should explain in order to be retained.} #' #' \item{\code{"manual"}:}{ for manually specifying a fixed number of #' principal components. In this case, \code{value} must be a value -#' (larger than 0 and \code{min(nrow(Xr), nrow(X2), ncol(Xr))}). +#' (larger than 0 and below the minimum dimension of \code{Xr} or +#' \code{Xr} and \code{Xu} combined) #' indicating the number of principal components to be retained.} #' } -#' The default list passed is \code{list(method = "opc", value = min(dim(Xr), 40))}. +#' The list +#' \code{list(method = "opc", value = min(dim(Xr), 40))} is the default. #' Optionally, the \code{pc_selection} argument admits \code{"opc"} or #' \code{"cumvar"} or \code{"var"} or \code{"manual"} as a single character #' string. In such a case the default \code{"value"} when either \code{"opc"} or @@ -162,8 +164,8 @@ #' that can be coerced to \code{\link[base]{factor}} by \code{as.factor}) that #' assigns a group/class label to each observation in \code{Xr} #' (e.g. groups can be given by spectra collected from the same batch of -#' measurements, from the same observation, from observations with very similar origin, -#' etc). This is taken into account for internal leave-group-out cross +#' measurements, from the same observation, from observations with very similar +#' origin, etc). This is taken into account for internal leave-group-out cross #' validation for pls tuning (factor optimization) to avoid pseudo-replication. #' When one observation is selected for cross-validation, all observations of #' the same group are removed together and assigned to validation. The length @@ -176,6 +178,9 @@ #' to unit variance at each local segment (before regression).
In addition, if #' \code{TRUE}, \code{Xr} and \code{Xu} will be scaled for dissimilarity #' computations. +#' @param verbose a logical indicating whether or not to print a progress bar +#' for each observation to be predicted. Default is \code{TRUE}. Note: In case +#' parallel processing is used, these progress bars will not be printed. #' @param documentation an optional character string that can be used to #' describe anything related to the \code{mbl} call (e.g. description of the #' input data). Default: \code{character()}. NOTE: this is an experimental @@ -186,7 +191,10 @@ #' @details #' The argument \code{spike} can be used to indicate what reference observations #' in \code{Xr} must be kept in the neighborhood of every single \code{Xu} -#' observation. Spiking might be useful in cases where +#' observation. If a vector of length \mjeqn{m}{m} is passed to this argument, +#' this means that the \mjeqn{m}{m} original neighbors with the largest +#' dissimilarities to the target observations will be forced out of the +#' neighborhood. Spiking might be useful in cases where #' some reference observations are known to be somehow related to the ones in #' \code{Xu} and therefore might be relevant for fitting the local models. See #' Guerrero et al. (2010) for an example on the benefits of spiking. @@ -207,14 +215,14 @@ #' treated as predictor variables). In some cases this results in an improvement #' of the prediction performance (Ramirez-Lopez et al., 2013a). #' If \code{diss_usage = "weights"}, the neighbors of the query point -#' (\mjeqn{xu_{j}}{xu_j}) are weighted according to their dissimilarity to \mjeqn{xu_{j}}{xu_j} -#' before carrying out each local regression. The following tricubic function -#' (Cleveland and Delvin, 1988; Naes et al., 1990) is used for computing the -#' final weights based on the measured dissimilarities: +#' (\mjeqn{xu_{j}}{xu_j}) are weighted according to their dissimilarity to +#' \mjeqn{xu_{j}}{xu_j} before carrying out each local regression. The following +#' tricubic function (Cleveland and Devlin, 1988; Naes et al., 1990) is used for +#' computing the final weights based on the measured dissimilarities: #' #' \mjdeqn{W_{j} = (1 - v^{3})^{3}}{W_j = (1 - v^3)^3} #' -#' where if \mjeqn{{xr_{i} \in}}{xr_i in} neighbors of \mjeqn{xu_{j}}{xu_j}: +#' where if \mjeqn{{xr_{i} \in }}{xr_i in} neighbors of \mjeqn{xu_{j}}{xu_j}: #' #' \mjdeqn{v_{j}(xu_{j}) = d(xr_{i}, xu_{j})}{v_j(xu_j) = d(xr_i, xu_j)} #' @@ -241,15 +249,15 @@ #' execution, prediction limits, etc, can be specified by using the #' \code{\link{mbl_control}} function. #' -#' By using the \code{group} argument one can specify groups of observations that -#' have something in common (e.g. observations with very similar origin). +#' By using the \code{group} argument one can specify groups of observations +#' that have something in common (e.g. observations with very similar origin). #' The purpose of \code{group} is to avoid biased cross-validation results due #' to pseudo-replication. This argument allows one to select calibration points #' that are independent from the validation ones. In this regard, when -#' \code{validation_type = "local_cv"} (used in \code{\link{mbl_control}} function), -#' then the \code{p} argument refers to the percentage of groups of observations -#' (rather than single observations) to be retained in each sampling iteration at -#' each local segment.
+#' \code{validation_type = "local_cv"} (used in \code{\link{mbl_control}} +#' function), then the \code{p} argument refers to the percentage of groups of +#' observations (rather than single observations) to be retained in each +#' sampling iteration at each local segment. #' #' @return a \code{list} of class \code{mbl} with the following components #' (sorted either by \code{k} or \code{k_diss}): @@ -273,11 +281,11 @@ #' as well as the results of the global pls projection object used to obtain #' the GH values.} #' \item{\code{validation_results}:}{ a list of validation results for -#' \code{local_cross_validation} (returned if the \code{validation_type} in \code{control} -#' list was set to \code{"local_cv"}), -#' \code{nearest_neighbor_validation} (returned if the \code{validation_type} in \code{control} -#' list was set to \code{"NNv"}) and -#' \code{Yu_prediction_statistics} (returned if \code{Yu} was supplied).} +#' "local cross validation" (returned if the \code{validation_type} in +#' \code{control} list was set to \code{"local_cv"}), +#' "nearest neighbor validation" (returned if the \code{validation_type} +#' in \code{control} list was set to \code{"NNv"}) and +#' "Yu prediction statistics" (returned if \code{Yu} was supplied).} #' \item{\code{results}:}{ a list of data tables containing the results of the #' predictions for each \code{k} or \code{k_diss}. Each data table #' contains the following columns:} @@ -289,9 +297,11 @@ #' \item{\code{k_original}:}{ This column is only output if the \code{k_diss} #' argument is used. It indicates the number of neighbors that were originally #' found when the given dissimilarity threshold is used.} -#' \item{\code{k}:}{ This column indicates the final number of neighbors used.} -#' \item{\code{npls}:}{ This column is only output if the \code{pls} regression -#' method was used. It indicates the final number of pls components used.} +#' \item{\code{k}:}{ This column indicates the final number of neighbors +#' used.} +#' \item{\code{npls}:}{ This column is only output if the \code{pls} +#' regression method was used. It indicates the final number of pls +#' components used.} #' \item{\code{min_pls}:}{ This column is only output if \code{wapls} #' regression method was used. It indicates the final number of minimum pls #' components used. If no optimization was set, it retrieves the original @@ -310,48 +320,52 @@ #' variable) in the neighborhood.} #' \item{\code{index_nearest_in_Xr}}{ The index of the nearest neighbor found #' in \code{Xr}.} -#' \item{\code{index_farthest_in_Xr}}{ The index of the farthest neighbor found -#' in \code{Xr}.} +#' \item{\code{index_farthest_in_Xr}}{ The index of the farthest neighbor +#' found in \code{Xr}.} #' \item{\code{y_nearest}:}{ The reference value (\code{Yr}) corresponding to #' the nearest neighbor found in \code{Xr}.} -#' \item{\code{y_nearest_pred}:}{ This column is only output if the validation -#' method in the object passed to \code{control} was set to \code{"NNv"}. -#' It represents the predicted value of the nearest neighbor observation found -#' in \code{Xr}. This prediction come from model fitted with the remaining -#' observations in the neighborhood of the target observation in \code{Xu}.} +#' \item{\code{y_nearest_pred}:}{ This column is only output if the +#' validation method in the object passed to \code{control} was set to +#' \code{"NNv"}. It represents the predicted value of the nearest neighbor +#' observation found in \code{Xr}.
This prediction comes from a model fitted +#' with the remaining observations in the neighborhood of the target +#' observation in \code{Xu}.} #' \item{\code{loc_rmse_cv}:}{ This column is only output if the validation #' method in the object passed to \code{control} was set to #' \code{'local_cv'}. It represents the RMSE of the cross-validation #' computed for the neighborhood of the target observation in \code{Xu}.} -#' \item{\code{loc_st_rmse_cv}:}{ This column is only output if the validation -#' method in the object passed to \code{control} was set to +#' \item{\code{loc_st_rmse_cv}:}{ This column is only output if the +#' validation method in the object passed to \code{control} was set to #' \code{'local_cv'}. It represents the standardized RMSE of the -#' cross-validation computed for the neighborhood of the target observation in -#' \code{Xu}.} +#' cross-validation computed for the neighborhood of the target observation +#' in \code{Xu}.} #' \item{\code{dist_nearest}:}{ The distance to the nearest neighbor.} #' \item{\code{dist_farthest}:}{ The distance to the farthest neighbor.} #' \item{\code{loc_n_components}:}{ This column is only output if the #' dissimilarity method used is one of \code{"pca"}, \code{"pca.nipals"} or #' \code{"pls"} and in addition the dissimilarities are requested to be -#' computed locally by passing \code{.local = TRUE} to the \code{mbl} function. +#' computed locally by passing \code{.local = TRUE} to the \code{mbl} +#' function. #' See \code{.local} argument in the \code{\link{ortho_diss}} function.} #' } -#' \item{\code{documentation}}{ A character string with the documentation added.} +#' \item{\code{documentation}}{ A character string with the documentation +#' added.} #' } #' When the \code{k_diss} argument is used, the printed results show a table #' with a column named \code{p_bounded}. It represents the percentage of #' observations for which the neighbors selected by the given dissimilarity #' threshold were outside the boundaries specified in the \code{k_range} #' argument. -#' @author \href{https://orcid.org/0000-0002-5369-5120}{Leonardo Ramirez-Lopez} and Antoine Stevens +#' @author \href{https://orcid.org/0000-0002-5369-5120}{Leonardo Ramirez-Lopez} +#' and Antoine Stevens #' @references #' Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: an #' approach to regression analysis by local fitting. Journal of the American #' Statistical Association, 83, 596-610. #' #' Guerrero, C., Zornoza, R., Gómez, I., Mataix-Beneyto, J. 2010. Spiking of -#' NIR regional models using observations from target sites: Effect of model size on -#' prediction accuracy. Geoderma, 158(1-2), 66-77. +#' NIR regional models using observations from target sites: Effect of model +#' size on prediction accuracy. Geoderma, 158(1-2), 66-77. #' #' Naes, T., Isaksson, T., Kowalski, B. 1990. Locally weighted regression and #' scatter correction for near-infrared reflectance data. Analytical Chemistry #' @@ -359,7 +373,8 @@ #' Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., #' Scholten, T. 2013a. The spectrum-based learner: A new local approach for -#' modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279. +#' modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, +#' 268-279. #' #' Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, #' J. A. M., Scholten, T. 2013b.
Distance and similarity-search metrics for @@ -376,29 +391,30 @@ #' \code{\link{cor_diss}}, \code{\link{sid}}, \code{\link{ortho_diss}}, #' \code{\link{search_neighbors}} #' @examples -#' \dontrun{ +#' \donttest{ #' library(prospectr) #' data(NIRsoil) -#' +#' #' # Preprocess the data using detrend plus first derivative with Savitzky and #' # Golay smoothing filter #' sg_det <- savitzkyGolay( #' detrend(NIRsoil$spc, -#' wav = as.numeric(colnames(NIRsoil$spc))), +#' wav = as.numeric(colnames(NIRsoil$spc)) +#' ), #' m = 1, #' p = 1, #' w = 7 #' ) -#' +#' #' NIRsoil$spc_pr <- sg_det -#' +#' #' # split into training and testing sets -#' test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC),] +#' test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ] #' test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)] -#' +#' #' train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)] -#' train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC),] -#' +#' train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ] +#' #' # Example 1 #' # An mbl implemented in Ramirez-Lopez et al. (2013, #' # the spectrum-based learner) #' # An example where Yu is supposed to be unknown, but the Xu #' # (spectral variables) are known #' my_control <- mbl_control(validation_type = "NNv") -#' +#' #' ## The neighborhood sizes to test #' ks <- seq(40, 140, by = 20) -#' +#' #' sbl <- mbl( #' Xr = train_x, #' Yr = train_y, #' Xu = test_x, #' k = ks, @@ -422,7 +438,7 @@ #' sbl #' plot(sbl) #' get_predictions(sbl) -#' +#' #' # Example 1.2 #' # If Yu is actually known... #' sbl_2 <- mbl( @@ -436,7 +452,7 @@ #' ) #' sbl_2 #' plot(sbl_2) -#' +#' #' # Example 2 #' # the LOCAL algorithm (Shenk et al., 1997) #' local_algorithm <- mbl( @@ -452,7 +468,7 @@ #' ) #' local_algorithm #' plot(local_algorithm) -#' +#' #' # Example 3 #' # A variation of the LOCAL algorithm (using the optimized pc #' # dissimilarity matrix) and dissimilarity matrix as source of @@ -470,14 +486,22 @@ #' ) #' local_algorithm_2 #' plot(local_algorithm_2) -#' +#' #' # Example 4 #' # Running the mbl function in parallel with example 2 -#' n_cores <- parallel::detectCores() - 1 -#' if (n_cores == 0) { -#' n_cores <- 1 +#' +#' n_cores <- 2 +#' +#' if (parallel::detectCores() < 2) { +#' n_cores <- 1 #' } + +#' # Alternatively: +#' # n_cores <- parallel::detectCores() - 1 +#' # if (n_cores == 0) { +#' # n_cores <- 1 +#' # } + #' library(doParallel) #' clust <- makeCluster(n_cores) #' registerDoParallel(clust) @@ -487,7 +511,7 @@ #' # clust <- makeCluster(n_cores, type = "SOCK") #' # registerDoSNOW(clust) #' # getDoParWorkers() -#' +#' #' local_algorithm_par <- mbl( #' Xr = train_x, #' Yr = train_y, #' Xu = test_x, #' k = ks, #' diss_method = "cor", #' diss_usage = "none", #' control = my_control #' ) #' local_algorithm_par -#' +#' #' registerDoSEQ() #' try(stopCluster(clust)) -#' +#' #' # Example 5 #' # Using local pls distances #' with_local_diss <- mbl( @@ -625,7 +649,6 @@ ## The error thrown is: ## Error in svd(x = X0) : infinite or missing values in 'x' - mbl <- function(Xr, Yr, Xu, Yu = NULL, k, k_diss, k_range, spike = NULL, @@ -644,31 +667,35 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, group = NULL, center = TRUE, scale = FALSE, + verbose = TRUE, documentation = character(), ...)
{ f_call <- match.call() - + "%mydo%" <- get("%do%") if (control$allow_parallel & getDoParRegistered()) { "%mydo%" <- get("%dopar%") } + if (!is.logical(verbose)) { + stop("'verbose' must be logical") + } if (missing(k)) { k <- NULL } - + if (missing(k_diss)) { k_diss <- NULL } - + if (missing(k_range)) { k_range <- NULL } - + input_dots <- list(...) ini_cntrl <- control ortho_diss_methods <- c("pca", "pca.nipals", "pls") - + if (".local" %in% names(input_dots)) { if (isTRUE(input_dots$.local)) { if (!"pre_k" %in% names(input_dots)) { @@ -681,60 +708,60 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, } } } - + # Sanity checks if (!is.logical(center)) { stop("'center' argument must be logical") } - + if (!is.logical(scale)) { stop("'scale' argument must be logical") } - + if (ncol(Xr) != ncol(Xu)) { stop("The number of predictor variables in Xr must be equal to the number of variables in Xu") } - + if (ncol(Xr) < 4) { stop("This function works only with matrices containing more than 3 predictor variables") } - + if (length(Yr) != nrow(Xr)) { stop("length(Yr) must be equal to nrow(Xr)") } - + if (any(is.na(Yr))) { stop("The current version of the mbl function does not handle NAs in the response variable of the reference observations (Yr)") } - + Xr <- as.matrix(Xr) Xu <- as.matrix(Xu) Yr <- as.matrix(Yr) - + n_xr <- nrow(Xr) n_xu <- nrow(Xu) n_total <- n_xr + n_xu - + rownames(Xr) <- 1:nrow(Xr) rownames(Xu) <- 1:nrow(Xu) - + if (is.null(colnames(Xr))) { colnames(Xr) <- 1:ncol(Xr) } - + if (is.null(colnames(Xu))) { colnames(Xu) <- 1:ncol(Xu) } - + if (sum(!colnames(Xu) == colnames(Xr)) != 0) { stop("Variable names in Xr do not match those in Xu") } - + diss_methods <- c( "pca", "pca.nipals", "pls", "cor", "euclid", "cosine", "sid" ) - + if (!is.character(diss_method) & !is.matrix(diss_method)) { mtds <- paste(diss_methods, collapse = ", ") stop(paste0( @@ -743,7 +770,7 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, " or a matrix" )) } - + if (!is.null(group)) { if (length(group) != nrow(Xr)) { stop(paste0( @@ -752,16 +779,16 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, )) } } - + if (length(pc_selection) != 2 | !is.list(pc_selection)) { stop("'pc_selection' must be a list of length 2") } - + if (!all(names(pc_selection) %in% c("method", "value")) | is.null(names(pc_selection))) { names(pc_selection)[sapply(pc_selection, FUN = is.character)] <- "method" names(pc_selection)[sapply(pc_selection, FUN = is.numeric)] <- "value" } - + pc_sel_method <- match.arg(pc_selection$method, c( "opc", "var", @@ -769,7 +796,7 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, "manual" )) pc_threshold <- pc_selection$value - + if (pc_sel_method %in% c("opc", "manual") & pc_selection$value > min(n_total, ncol(Xr))) { warning(paste0( "When pc_selection$method is 'opc' or 'manual', the value ", @@ -781,14 +808,14 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, )) pc_threshold <- min(n_total, ncol(Xr)) } - - + + match.arg(diss_usage, c("predictors", "weights", "none")) - + if (is.null(k) & is.null(k_diss)) { stop("Either k or k_diss must be specified") } - + k_max <- NULL if (!is.null(k)) { if (!is.null(k_diss)) { @@ -802,7 +829,7 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, k <- sort(k) k_max <- max(k) } - + k_diss_max <- NULL if (!is.null(k_diss)) { k_diss <- unique(sort(k_diss)) @@ -823,7 +850,7 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, } k_diss_max <- max(k_diss) } - + if (".local" %in% names(input_dots)) { if (isTRUE(input_dots$local)) { if (!"pre_k" %in% names(input_dots)) { @@ -840,33 +867,33 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, } } } 
- + if (!"local_fit" %in% class(method)) { stop("Object passed to method must be of class local_fit") } - + validation_type <- control$validation_type is_local_cv <- "local_cv" %in% validation_type is_nnv_val <- "NNv" %in% validation_type - + if (all(c("local_cv", "NNv") %in% control$validation_type)) { validation_type <- "both" } - + if (validation_type %in% c("NNv", "both") & nrow(Xu) < 3) { stop(paste0( "For nearest neighbor validation (control$validation_type == 'NNv')", " Xu must contain at least 3 observations" )) } - + if (!is.null(Yu)) { Yu <- as.matrix(Yu) if (length(Yu) != nrow(Xu)) { stop("Number of observations in Yu and Xu differ") } } - + if (!is.null(k)) { k <- as.integer(k) if (min(k) < 4) { @@ -879,9 +906,9 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, )) } } - + has_projection <- FALSE - + if (!is.matrix(diss_method)) { # when .local = TRUE, k_max is replaced with k_pre inside get_neighbor_info() neighborhoods <- get_neighbor_info( @@ -903,7 +930,7 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, } } else { diss_xr_xr <- NULL - + dim_diss <- dim(diss_method) if (diss_usage == "predictors") { if (diff(dim_diss) != 0 | dim_diss[1] != n_total | any(diag(diss_method) != 0)) { @@ -933,12 +960,12 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, append( neighborhoods, diss_to_neighbors(diss_xr_xu, - k = k, k_diss = k_diss, k_range = k_range, - spike = NULL, - return_dissimilarity = control$return_dissimilarity + k = k, k_diss = k_diss, k_range = k_range, + spike = NULL, + return_dissimilarity = control$return_dissimilarity ) ) - + if (gh) { neighborhoods <- NULL neighborhoods$gh$projection <- pls_projection( @@ -948,26 +975,26 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, scale = scale, ... ) neighborhoods$gh$gh_Xr <- f_diss(neighborhoods$gh$projection$scores, - Xu = t(colMeans(neighborhoods$gh$projection$scores)), - diss_method = "mahalanobis", - center = FALSE, scale = FALSE + Xu = t(colMeans(neighborhoods$gh$projection$scores)), + diss_method = "mahalanobis", + center = FALSE, scale = FALSE ) neighborhoods$gh$gh_Xu <- neighborhoods$gh$gh_Xr[-c(1:nrow(Xr))] neighborhoods$gh$gh_Xr <- neighborhoods$gh$gh_Xr[c(1:nrow(Xr))] neighborhoods$gh <- neighborhoods$gh[c("gh_Xr", "gh_Xu", "projection")] } - + neighborhoods$diss_xr_xr <- diss_xr_xr rm(diss_xr_xr) rm(diss_method) gc() } - + if (!is.null(k)) { smallest_neighborhood <- neighborhoods$neighbors[1:min(k), , drop = FALSE] smallest_n_neighbors <- colSums(!is.na(smallest_neighborhood)) } - + if (!is.null(k_diss)) { min_diss <- neighborhoods$neighbors_diss <= min(k_diss) if (!is.null(spike)) { @@ -979,8 +1006,8 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, smallest_n_neighbors[smallest_n_neighbors < min(k_range)] <- min(k_range) smallest_n_neighbors[smallest_n_neighbors > max(k_range)] <- max(k_range) } - - + + if (is_local_cv) { min_n_samples <- floor(min(smallest_n_neighbors) * control$p) - 1 min_cv_samples <- floor(min(k, k_range) * (1 - control$p)) @@ -994,7 +1021,7 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, } else { min_n_samples <- smallest_n_neighbors - 1 } - + if (method$method %in% c("pls", "wapls")) { max_pls <- max(method$pls_c) if (any(min_n_samples < max_pls)) { @@ -1006,8 +1033,8 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, )) } } - - + + if (!".local" %in% names(input_dots)) { iter_neighborhoods <- ith_mbl_neighbor( Xr = Xr, Xu = Xu, Yr = Yr, Yu = Yu, @@ -1026,7 +1053,7 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, group = group ) } - + r_fields <- c( "o_index", "k_diss", "k_original", "k", "npls", "min_pls", "max_pls", "yu_obs", "pred", "yr_min_obs", 
"yr_max_obs", @@ -1035,15 +1062,15 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, "y_farthest", "diss_nearest", "diss_farthest", "loc_rmse_cv", "loc_st_rmse_cv", "loc_n_components", "rep" ) - + n_ith_result <- ifelse(is.null(k_diss), length(k), length(k_diss)) - + template_pred_results <- data.table(matrix(NA, n_ith_result, length(r_fields), - dimnames = list(NULL, r_fields) + dimnames = list(NULL, r_fields) )) - + template_pred_results$rep[1] <- 0 - + if (!is.null(k_diss)) { template_pred_results$k_diss <- k_diss } else { @@ -1054,9 +1081,11 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, to_erase <- pg_bar_width + (2 * nchar(nrow(Xu))) + 8 to_erase <- paste(rep(" ", to_erase), collapse = "") - cat("\033[32m\033[3mPredicting...\n\033[23m\033[39m") + if (verbose){ + cat("\033[32m\033[3mPredicting...\n\033[23m\033[39m") + } n_iter <- nrow(Xu) - + pred_obs <- foreach( i = 1:n_iter, ith_observation = iter_neighborhoods, @@ -1087,17 +1116,17 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, ith_group = ith_observation$ith_group, ... ) - + ith_pred_results$loc_n_components[] <- ith_observation$ith_components additional_results$ith_neig_indices <- ith_observation$ith_neig_indices additional_results$ith_neigh_diss <- ith_observation$ith_neigh_diss } - - if (control$progress) { + + if (verbose) { cat(paste0("\033[34m\033[3m", i, "/", n_iter, "\033[23m\033[39m")) pb <- txtProgressBar(width = pg_bar_width, char = "\033[34m_\033[39m") } - + if (!is.null(k_diss)) { ith_diss <- ith_observation$ith_neigh_diss if (!is.null(spike)) { @@ -1110,12 +1139,12 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, } else { ith_pred_results$k <- k } - + for (kk in 1:nrow(ith_pred_results)) { - if (control$progress) { + if (verbose) { setTxtProgressBar(pb, kk / nrow(ith_pred_results)) } - + # If the sample has not been predicted before, # then create a model and predict it (useful only when k_diss is used) current_k <- ith_pred_results$k[kk] @@ -1144,13 +1173,13 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, ith_pred_results$y_nearest[kk] <- i_k_yr[which.min(kth_diss)] ith_pred_results$index_nearest_in_Xr[kk] <- ith_observation$ith_neig_indices[which.min(kth_diss)] ith_pred_results$index_farthest_in_Xr[kk] <- ith_observation$ith_neig_indices[which.max(kth_diss)] - + if (!is.null(group)) { i_k_group <- factor(ith_observation$ith_group[1:current_k]) } else { i_k_group <- NULL } - + if (diss_usage == "weights") { # Weights are defined according to a tricubic function # as in Cleveland and Devlin (1988) and Naes and Isaksson (1990). 
@@ -1160,7 +1189,7 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, } else { kth_weights <- rep(1, current_k) } - + # local fit i_k_pred <- fit_and_predict( x = i_k_xr, @@ -1180,9 +1209,9 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, pls_max_iter = 1, pls_tol = 1e-6 ) - + ith_pred_results$pred[kk] <- i_k_pred$prediction - + selected_pls <- NULL if (is_local_cv) { if (control$tune_locally) { @@ -1190,7 +1219,7 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, } else { best_row <- ifelse(method$method == "pls", method$pls_c, 1) } - + if (method$method == "pls") { ith_pred_results$npls[kk] <- i_k_pred$validation$cv_results$npls[best_row] selected_pls <- ith_pred_results$npls[kk] @@ -1200,7 +1229,7 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, ith_pred_results$max_pls[kk] <- i_k_pred$validation$cv_results$max_component[best_row] selected_pls <- i_k_pred$validation$cv_results[best_row, 1:2] } - + ith_pred_results$loc_rmse_cv[kk] <- i_k_pred$validation$cv_results$rmse_cv[best_row] ith_pred_results$loc_st_rmse_cv[kk] <- i_k_pred$validation$cv_results$st_rmse_cv[best_row] } else { @@ -1214,14 +1243,14 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, selected_pls <- method$pls_c } } - + if (is_nnv_val) { if (!is.null(group)) { out_group <- which(i_k_group == i_k_group[[ith_observation$local_index_nearest]]) } else { out_group <- ith_observation$local_index_nearest } - + nearest_pred <- fit_and_predict( x = i_k_xr[-out_group, ], y = i_k_yr[-out_group, , drop = FALSE], @@ -1236,7 +1265,7 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, pls_max_iter = 1, pls_tol = 1e-6 )$prediction - + ith_pred_results$y_nearest_pred[kk] <- nearest_pred / kth_weights[1] } } else { @@ -1246,12 +1275,12 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, ith_pred_results$k_diss[kk] <- ith_k_diss } } - - if (control$progress) { + + if (verbose) { if (kk == nrow(ith_pred_results) & i != n_iter) { cat("\r", to_erase, "\r") } - + if (i == n_iter) { cat("\n") } @@ -1263,27 +1292,29 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, additional_results = additional_results ) } - + iteration_order <- sapply(pred_obs, - FUN = function(x) x$results$o_index[1]) - + FUN = function(x) x$results$o_index[1] + ) + pred_obs <- pred_obs[order(iteration_order, decreasing = FALSE)] - + results_table <- do.call("rbind", lapply(pred_obs, - FUN = function(x) x$results)) - + FUN = function(x) x$results + )) + if (".local" %in% names(input_dots) & diss_method %in% ortho_diss_methods) { diss_xr_xu <- do.call( "cbind", lapply(iteration_order, - FUN = function(x, m, ii) { - idc <- x[[ii]]$additional_results$ith_neig_indices - d <- x[[ii]]$additional_results$ith_neigh_diss - m[idc] <- d - m - }, - x = pred_obs, - m = matrix(NA, nrow(Xr), 1) + FUN = function(x, m, ii) { + idc <- x[[ii]]$additional_results$ith_neig_indices + d <- x[[ii]]$additional_results$ith_neigh_diss + m[idc] <- d + m + }, + x = pred_obs, + m = matrix(NA, nrow(Xr), 1) ) ) class(diss_xr_xu) <- c("local_ortho_diss", "matrix") @@ -1291,20 +1322,20 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, paste0("Xr_", 1:nrow(diss_xr_xu)), paste0("Xu_", 1:ncol(diss_xr_xu)) ) - + neighborhoods$neighbors <- do.call( "cbind", lapply(iteration_order, - FUN = function(x, m, ii) { - idc <- x[[ii]]$additional_results$ith_neig_indices - m[1:length(idc)] <- idc - m - }, - x = pred_obs, - m = matrix(NA, max(results_table$k), 1) + FUN = function(x, m, ii) { + idc <- x[[ii]]$additional_results$ith_neig_indices + m[1:length(idc)] <- idc + m + }, + x = pred_obs, + m = matrix(NA, max(results_table$k), 1) ) ) } - + out <- c( if (is.null(Yu)) { "yu_obs" @@ -1326,21 
+1357,21 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, }, "rep" ) - + results_table[, (out) := NULL] if (!is.null(k_diss)) { param <- "k_diss" results_table <- lapply(get(param), - FUN = function(x, sel, i) x[x[[sel]] == i, ], - x = results_table, - sel = param + FUN = function(x, sel, i) x[x[[sel]] == i, ], + x = results_table, + sel = param ) names(results_table) <- paste0("k_diss_", k_diss) p_bounded <- sapply(results_table, - FUN = function(x, k_range) { - sum(x$k_original <= k_range[1] | x$k_original >= k_range[2]) - }, - k_range = k_range + FUN = function(x, k_range) { + sum(x$k_original <= k_range[1] | x$k_original >= k_range[2]) + }, + k_range = k_range ) col_ks <- data.table( k_diss = k_diss, @@ -1349,14 +1380,14 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, } else { param <- "k" results_table <- lapply(get(param), - FUN = function(x, sel, i) x[x[[sel]] == i, ], - x = results_table, - sel = param + FUN = function(x, sel, i) x[x[[sel]] == i, ], + x = results_table, + sel = param ) names(results_table) <- paste0("k_", k) col_ks <- data.table(k = k) } - + if (validation_type %in% c("NNv", "both")) { nn_stats <- function(x) { nn_rmse <- (mean((x$y_nearest - x$y_nearest_pred)^2))^0.5 @@ -1364,17 +1395,17 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, nn_rsq <- (cor(x$y_nearest, x$y_nearest_pred))^2 c(nn_rmse = nn_rmse, nn_st_rmse = nn_st_rmse, nn_rsq = nn_rsq) } - + loc_nn_res <- do.call("rbind", lapply(results_table, FUN = nn_stats)) loc_nn_res <- cbind(col_ks, - rmse = loc_nn_res[, "nn_rmse"], - st_rmse = loc_nn_res[, "nn_st_rmse"], - r2 = loc_nn_res[, "nn_rsq"] + rmse = loc_nn_res[, "nn_rmse"], + st_rmse = loc_nn_res[, "nn_st_rmse"], + r2 = loc_nn_res[, "nn_rsq"] ) } else { loc_nn_res <- NULL } - + if (validation_type %in% c("local_cv", "both")) { mean_loc_res <- function(x) { mean_loc_rmse <- mean(x$loc_rmse_cv) @@ -1383,13 +1414,13 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, } loc_res <- do.call("rbind", lapply(results_table, mean_loc_res)) loc_res <- cbind(col_ks, - rmse = loc_res[, "loc_rmse"], - st_rmse = loc_res[, "loc_st_rmse"] + rmse = loc_res[, "loc_rmse"], + st_rmse = loc_res[, "loc_st_rmse"] ) } else { loc_res <- NULL } - + if (!is.null(Yu)) { for (i in 1:length(results_table)) { results_table[[i]]$yu_obs <- Yu @@ -1402,18 +1433,18 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, } pred_res <- do.call("rbind", lapply(results_table, yu_stats)) pred_res <- cbind(col_ks, - rmse = pred_res[, "yu_rmse"], - st_rmse = pred_res[, "yu_st_rmse"], - r2 = pred_res[, "yu_rsq"] + rmse = pred_res[, "yu_rmse"], + st_rmse = pred_res[, "yu_st_rmse"], + r2 = pred_res[, "yu_rsq"] ) } else { pred_res <- NULL } - + if ("local_ortho_diss" %in% class(diss_xr_xu)) { diss_method <- paste0(diss_method, " (locally computed)") } - + if (control$return_dissimilarity) { diss_list <- list( diss_method = diss_method, @@ -1425,7 +1456,7 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, } else { diss_list <- NULL } - + colnames(neighborhoods$neighbors) <- paste0("Xu_", 1:nrow(Xu)) rownames(neighborhoods$neighbors) <- paste0("k_", 1:nrow(neighborhoods$neighbors)) results_list <- list( @@ -1446,9 +1477,9 @@ mbl <- function(Xr, Yr, Xu, Yu = NULL, results = results_table, documentation = documentation ) - + attr(results_list, "call") <- f_call class(results_list) <- c("mbl", "list") - + results_list } diff --git a/R/mbl_control.R b/R/mbl_control.R index 968d3a6..e50ccad 100644 --- a/R/mbl_control.R +++ b/R/mbl_control.R @@ -1,9 +1,6 @@ #' @title A function that controls some few aspects of the memory-based learning #' process in the 
\code{mbl} function #' @description -#' -#' \lifecycle{maturing} -#' #' \loadmathjax #' This function is used to further control some aspects of the memory-based #' learning process in the \code{mbl} function. @@ -14,7 +11,6 @@ #' number = 10, #' p = 0.75, #' range_prediction_limits = TRUE, -#' progress = TRUE, #' allow_parallel = TRUE) #' @param return_dissimilarity a logical indicating if the dissimilarity matrix #' between \code{Xr} and \code{Xu} must be returned. @@ -31,7 +27,7 @@ #' @param number an integer indicating the number of sampling iterations at #' each local segment when \code{"local_cv"} is selected in the #' \code{validation_type} argument. Default is 10. -#' @param p a numeric value indicating the percentage of calibration observations +#' @param p a numeric value indicating the percentage of calibration observations #' to be retained at each sampling iteration at each local segment when \code{"local_cv"} #' is selected in the \code{validation_type} argument. Default is 0.75 (i.e. 75 "\%"). #' @param range_prediction_limits a logical. It indicates whether the prediction @@ -40,14 +36,11 @@ #' this range, it will be automatically replaced with the value of the nearest #' range value. If \code{FALSE}, no prediction limits are imposed. #' Default is \code{TRUE}. -#' @param progress a logical indicating whether or not to print a progress bar -#' for each observation to be predicted. Default is \code{TRUE}. Note: In case -#' parallel processing is used, these progress bars will not be printed. #' @param allow_parallel a logical indicating if parallel execution is allowed. #' If \code{TRUE}, this parallelism is applied to the loop in \code{\link{mbl}} #' in which each iteration takes care of a single observation in \code{Xu}. The #' parallelization of this for loop is implemented using the -#' \link[foreach]{foreach} function of the \code{\link{foreach}} package. +#' \code{\link[foreach]{foreach}} function of the package \code{foreach}. #' Default is \code{TRUE}. #' @details #' The validation methods available for assessing the predictive performance of @@ -64,20 +57,20 @@ #' \item{Local leave-group-out cross-validation (\code{"local_cv"}):}{ The #' group of neighbors of each observation to be predicted is partitioned into #' different equal size subsets. Each partition is selected based on a -#' stratified random sampling that uses the the distribution of -#' the response variable in the corresponding set of neighbors. When -#' \code{p} \mjeqn{>=}{\geqslant} 0.5 (i.e. the number of calibration -#' observations to retain is larger than 50% of the total samples in the neighborhood), -#' the sampling is conducted for selecting the validation samples, and when -#' \code{p} < 0.5 the sampling is conducted for selecting the calibration -#' samples (samples used for model fitting). The model fitted with the selected -#' calibration samples is used to predict the response values of the local -#' validation samples and the local root mean square error is computed. -#' This process is repeated \mjeqn{m}{m} times and the final local -#' error is computed as the average of the local root mean square errors -#' obtained for all the \mjeqn{m}{m} iterations. In the \code{mbl_control} function -#' \mjeqn{m}{m} is controlled by the \code{number} argument and the size of the -#' subsets is controlled by the \code{p} argument which indicates the +#' stratified random sampling that uses the distribution of +#' the response variable in the corresponding set of neighbors.
When +#' \code{p} \mjeqn{>=}{\geqslant} 0.5 (i.e. the number of calibration +#' observations to retain is larger than 50% of the total samples in the neighborhood), +#' the sampling is conducted for selecting the validation samples, and when +#' \code{p} < 0.5 the sampling is conducted for selecting the calibration +#' samples (samples used for model fitting). The model fitted with the selected +#' calibration samples is used to predict the response values of the local +#' validation samples and the local root mean square error is computed. +#' This process is repeated \mjeqn{m}{m} times and the final local +#' error is computed as the average of the local root mean square errors +#' obtained for all the \mjeqn{m}{m} iterations. In the \code{mbl_control} function +#' \mjeqn{m}{m} is controlled by the \code{number} argument and the size of the +#' subsets is controlled by the \code{p} argument which indicates the #' percentage of observations to be selected from the subset of nearest neighbors. #' The global error of the predictions is computed as the average of the local #' root mean square errors.} @@ -139,7 +132,6 @@ mbl_control <- function(return_dissimilarity = FALSE, number = 10, p = 0.75, range_prediction_limits = TRUE, - progress = TRUE, allow_parallel = TRUE) { # Sanity checks if (!is.logical(allow_parallel)) { @@ -174,9 +166,6 @@ stop("'range_prediction_limits' must be logical") } - if (!is.logical(progress)) { - stop("'progress' must be logical") - } cntrl <- list( return_dissimilarity = return_dissimilarity, validation_type = validation_type, @@ -184,7 +173,6 @@ number = number, p = p, range_prediction_limits = range_prediction_limits, - progress = progress, allow_parallel = allow_parallel ) diff --git a/R/ortho_diss.R b/R/ortho_diss.R index 1cadb47..92641b7 100644 --- a/R/ortho_diss.R +++ b/R/ortho_diss.R @@ -1,9 +1,6 @@ #' @title A function for computing dissimilarity matrices from orthogonal #' projections (ortho_diss) #' @description -#' -#' \lifecycle{maturing} -#' #' \loadmathjax #' This function computes dissimilarities (in an orthogonal space) between #' either observations in a given set or between observations in two different #' sets. @@ -22,8 +19,8 @@ #' compute_all = FALSE, #' return_projection = FALSE, #' allow_parallel = TRUE, ...) -#' @param Xr a matrix containing \code{n} reference observations/rows and -#' \code{p} variables/columns. +#' @param Xr a matrix containing \code{n} reference observations/rows and +#' \code{p} variables/columns. #' @param Xu an optional matrix containing data of a second set of observations #' with \code{p} variables/columns. #' @param Yr a matrix of \code{n} rows and one or more columns (variables) with @@ -61,22 +58,25 @@ #' \item{\code{"cumvar"}:}{ selection of the principal components based #' on a given cumulative amount of explained variance. In this case, #' \code{value} must be a value (larger than 0 and below or equal to 1) -#' indicating the minimum amount of cumulative variance that the +#' indicating the minimum amount of cumulative variance that the #' combination of retained components should explain.} #' #' \item{\code{"var"}:}{ selection of the principal components based #' on a given amount of explained variance.
In this case, #' \code{value} must be a value (larger than 0 and below or equal to 1) -#' indicating the minimum amount of variance that a single component +#' indicating the minimum amount of variance that a single component #' should explain in order to be retained.} #' #' \item{\code{"manual"}:}{ for manually specifying a fixed number of #' principal components. In this case, \code{value} must be a value -#' (larger than 0 and below \code{min(nrow(Xr)} \code{+ nrow(Xu),} \code{ncol(Xr))}). +#' (larger than 0 and +#' below the minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} +#' combined) #' indicating the number of components to be retained.} #' } -#' The default list passed is \code{list(method = "var", value = 0.01)}. +#' Default is \code{list(method = "var", value = 0.01)}. +#' #' Optionally, the \code{pc_selection} argument admits \code{"opc"} or #' \code{"cumvar"} or \code{"var"} or \code{"manual"} as a single character #' string. In such a case, the default \code{"value"} when either \code{"opc"} or @@ -147,10 +147,10 @@ #' not have enough observations with non-missing \code{"Yr"} values, which might retrieve #' unreliable dissimilarity computations. #' -#' If \code{.local = TRUE} and \code{pc_selection$method} is \code{"opc"} or -#' \code{"manual"}, the minimum number of observations with non-missing \code{"Yr"} -#' values at each neighborhood is determined by \code{pc_selection$value} -#' (i.e. the maximum number of components to compute). +#' If \code{"opc"} or \code{"manual"} are used in \code{pc_selection$method} +#' and \code{.local = TRUE}, the minimum number of observations with non-missing +#' \code{"Yr"} values at each neighborhood is determined by +#' \code{pc_selection$value} (i.e. the maximum number of components to compute). #' #' #' @return a \code{list} of class \code{ortho_diss} with the following elements: @@ -184,7 +184,6 @@ #' with soil vis-NIR spectra. Geoderma 199, 43-53. #' @seealso \code{\link{ortho_projection}}, \code{\link{sim_eval}} #' @examples -#' \dontrun{ #' library(prospectr) #' data(NIRsoil) #' @@ -220,7 +219,6 @@ #' pc_selection = list("opc", 40), #' diss_method = "pls" #' ) -#' } #' @export ####################################################################### diff --git a/R/ortho_diss_helpers.R b/R/ortho_diss_helpers.R index edebf2b..1fb1e45 100644 --- a/R/ortho_diss_helpers.R +++ b/R/ortho_diss_helpers.R @@ -180,7 +180,7 @@ format_xr_xu_indices <- function(xr_xu_names) { xu_mss <- paste0(c( - if_else(length(xu_insufficient) > 0, + ifelse(length(xu_insufficient) > 0, "\nXu: ", "" ), paste(xu_insufficient, collapse = ", ") @@ -188,7 +188,7 @@ format_xr_xu_indices <- function(xr_xu_names) { collapse = " " ) xr_mss <- paste0(c( - if_else(length(xr_insufficient) > 0, + ifelse(length(xr_insufficient) > 0, "\nXr: ", "" ), paste(xr_insufficient, collapse = ", ") diff --git a/R/ortho_projection.R b/R/ortho_projection.R index 94d66f8..998935c 100644 --- a/R/ortho_projection.R +++ b/R/ortho_projection.R @@ -40,10 +40,14 @@ #' and their corresponding most similar observations in terms of the side information #' provided. A single discrete variable of class factor can also be passed. In #' that case, the kappa index is used. See \code{\link{sim_eval}} function for more details. -#' @param method the method for projecting the data.
Options are: "pca" (principal -#' component analysis using the singular value decomposition algorithm), -#' "pca.nipals" (principal component analysis using the non-linear iterative -#' partial least squares algorithm) and "pls" (partial least squares). +#' @param method the method for projecting the data. Options are: +#' \itemize{ +#' \item{\code{"pca"}:}{ principal component analysis using the singular value +#' decomposition algorithm.} +#' \item{\code{"pca.nipals"}:}{ principal component analysis using the +#' non-linear iterative partial least squares algorithm.} +#' \item{\code{"pls"}:}{ partial least squares.} +#' } #' @param pc_selection a list of length 2 which specifies the method to be used #' for optimizing the number of components (principal components or pls factors) #' to be retained. This list must contain two elements (in the following order): @@ -57,13 +61,13 @@ #' minimizes the differences between the \code{Yr} value of each #' observation and the \code{Yr} value of its closest observation. In this #' case \code{value} must be a value (larger than 0 and -#' below \code{min(nrow(Xr)} \code{+ nrow(Xu),} \code{ncol(Xr))} indicating +#' below \code{min(nrow(Xr)} \code{+ nrow(Xu),} \code{ncol(Xr))} indicating #' the maximum number of principal components to be tested. See details.} #' #' \item{\code{"cumvar"}:}{ selection of the principal components based #' on a given cumulative amount of explained variance. In this case, #' \code{value} must be a value (larger than 0 and below or equal to 1) -#' indicating the minimum amount of cumulative variance that the +#' indicating the minimum amount of cumulative variance that the #' combination of retained components should explain.} #' #' \item{\code{"var"}:}{ selection of the principal components based @@ -74,11 +78,13 @@ #' #' \item{\code{"manual"}:}{ for manually specifying a fix number of #' principal components. In this case, \code{value} must be a value -#' (larger than 0 and below \code{min(nrow(Xr)} \code{+ nrow(Xu),} \code{ncol(Xr))}). +#' (larger than 0 and +#' below the minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} +#' combined). #' indicating the minimum amount of variance that a component should #' explain in order to be retained.} #' } -#' The default list passed is \code{list(method = "var", value = 0.01)}. +#' The list \code{list(method = "var", value = 0.01)} is the default. #' Optionally, the \code{pc_selection} argument admits \code{"opc"} or #' \code{"cumvar"} or \code{"var"} or \code{"manual"} as a single character #' string. In such a case the default \code{"value"} when either \code{"opc"} or @@ -99,19 +105,18 @@ #' @param max_iter maximum number of iterations (default is 1000). In the case of #' \code{method = "pls"} this applies only to \code{Yr} matrices with more than #' one variable. -#' @param ... additional arguments to be passed from \code{ortho_projection} +#' @param ... additional arguments to be passed #' to \code{pc_projection} or \code{pls_projection}. -#' @param object object of class "ortho_projection" (as returned by -#' \code{ortho_projection}, \code{pc_projection} or \code{pls_projection}). +#' @param object object of class \code{"ortho_projection"}. #' @param newdata an optional data frame or matrix in which to look for variables #' with which to predict. If omitted, the scores are used. It must contain the #' same number of columns, to be used in the same order. 
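# Editor's sketch (not part of the diff): the object/newdata arguments above
# are consumed by the predict method; my_spectra and my_new_spectra are
# placeholder numeric matrices with identical columns:
#   proj <- ortho_projection(Xr = my_spectra, pc_selection = list("manual", 5))
#   new_scores <- predict(proj, newdata = my_new_spectra)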
#' @details #' In the case of \code{method = "pca"}, the algorithm used is the singular value #' decomposition in which a given data matrix (\mjeqn{X}{X}) is factorized as follows: -#' +#' #' \mjdeqn{X = UDV^{T}}{X = UDV^{\mathrm{T}}} -#' +#' #' where \mjeqn{U}{U} and \mjeqn{V}{V} are orthogonal matrices, being the left and right #' singular vectors of \mjeqn{X}{X} respectively, \mjeqn{D}{D} is a diagonal matrix #' containing the singular values of \mjeqn{X}{X} and \mjeqn{V}{V} is a matrix of @@ -142,10 +147,8 @@ #' Note that for the \code{"opc"} method \code{Yr} is required (i.e. the #' side information of the observations). #' -#' This function supports multi-threading for the computation of dissimilarities -#' via OpenMP in Rcpp. -#' @return \code{ortho_projection}, \code{pc_projection}, \code{pls_projection}, -#' return a \code{list} of class \code{ortho_projection} with the following +#' @return +#' a \code{list} of class \code{ortho_projection} with the following #' components: #' \itemize{ #' \item{\code{scores}}{ a matrix of scores corresponding to the observations in @@ -164,8 +167,8 @@ #' onto a "pls" space. This object is only returned if the "pls" algorithm was #' used.} #' \item{\code{variance}}{ a matrix indicating the standard deviation of each -#' component (sd), the variance explained by each single component -#' (explained_var) and the cumulative explained variance +#' component (sd), the variance explained by each single component +#' (explained_var) and the cumulative explained variance #' (cumulative_explained_var). These values are #' computed based on the data used to create the projection matrices. #' For example if the "pls" method was used, then these values are computed @@ -205,27 +208,28 @@ #' with soil vis-NIR spectra. Geoderma 199, 43-53.
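# Editor's sketch (not part of the diff): the factorization X = U D V^T in the
# details above maps directly to base R's svd(); for a column-centered matrix
# X, the projected scores are X V = U D:
#   sv <- svd(scale(X, center = TRUE, scale = FALSE))
#   scores <- sv$u %*% diag(sv$d)  # equal to the centered X %*% sv$v
# X here is a placeholder numeric matrix.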
#' @seealso \code{\link{ortho_diss}}, \code{\link{sim_eval}}, \code{\link{mbl}} #' @examples -#' \dontrun{ +#' \donttest{ #' library(prospectr) #' data(NIRsoil) -#' +#' #' # Proprocess the data using detrend plus first derivative with Savitzky and #' # Golay smoothing filter #' sg_det <- savitzkyGolay( #' detrend(NIRsoil$spc, -#' wav = as.numeric(colnames(NIRsoil$spc))), +#' wav = as.numeric(colnames(NIRsoil$spc)) +#' ), #' m = 1, #' p = 1, #' w = 7 #' ) #' NIRsoil$spc_pr <- sg_det -#' +#' #' # split into training and testing sets -#' test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC),] +#' test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ] #' test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)] -#' +#' #' train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)] -#' train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC),] +#' train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ] #' #' # A principal component analysis using 5 components #' pca_projected <- ortho_projection(train_x, pc_selection = list("manual", 5)) @@ -234,7 +238,7 @@ #' # A principal components projection using the "opc" method #' # for the selection of the optimal number of components #' pca_projected_2 <- ortho_projection( -#' Xr = train_x, Xu = test_x, Yr, +#' Xr = train_x, Xu = test_x, Yr = train_y, #' method = "pca", #' pc_selection = list("opc", 40) #' ) @@ -244,7 +248,7 @@ #' # A partial least squares projection using the "opc" method #' # for the selection of the optimal number of components #' pls_projected <- ortho_projection( -#' Xr = train_x, Xu = test_x, Yr, +#' Xr = train_x, Xu = test_x, Yr = train_y, #' method = "pls", #' pc_selection = list("opc", 40) #' ) @@ -254,7 +258,7 @@ #' # A partial least squares projection using the "cumvar" method #' # for the selection of the optimal number of components #' pls_projected_2 <- ortho_projection( -#' Xr = train_x, Yr = train_y, Xu = test_x, +#' Xr = train_x, Xu = test_x, Yr = train_y, #' method = "pls", #' pc_selection = list("cumvar", 0.99) #' ) @@ -343,7 +347,7 @@ ortho_projection <- function(Xr, Xu = NULL, ) mthd <- "pls" } else { - mthd <- if_else(method == "pca", "pca (svd)", "pca (nipals)") + mthd <- ifelse(method == "pca", "pca (svd)", "pca (nipals)") proj <- pc_projection( Xr = Xr, Yr = Yr, Xu = Xu, pc_selection = pc_selection, center = center, scale = scale, method = method, ... 
@@ -393,7 +397,7 @@ pc_projection <- function(Xr, Xu = NULL, Yr = NULL, )) } } - + ny <- ncol(Yr) if (!is.null(Xu)) { @@ -552,7 +556,7 @@ pc_projection <- function(Xr, Xu = NULL, Yr = NULL, scale = sd_vector ) colnames(fresults$variance) <- rownames(fresults$X_loadings) - fresults$method <- if_else(method == "pca", "pca (svd)", "pca (nipals)") + fresults$method <- ifelse(method == "pca", "pca (svd)", "pca (nipals)") if (pc_selection_method == "opc") { fresults$opc_evaluation <- results } diff --git a/R/ortho_projection_helpers.R b/R/ortho_projection_helpers.R index 9ec57a5..0d263e6 100644 --- a/R/ortho_projection_helpers.R +++ b/R/ortho_projection_helpers.R @@ -67,7 +67,7 @@ eval_multi_pc_diss <- function(scores, ith_result <- extract_sim_results(tmp) results[i, 1:n_cols_results] <- unlist(ith_result$result) } - + colnames(results) <- ith_result$measure_names eval_pcs <- matrix(eval_pcs, dimnames = list(eval_pcs, method)) results <- cbind(pc = eval_pcs, results) @@ -101,7 +101,9 @@ check_pc_arguments <- function(n_rows_x, n_cols_x, pc_selection, if (pc_selection_method %in% c("opc", "manual")) { if (length(pc_selection) == 1) { treshold_comp <- min(n_rows_x, n_cols_x) - treshold_comp <- if_else(treshold_comp > default_max_comp, + + + treshold_comp <- ifelse(treshold_comp > default_max_comp, default_max_comp, treshold_comp ) diff --git a/R/plot.mbl.R b/R/plot.mbl.R index 5947074..3b6f44a 100644 --- a/R/plot.mbl.R +++ b/R/plot.mbl.R @@ -1,20 +1,17 @@ #' @title Plot method for an object of class \code{mbl} #' @description -#' -#' \lifecycle{maturing} -#' #' Plots the content of an object of class \code{mbl} #' @aliases plot.mbl #' @usage \method{plot}{mbl}(x, g = c("validation", "gh"), param = "rmse", pls_c = c(1,2), ...) #' @param x an object of class \code{mbl} (as returned by \code{mbl}). #' @param g a character vector indicating what results shall be plotted. #' Options are: \code{"validation"} (for plotting the validation results) and/or -#' \code{"gh"} (for plotting the pls scores used to compute the GH distance. +#' \code{"gh"} (for plotting the pls scores used to compute the GH distance. #' See details). -#' @param param a character string indicating what validation statistics shall be -#' plotted. The following options are available: \code{"rmse"}, \code{"st_rmse"} -#' or \code{"r2"}. These options only available if the \code{mbl} object contains -#' validation results. +#' @param param a character string indicating what validation statistics shall be +#' plotted. The following options are available: \code{"rmse"}, \code{"st_rmse"} +#' or \code{"r2"}. These options only available if the \code{mbl} object contains +#' validation results. #' @param pls_c a numeric vector of length one or two indicating the pls factors to be #' plotted. Default is \code{c(1, 2)}. It is only available if \code{"gh"} is #' specified in the \code{g} argument. @@ -28,7 +25,7 @@ #' @author Leonardo Ramirez-Lopez and Antoine Stevens #' @seealso \code{\link{mbl}} #' @examples -#' \dontrun{ +#' \donttest{ #' library(prospectr) #' #' data(NIRsoil) @@ -77,10 +74,12 @@ plot.mbl <- function(x, g = c("validation", "gh"), param = "rmse", pls_c = c(1, 2), ...) { - original_set <- par()$mfrow + + opar <- par("mfrow", "mar") + on.exit(par(opar)) + if (length(g) != 1 & !is.null(x$gh)) { op <- par(mfrow = c(1, 2)) - on.exit(par(op)) } plot_dots <- list(...) 
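# Editor's note (not part of the diff): the opar/on.exit pair introduced above
# is the usual CRAN-safe idiom for restoring a user's graphics settings; a
# general form is:
#   opar <- par(no.readonly = TRUE)
#   on.exit(par(opar), add = TRUE)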
@@ -159,9 +158,9 @@ plot.mbl <- function(x, dt <- !is.element(dtn, opt[!is.element(opt, param)]) idv <- ifelse("k" %in% colnames(tpl), "k", "k_diss") dt <- as.logical(dt * (!dtn %in% "p_bounded")) - to_plot <- tpl %>% - select(names(tpl)[dt]) %>% - reshape( + + to_plot <- data.frame(tpl)[, dt] %>% + stats::reshape( timevar = "val", idvar = idv, direction = "wide" @@ -186,7 +185,7 @@ plot.mbl <- function(x, )) grid( nx = NULL, ny = NULL, col = rgb(0.3, 0.3, 0.3, 0.1), lty = 1, - lwd = par("lwd"), equilogs = TRUE + lwd = 1, equilogs = TRUE ) mtext("Validation results", col = grey(0.3)) # Adding a legend @@ -242,7 +241,7 @@ plot.mbl <- function(x, points(tp[tp$set == "Xu", 1:2], xlim = rng, ylim = rng, col = xu_col, pch = plot_dots$pch) grid( nx = NULL, ny = NULL, col = rgb(0.3, 0.3, 0.3, 0.1), lty = 1, - lwd = par("lwd"), equilogs = TRUE + lwd = 1, equilogs = TRUE ) legend("topleft", legend = c("Xr", "Xu"), @@ -273,7 +272,7 @@ plot.mbl <- function(x, points(xu_scores[, pls_c, drop = FALSE], col = xu_col, pch = plot_dots$pch) grid( nx = NULL, ny = NULL, col = rgb(0.3, 0.3, 0.3, 0.1), lty = 1, - lwd = par("lwd"), equilogs = TRUE + lwd =1, equilogs = TRUE ) legend("topright", legend = c("Xr", "Xu"), @@ -304,7 +303,6 @@ plot.mbl <- function(x, mtext(main, outer = TRUE, cex = 2, line = -2) # par(ask = original_set) # op <- par(ask = original_set) - on.exit(par(mfrow = original_set)) # par(mfrow = pm) # title(main = "Memory-based learning results") # dev.flush() diff --git a/R/plot.ortho_projection.R b/R/plot.ortho_projection.R index b080729..a1f43e8 100644 --- a/R/plot.ortho_projection.R +++ b/R/plot.ortho_projection.R @@ -1,8 +1,5 @@ #' @title Plot method for an object of class \code{ortho_projection} #' @description -#' -#' \lifecycle{maturing} -#' #' Plots objects of class \code{ortho_projection} #' @aliases plot.ortho_projection #' @usage \method{plot}{ortho_projection}(x, col = "dodgerblue", ...) @@ -20,7 +17,7 @@ plot.ortho_projection <- function(x, col = "dodgerblue", ...) { } else { x_variance <- x$variance } - + if (x$pc_selection$method == "opc") { tpl <- x$opc_evaluation[, c(1, ncol(x$opc_evaluation))] if ("mean_standardized_rmsd_Yr" %in% colnames(tpl)) { @@ -33,24 +30,26 @@ plot.ortho_projection <- function(x, col = "dodgerblue", ...) { ylab <- "kappa index" } plot(tpl, - type = "p", - ylab = ylab, pch = 1, col = col, ... + type = "p", + ylab = ylab, pch = 1, col = col, ... ) grid() - segments(tpl[,1], 0, tpl[,1], tpl[,2], col = col) - } else{ + segments(tpl[, 1], 0, tpl[, 1], tpl[, 2], col = col) + } else { + opar <- par("mfrow") + on.exit(par(opar)) + o_mfrow <- par()$mfrow par(mfrow = c(1, 2)) barplot(x_variance[grep("^explained_var", rownames(x_variance)), ], - horiz = F, - names.arg = colnames(x_variance), ylim = c(0, 1), - ylab = "Explained variance", col = col, ... + horiz = F, + names.arg = colnames(x_variance), ylim = c(0, 1), + ylab = "Explained variance", col = col, ... ) barplot(x_variance[grep("cumulative", rownames(x_variance)), ], - horiz = F, - names.arg = colnames(x_variance), ylim = c(0, 1), - ylab = "Explained variance (cumulative)", col = col, ... + horiz = F, + names.arg = colnames(x_variance), ylim = c(0, 1), + ylab = "Explained variance (cumulative)", col = col, ... ) - par(mfrow = o_mfrow) } } diff --git a/R/print.ortho_projection.R b/R/print.ortho_projection.R index 5a67cc4..a0f6387 100644 --- a/R/print.ortho_projection.R +++ b/R/print.ortho_projection.R @@ -19,7 +19,7 @@ print.ortho_projection <- function(x, ...) 
{ } else { mss <- c("\n", "Explained variance in Xr: \n") } - + cat("\n", "Standard deviations, individual explained variance, cumulative explained variance:", "\n") if (x$method == "pls") { cat(mss) diff --git a/R/resemble.R b/R/resemble.R index 5ecdd74..44eaf91 100644 --- a/R/resemble.R +++ b/R/resemble.R @@ -6,18 +6,18 @@ #' @import data.table #' @import grDevices #' @import graphics +#' @import mathjaxr ## usethis namespace: start #' @importFrom lifecycle deprecate_soft ## usethis namespace: end #' @importFrom magrittr %>% -#' @importFrom dplyr if_else select #' @importFrom utils setTxtProgressBar txtProgressBar #' @importFrom stats model.frame model.matrix model.extract na.fail sd reshape #' @description -#' -#' \lifecycle{maturing} +#' \ifelse{html}{\out{Maturing lifecycle}}{\strong{Maturing}} #' #' Functions for memory-based learning +#' #' \if{html}{\figure{logo.png}{options: align='right' alt='logo' width='120'}} #' #' @details @@ -27,9 +27,10 @@ #' modeling complex spectral data (e.g. NIR, IR). #' The package includes functions for dimensionality reduction, #' computing spectral dissimilarity matrices, nearest neighbor search, -#' and modeling spectral data using memory-based learning. -#' -#' Development versions can be found in the github repository of the package +#' and modeling spectral data using memory-based learning. This package builds +#' upon the methods presented in Ramirez-Lopez et al. (2013). +#' +#' Development versions can be found in the github repository of the package #' at \href{https://github.com/l-ramirez-lopez/resemble}{https://github.com/l-ramirez-lopez/resemble}. #' #' The functions available for dimensionality reduction are: @@ -69,19 +70,35 @@ #' @name resemble-package #' @aliases resemble-package resemble #' @title Overview of the functions in the resemble package -#' @author -#' -#' \href{https://orcid.org/0000-0002-5369-5120}{Leonardo Ramirez-Lopez} [aut, cre] -#' -#' \href{https://orcid.org/0000-0002-1588-7519}{Antoine Stevens} [ctb] -#' -#' \href{https://orcid.org/0000-0003-1540-4748}{Raphael Viscarra Rossel} [ctb] -#' -#' \href{https://orcid.org/0000-0001-5416-8640}{Craig Lobsey} [ctb] -#' -#' \href{https://orcid.org/0000-0001-7325-9716}{Alex Wadoux} [ctb] -#' -#' \href{https://orcid.org/0000-0001-5695-8064}{Timo Breure} [ctb] +#' @references +#' Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., +#' Scholten, T. 2013a. The spectrum-based learner: A new local approach for +#' modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, +#' 268-279.
+#' @seealso +#' Useful links: +#' \itemize{ +#' \item \url{https://github.com/l-ramirez-lopez/resemble} +#' \item Report bugs at \url{https://github.com/l-ramirez-lopez/resemble/issues} +#' } +#' @author +#' +#' \strong{Maintainer / Creator}: Leonardo Ramirez-Lopez \email{ramirez.lopez.leo@gmail.com} +#' +#' Authors: +#' \itemize{ +#' \item Leonardo Ramirez-Lopez (\href{https://orcid.org/0000-0002-5369-5120}{ORCID}) +#' +#' \item Antoine Stevens (\href{https://orcid.org/0000-0002-1588-7519}{ORCID}) +#' +#' \item Raphael Viscarra Rossel (\href{https://orcid.org/0000-0003-1540-4748}{ORCID}) +#' +#' \item Craig Lobsey (\href{https://orcid.org/0000-0001-5416-8640}{ORCID}) +#' +#' \item Alex Wadoux (\href{https://orcid.org/0000-0001-7325-9716}{ORCID}) +#' +#' \item Timo Breure (\href{https://orcid.org/0000-0001-5695-8064}{ORCID}) +#' } ###################################################################### # resemble # Copyright (C) 2020 Leonardo Ramirez-Lopez diff --git a/R/sample_stratified.R b/R/sample_stratified.R index 51327ba..7e72582 100644 --- a/R/sample_stratified.R +++ b/R/sample_stratified.R @@ -21,7 +21,7 @@ #' @keywords internal sample_stratified <- function(y, p, number, group = NULL, replacement = FALSE) { - + ## If the percentage of samples to build the hold_in subset is below 50% of ## the total number of samples, the selection is based on the number of samples ## to retain. @@ -35,11 +35,11 @@ sample_stratified <- function(y, p, number, group = NULL, replacement = FALSE) { p_to_sample <- 1 - p do_sampling_for <- "validation" } - + if (is.null(group)) { nv <- floor(p_to_sample * nrow(y)) nv <- ifelse(nv < 1, 1, nv) - + if (p >= 0.5) { n_val <- nv if (replacement) { @@ -55,22 +55,22 @@ sample_stratified <- function(y, p, number, group = NULL, replacement = FALSE) { n_cal <- nrow(y) - n_val } } - + strata_category <- optim_sample_strata( y = y, n = nv ) - + calibration_indices <- matrix(NA, n_cal, number) validation_indices <- matrix(NA, n_val, number) - + colnames(calibration_indices) <- colnames(validation_indices) <- paste0( "Resample_", seq(1:number) ) rownames(calibration_indices) <- paste0("index_", seq(1:n_cal)) rownames(validation_indices) <- paste0("index_", seq(1:nrow(validation_indices))) - + indcs <- 1:nrow(y) for (jj in 1:number) { strata_samples <- get_samples_from_strata( @@ -80,7 +80,7 @@ sample_stratified <- function(y, p, number, group = NULL, replacement = FALSE) { replacement = replacement, sampling_for = do_sampling_for ) - + calibration_indices[, jj] <- strata_samples$calibration validation_indices[, jj] <- strata_samples$validation } @@ -99,10 +99,10 @@ sample_stratified <- function(y, p, number, group = NULL, replacement = FALSE) { INDEX = y_groups$group, FUN = mean ) - + nv <- floor(p_to_sample * n_levels) nv <- ifelse(nv < 1, 1, nv) - + if (p >= 0.5) { n_val <- nv if (replacement) { @@ -118,21 +118,20 @@ sample_stratified <- function(y, p, number, group = NULL, replacement = FALSE) { n_cal <- nrow(y) - n_val } } - + strata_category <- optim_sample_strata( y = aggregated_y, n = nv ) strata_category$strata <- paste0("l_", strata_category$strata) - + calibration_indices <- matrix(0, nv, number) colnames(calibration_indices) <- paste0("Resample_", seq(1:number)) rownames(calibration_indices) <- paste0("index_", seq(1:nv)) - + calibration_groups_indices <- validation_groups_indices <- NULL - + for (jj in 1:number) { - strata_samples <- get_samples_from_strata( original_order = strata_category$sample_strata$original_order, strata = 
strata_category$sample_strata$strata, @@ -140,22 +139,22 @@ sample_stratified <- function(y, p, number, group = NULL, replacement = FALSE) { sampling_for = do_sampling_for, replacement = replacement ) - + if (replacement) { sel_sample_indices <- lapply(1:length(strata_samples$calibration), - FUN = function(ith, full_groups, cal_groups) { - ## this equivalent to extract from - ## y_groups$original_order - which(full_groups %in% cal_groups[ith]) - }, - full_groups = y_groups$group, - cal_groups = gr_levels[strata_samples$calibration] + FUN = function(ith, full_groups, cal_groups) { + ## this equivalent to extract from + ## y_groups$original_order + which(full_groups %in% cal_groups[ith]) + }, + full_groups = y_groups$group, + cal_groups = gr_levels[strata_samples$calibration] ) sel_sample_indices <- do.call("c", sel_sample_indices) } else { sel_sample_indices <- y_groups$original_order[y_groups$group %in% gr_levels[strata_samples$calibration]] } - + calibration_groups_indices[[jj]] <- sel_sample_indices validation_groups_indices[[jj]] <- y_groups$original_order[y_groups$group %in% gr_levels[strata_samples$validation]] } @@ -175,16 +174,16 @@ sample_stratified <- function(y, p, number, group = NULL, replacement = FALSE) { ## replacements are necessary! n_cal_missing <- max(lengths_list_calibration_indices) - lengths_list_calibration_indices[jj] jj_add_list_calibration_indices <- sample(calibration_groups_indices[[jj]], - size = n_cal_missing, - replace = TRUE + size = n_cal_missing, + replace = TRUE ) - + n_val_missing <- max(lengths_list_validation_indices) - lengths_list_validation_indices[jj] jj_add_list_validation_indices <- sample(validation_groups_indices[[jj]], - size = n_val_missing, - replace = TRUE + size = n_val_missing, + replace = TRUE ) - + calibration_indices[1:lengths_list_calibration_indices[jj], jj] <- calibration_groups_indices[[jj]] validation_indices[1:lengths_list_validation_indices[jj], jj] <- validation_groups_indices[[jj]] calibration_indices[-(1:lengths_list_calibration_indices[jj]), jj] <- jj_add_list_calibration_indices @@ -208,18 +207,18 @@ sample_stratified <- function(y, p, number, group = NULL, replacement = FALSE) { #' @keywords internal get_sample_strata <- function(y, n) { y_strata <- unique(quantile(y, - probs = seq(0, 1, length = (n + 1)), - names = FALSE + probs = seq(0, 1, length = (n + 1)), + names = FALSE )) - - + + strata_labels <- 1:(length(y_strata) - 1) y_cuts <- cut(y, - breaks = y_strata, - labels = strata_labels, - include.lowest = TRUE + breaks = y_strata, + labels = strata_labels, + include.lowest = TRUE ) - + strata_category <- data.table( original_order = 1:length(y), strata = y_cuts @@ -243,29 +242,29 @@ optim_sample_strata <- function(y, n) { v_missing <- rep(0, length(s_missing)) names(v_missing) <- s_missing table_strata <- c(table_strata, v_missing) - + new_n <- ceiling(n / (iter + 1)) # ???????????????? 
- + sample_strata <- get_sample_strata(y, new_n) table_strata <- table(sample_strata$strata) - + condition_1 <- new_min_samples_per_strata < min(table_strata) condition_2 <- length(table(sample_strata$strata)) == new_n condition_3 <- new_min_samples_per_strata >= n - - + + if ((condition_1 & condition_2) | condition_3) { break } iter <- iter + 1 } - + samples_to_get_no_replacement <- rep( new_min_samples_per_strata / 2, nlevels(sample_strata$strata) ) - - + + samples_to_get <- data.table( strata = levels(sample_strata$strata), samples_to_get = samples_to_get_no_replacement @@ -283,7 +282,7 @@ optim_sample_strata <- function(y, n) { samples_to_get = 1 ) } - + list( sample_strata = sample_strata, samples_to_get = samples_to_get @@ -312,7 +311,7 @@ get_samples_from_strata <- function(original_order, samples_per_strata$samples_to_get <- 2 * samples_per_strata$samples_to_get with_replacement <- TRUE } - + get_random_sample <- function(x, ns) { if (length(x) == 1) { # this is required to keep the name of the @@ -321,39 +320,39 @@ get_samples_from_strata <- function(original_order, } sample(x, size = ns) } - + ## for selecting the replacement samples in cases where a strata has only one ## sample, the replacement sample is randomly selected from the data max_samples <- max(samples_per_strata$samples_to_get) vec_samples <- rep(NA, max_samples) - + strata_samples <- lapply(levels(strata), - FUN = function(strata, - original_order, - samples_per_strata, - vec_samples, - replacement, - ii) { - ith_n <- samples_per_strata$samples_to_get[samples_per_strata$strata == ii] - ith_set <- original_order[which(strata == ii)] - ith_sel <- get_random_sample(ith_set, ith_n) - if (replacement) { - ln <- (length(ith_sel) / 2) - vec_samples[1:ln] <- ith_sel[1:ln] - vec_samples[(length(vec_samples) - ln + 1):length(vec_samples)] <- ith_sel[-(1:ln)] - } else { - vec_samples[1:length(ith_sel)] <- ith_sel - } - - vec_samples - }, - strata = strata, - original_order = original_order, - samples_per_strata = samples_per_strata, - vec_samples = vec_samples, - replacement = with_replacement + FUN = function(strata, + original_order, + samples_per_strata, + vec_samples, + replacement, + ii) { + ith_n <- samples_per_strata$samples_to_get[samples_per_strata$strata == ii] + ith_set <- original_order[which(strata == ii)] + ith_sel <- get_random_sample(ith_set, ith_n) + if (replacement) { + ln <- (length(ith_sel) / 2) + vec_samples[1:ln] <- ith_sel[1:ln] + vec_samples[(length(vec_samples) - ln + 1):length(vec_samples)] <- ith_sel[-(1:ln)] + } else { + vec_samples[1:length(ith_sel)] <- ith_sel + } + + vec_samples + }, + strata = strata, + original_order = original_order, + samples_per_strata = samples_per_strata, + vec_samples = vec_samples, + replacement = with_replacement ) - + strata_samples <- do.call("rbind", strata_samples) if (replacement & sampling_for == "validation") { col_s <- 1:(ncol(strata_samples) / 2) @@ -365,8 +364,8 @@ get_samples_from_strata <- function(original_order, } else { strata_samples <- as.matrix(sort(strata_samples)) } - - + + if (sampling_for == "validation") { if (replacement) { unique_sample_strata <- levels(strata)[strata_samples[, 1] == strata_samples[, 2]] @@ -374,16 +373,16 @@ get_samples_from_strata <- function(original_order, solve_replacement <- sample(original_order[-strata_samples[, 2]], length(unique_sample_strata)) strata_samples[unique_sample_strata, 2] <- solve_replacement } - + replacement_indices <- strata_samples[, 2] } else { replacement_indices <- NULL } - + keep <- 
original_order[!original_order %in% strata_samples[, 1]] exclude <- as.vector(sort(strata_samples[, 1])) } - + if (sampling_for == "calibration") { keep <- strata_samples[, 1] exclude <- original_order[!original_order %in% keep] @@ -393,9 +392,9 @@ replacement_indices <- NULL } } - + keep <- sort(as.vector(c(keep, replacement_indices))) - + strata_samples <- list( calibration = keep, validation = exclude diff --git a/R/search_neighbors.R b/R/search_neighbors.R index 8bd603e..1cb37df 100644 --- a/R/search_neighbors.R +++ b/R/search_neighbors.R @@ -1,9 +1,6 @@ #' @title A function for searching in a given reference set the neighbors of #' another given set of observations (search_neighbors) #' @description -#' -#' \lifecycle{maturing} -#' #' \loadmathjax #' This function searches in a reference set the neighbors of the observations #' provided in another set. @@ -59,7 +56,7 @@ #' \itemize{ #' \item{\code{diss_method = "pls"}} #' \item{\code{diss_method = "pca"} with \code{"opc"} used as the method -#' in the \code{pc_selection} argument. See \code{\link{ortho_diss}.}} +#' in the \code{pc_selection} argument. See [ortho_diss()].} #' } #' @param k an integer value indicating the k-nearest neighbors of each #' observation in \code{Xu} that must be selected from \code{Xr}. @@ -89,30 +86,32 @@ #' (of set of observations) is the one for which its distance matrix #' minimizes the differences between the \code{Yr} value of each #' observation and the \code{Yr} value of its closest observation. In this -#' case \code{value} must be a value (larger than 0 and -#' below \code{min(nrow(Xr), nrow(X2), ncol(Xr))}) indicating the maximum -#' number of principal components to be tested. See the -#' \code{\link{ortho_projection}} function for more details.} +#' case \code{value} must be a value (larger than 0 and below the +#' minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} combined) +#' indicating the maximum number of principal components to be tested. +#' See the \code{\link{ortho_projection}} function for more details.} #' #' \item{\code{"cumvar"}:}{ selection of the principal components based #' on a given cumulative amount of explained variance. In this case, #' \code{value} must be a value (larger than 0 and below or equal to 1) -#' indicating the minimum amount of cumulative variance that the +#' indicating the minimum amount of cumulative variance that the #' combination of retained components should explain.} #' #' \item{\code{"var"}:}{ selection of the principal components based #' on a given amount of explained variance. In this case, #' \code{value} must be a value (larger than 0 and below or equal to 1) -#' indicating the minimum amount of variance that a single component +#' indicating the minimum amount of variance that a single component #' should explain in order to be retained.} #' #' \item{\code{"manual"}:}{ for manually specifying a fixed number of #' principal components. In this case, \code{value} must be a value -#' (larger than 0 and \code{min(nrow(Xr), nrow(X2), ncol(Xr))}). +#' (larger than 0 and below the +#' minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} combined) +#' indicating the number of components to be retained.} #' } -#' The default list passed is \code{list(method = "var", value = 0.01)}. +#' The default is \code{list(method = "var", value = 0.01)}.
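# Editor's sketch (not part of the diff): the two equivalent ways of passing
# pc_selection described above and in the paragraph that follows:
#   search_neighbors(Xr, Xu, diss_method = "pca", k = 10,
#                    pc_selection = list(method = "var", value = 0.01))
#   search_neighbors(Xr, Xu, diss_method = "pca", k = 10,
#                    pc_selection = "var")  # default value is filled in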
+#' #' Optionally, the \code{pc_selection} argument admits \code{"opc"} or #' \code{"cumvar"} or \code{"var"} or \code{"manual"} as a single character #' string. In such a case the default \code{"value"} when either \code{"opc"} or @@ -184,13 +183,14 @@ #' \item{\code{dissimilarity}}{ If \code{return_dissimilarity = TRUE} the #' dissimilarity object used (as computed by the \code{\link{dissimilarity}} #' function.} -#' \item{\code{projection}}{ an \code{ortho_projection} object. Only output if +#' \item{\code{projection}}{ an \code{ortho_projection} object. Only output if #' \code{return_projection = TRUE} and if \code{diss_method = "pca"}, #' \code{diss_method = "pca.nipals"} or \code{diss_method = "pls"}. +#' #' This object contains the projection used to compute #' the dissimilarity matrix. In case of local dissimilarity matrices, #' the projection corresponds to the global projection used to select the -#' neighborhoods (see \code{\link{ortho_diss}} function for further +#' neighborhoods. (see \code{\link{ortho_diss}} function for further #' details).} #' } #' @author \href{https://orcid.org/0000-0002-5369-5120}{Leonardo Ramirez-Lopez} @@ -206,7 +206,7 @@ #' \code{\link{cor_diss}} \code{\link{f_diss}} \code{\link{sid}} #' \code{\link{mbl}} #' @examples -#' \dontrun{ +#' \donttest{ #' library(prospectr) #' #' data(NIRsoil) @@ -293,13 +293,15 @@ ## - scaled renamed to scale ## - pcMethod and cores are deprecated -search_neighbors <- function(Xr, Xu, diss_method = c("pca", - "pca.nipals", - "pls", - "cor", - "euclid", - "cosine", - "sid"), +search_neighbors <- function(Xr, Xu, diss_method = c( + "pca", + "pca.nipals", + "pls", + "cor", + "euclid", + "cosine", + "sid" + ), Yr = NULL, k, k_diss, k_range, spike = NULL, @@ -309,8 +311,8 @@ search_neighbors <- function(Xr, Xu, diss_method = c("pca", ws = NULL, center = TRUE, scale = FALSE, documentation = character(), ...) 
{ - - + + # Sanity checks match.arg(diss_method, c( "pca", "pca.nipals", "pls", "cor", "euclid", "cosine", "sid" )) - + if (missing(k)) { k <- NULL } - + if (missing(k_diss)) { k_diss <- NULL } - + if (missing(k_range)) { k_range <- NULL } - + if (!is.logical(center)) { stop("'center' argument must be logical") } - + if (!is.logical(scale)) { stop("'scale' argument must be logical") } - + if (diss_method == "cor") { if (!is.null(ws)) { if (ws < 3 | ws > (ncol(Xr) - 1) | length(ws) != 1 | (ws %% 2) == 0) { @@ -352,18 +354,18 @@ } } } - + if (!is.null(k) & !is.null(k_diss)) { # if k and k_diss are not called here, errors are thrown during checks k k_diss stop("Only one of k or k_diss can be specified") } - + if (is.null(k) & is.null(k_diss)) { stop("Either k or k_diss must be specified") } - + if (!is.null(k)) { k <- as.integer(k) if (k < 1) { @@ -377,7 +379,7 @@ } kk <- k } - + if (!is.null(k_diss)) { # if k_diss is not called here, errors are thrown during checks k_diss @@ -418,12 +420,12 @@ if (input_dots$pre_k < kk) { stop(paste0( "pre_k must be larger than ", - if_else(is.null(k), "max(k_range)", "k") + ifelse(is.null(k), "max(k_range)", "k") )) } } } - + if (!is.null(spike)) { if (!is.vector(spike)) { stop("spike must be a vector of integers") @@ -432,11 +434,21 @@ stop("spike must be a vector of integers") } if (length(spike) >= nrow(Xr)) { - stop("Argument spike cannot be larger or equal to the number of rows of Xr") + stop("The length of spike cannot be larger than or equal to the number of rows of Xr") } - if (max(spike) >= nrow(Xr)) { + if (max(spike) > nrow(Xr)) { stop("Argument spike contains indices subscript out of bounds of Xr") } + if (!is.null(k)) { + if (min(k) <= length(spike)) { + stop("values for k must be larger than length(spike)") + } + } + if (!is.null(k_diss)) { + if (min(k_range) <= length(spike)) { + stop("values for k_range must be larger than length(spike)") + } + } spike <- sort(unique(as.integer(spike))) } dsm <- dissimilarity( @@ -451,19 +463,19 @@ scale = scale, ... ) - + results <- diss_to_neighbors(dsm$dissimilarity, - k = k, k_diss = k_diss, k_range = k_range, - spike = spike, - return_dissimilarity = return_dissimilarity + k = k, k_diss = k_diss, k_range = k_range, + spike = spike, + return_dissimilarity = return_dissimilarity ) - + if (return_projection & diss_method %in% c("pca", "pca.nipals", "pls")) { results$projection <- dsm$projection } if ("gh" %in% names(input_dots)) { results$gh <- dsm$gh } - + results } diff --git a/R/sid.R b/R/sid.R index ee6c154..1fe5ea8 100644 --- a/R/sid.R +++ b/R/sid.R @@ -1,10 +1,9 @@ #' @title A function for computing the spectral information divergence between #' spectra (sid) #' @description -#' -#' \lifecycle{experimental} -#' #' \loadmathjax +#' \loadmathjax +#' \ifelse{html}{\out{Experimental lifecycle}}{\strong{Experimental}} +#' #' This function computes the spectral information divergence/dissimilarity between #' spectra based on the Kullback-Leibler divergence algorithm (see details).
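# Editor's sketch (not part of the diff): the symmetric Kullback-Leibler form
# that underlies the spectral information divergence, for two spectra
# normalized to probability-like vectors p and q:
#   kl <- function(p, q) sum(p * log(p / q))
#   sid_pq <- kl(p, q) + kl(q, p)
# p and q are placeholders; see the function's details for the exact variant.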
#' @usage @@ -131,7 +130,7 @@ #' @author Leonardo Ramirez-Lopez #' @importFrom stats bw.nrd0 density #' @examples -#' \dontrun{ +#' \donttest{ #' library(prospectr) #' #' data(NIRsoil) diff --git a/R/sim_eval.R b/R/sim_eval.R index a271042..64a9d8b 100644 --- a/R/sim_eval.R +++ b/R/sim_eval.R @@ -1,7 +1,8 @@ #' @title A function for evaluating dissimilarity matrices (sim_eval) #' @description +#' \loadmathjax #' -#' \lifecycle{stable} +#' \ifelse{html}{\out{Stable lifecycle}}{\strong{Stable}} #' #' This function searches for the most similar observation (closest neighbor) of #' each observation in a given data set based on a dissimilarity (e.g. distance @@ -24,7 +25,6 @@ #' observations in terms of the side information provided. If it is a character #' variable, then the kappa index is used. See details. #' @details -#' \loadmathjax #' For the evaluation of dissimilarity matrices this function uses side #' information (information about one variable which is available for a #' group of observations, Ramirez-Lopez et al., 2013). It is assumed that there @@ -34,16 +34,16 @@ #' is used for assessing the similarity between the observations and their #' corresponding most similar observations in terms of the side information #' provided. It is computed as follows: -#' +#' #' \mjdeqn{j(i) = NN(xr_i, Xr^{\{-i\}})}{j(i) = NN(xr_i, Xr^{\{-i\}})} #' \mjdeqn{RMSD = \sqrt{\frac{1}{m} \sum_{i=1}^m {(y_i - y_{j(i)})^2}}}{RMSD = \sqrt{1/m sum_{i=1}^m (y_i - y_{j(i)})^2}} #' -#' where \mjeqn{NN(xr_i, Xr^{-i})}{NN(xr_i, Xr^{-i})} represents a function to -#' obtain the index of the nearest neighbor observation found in \mjeqn{Xr}{Xr} -#' (excluding the \mjeqn{i}{i}th observation) for \mjeqn{xr_i}{xr_i}, -#' \mjeqn{y_{i}}{y_i} is the value of the side variable of the \mjeqn{i}{i}th -#' observation, \mjeqn{y_{j(i)}}{y_{j(i)}} is the value of the side variable of -#' the nearest neighbor of the \mjeqn{i}{i}th observation and \mjeqn{m}{m} is +#' where \mjeqn{NN(xr_i, Xr^{-i})}{NN(xr_i, Xr^{-i})} represents a function to +#' obtain the index of the nearest neighbor observation found in \mjeqn{Xr}{Xr} +#' (excluding the \mjeqn{i}{i}th observation) for \mjeqn{xr_i}{xr_i}, +#' \mjeqn{y_{i}}{y_i} is the value of the side variable of the \mjeqn{i}{i}th +#' observation, \mjeqn{y_{j(i)}}{y_{j(i)}} is the value of the side variable of +#' the nearest neighbor of the \mjeqn{i}{i}th observation and \mjeqn{m}{m} is #' the total number of observations. #' #' If \code{side_info} is a factor the kappa index (\mjeqn{\kappa}{kappa}) is @@ -61,8 +61,6 @@ #' case, the vector must represent the lower triangle of a dissimilarity matrix #' (e.g. as returned by the [stats::dist()] function of \code{stats}). #' -#' This function supports multi-threading based on OpenMP for retrieving the -#' closest observations. #' @return \code{sim_eval} returns a list with the following components: #' \itemize{ #' \item{\code{eval}}{ either the RMSD (and the correlation coefficient) or @@ -82,7 +80,7 @@ #' Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search #' metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53. #' @examples -#' \dontrun{ +#' \donttest{ #' library(prospectr) #' data(NIRsoil) #' diff --git a/README.md b/README.md index bbd13f7..8fc560f 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@ -_Last update: 12.10.2020_ +_Last update: 29.10.2020_

Think Globally, Fit Locally (Saul and Roweis, 2003)

@@ -22,13 +22,13 @@ Check the package vignette(s)! ## Core functionality -The core functionality of the package can be sumarized into the following +The core functionality of the package can be summarized into the following functions: __`mbl`__: implements memory-based learning (MBL) for modeling and predicting continuous response variables. For example, it can be used to reproduce the famous LOCAL algorithm proposed by Shenk et al. (1997). In general, this -function allowes you to easily customize your own MBL regression-prediction +function allows you to easily customize your own MBL regression-prediction method. __`dissimilarity`__: Computes dissimilarity matrices based on various methods @@ -40,8 +40,8 @@ __`ortho_projection`__: A function for dimensionality reduction using either principal component analysis or partial least squares (a.k.a projection to latent structures). -__`search_neighbors`__: A function to efficiently retrieve from a refence set the -k-nearest neighbors of another given data set. +__`search_neighbors`__: A function to efficiently retrieve from a reference set +the k-nearest neighbors of another given data set. ## New version During the recent lockdown we invested some of our free time to come up @@ -69,8 +69,8 @@ just type the following line in your `R` console: install.packages('resemble') ``` -If you do not have the following packages installed, it might be good to install -them first +If you do not have the following packages installed, it might be good to +update/install them first ``` install.packages('Rcpp') install.packages('RcppArmadillo') @@ -129,8 +129,9 @@ plot(sbl) get_predictions(sbl) ````
Figure 1. Standard plot of the results of the __`mbl`__ function. [`resemble`](http://l-ramirez-lopez.github.io/resemble/) implements functions @@ -168,6 +169,20 @@ __`'pls'`__: Partial least squares. __`'wapls'`__: Weighted average partial least squares (Shenk et al., 1997). +Figure 2 illustrates the basic steps in MBL for a set of five observations.

+Figure 2. Example of the main steps in memory-based learning for predicting a response variable in five different observations based on a set of p-dimensional variables. + + ## Citing the package +Simply type the following and you will get the citation info you need: +``` +citation(package = "resemble") +``` + ## News * 2020.04: [Tsakiridis et al. (2020)](https://www.sciencedirect.com/science/article/abs/pii/S0016706119308870?via%3Dihub), used the optimal principal components dissimilarity method implemented in `resemble` in combination with convolutional neural networks for simultaneous prediction of soil properties from vis-NIR spectra. @@ -233,13 +248,13 @@ multivariate prediction of soil organic carbon by Vis-NIR spectra. Geoderma, comparison of different approaches to multi-product brix calibration in near-infrared spectroscopy. In OCM 2017-Optical Characterization of Materials-conference proceedings (p. 129). KIT Scientific Publishing). -* 2016-05: In [this recent scientific paper](http://www.sciencedirect.com/science/article/pii/S001670611630180X) the authors sucesfully used `resemble` to predict soil organic carbon content at +* 2016-05: In [this scientific paper](https://www.sciencedirect.com/science/article/pii/S001670611630180X) the authors successfully used `resemble` to predict soil organic carbon content at national scale in France. (Clairotte, M., Grinand, C., Kouakoua, E., Thébault, A., Saby, N. P., Bernoux, M., & Barthès, B. G. (2016). National calibration of soil organic carbon concentration using diffuse infrared reflectance spectroscopy. Geoderma, 276, 41-52). -* 2016-04: [This paper](http://www.mdpi.com/2072-4292/8/4/341) shows some +* 2016-04: [This paper](https://www.mdpi.com/2072-4292/8/4/341) shows some interesting results on applying memory-based learning to predict soil properties. * 2016-04: In some recent entries of [this blog](http://nir-quimiometria.blogspot.com/), @@ -273,7 +288,7 @@ prospectr package) is used. * 2014-03: The package was released on CRAN! ## Other R'elated stuff -* [Check our other project called `prospectr`.](http://github.com/l-ramirez-lopez/prospectr) +* [Check our other project called `prospectr`.](https://github.com/l-ramirez-lopez/prospectr) * [Check this presentation in which we used the resemble package to predict soil attributes from large scale soil spectral libraries.](http://www.fao.org/fileadmin/user_upload/GSP/docs/Spectroscopy_dec13/SSW2013_f.pdf) ## Bug report and development version diff --git a/cran-comments.md b/cran-comments.md new file mode 100644 index 0000000..6c98b60 --- /dev/null +++ b/cran-comments.md @@ -0,0 +1,22 @@ +# resemble + +# checks for release of `resemble 2.0.0` (`gordillo`) + +29.10.2020 +As requested by CRAN: +- The length of the title is now below + 65 characters +- A DOI link has been added in the description field of DESCRIPTION +- \donttest{} is now used (instead of \dontrun{}) for those examples + taking more than 5 seconds +- verbose argument has been added to the functions to easily suppress any + messages other than errors and warnings. +- on.exit() is now called properly to reset user + parameters when the functions are exited +- User's options() are reset in the examples in the vignette + that require changes in those options() +- Examples are now explicitly using a maximum of two cores (if + available). + +The package has been checked on multiple platforms using rhub::check(), as well +as in the win-builders provided by CRAN.
\ No newline at end of file diff --git a/inst/CITATION b/inst/CITATION index 3674853..21de3f3 100644 --- a/inst/CITATION +++ b/inst/CITATION @@ -1,17 +1,29 @@ -citHeader("To cite package 'resemble' in publications use:") - +citHeader("To cite resemble in publications use:") + year <- sub(".*(2[[:digit:]]{3})-.*", "\\1", meta$Date, perl = TRUE) vers <- paste("R package version", meta$Version) - -citEntry(entry="Manual", - title = "resemble: Regression and similarity evaluation for memory-based learning in spectral chemometrics", - author = personList(as.person("Leornardo Ramirez-Lopez"), - as.person("Antoine Stevens")), + +citEntry( + entry = "Manual", + title = "resemble: Regression and similarity evaluation for memory-based learning in spectral chemometrics. ", + author = personList(as.person("Leonardo Ramirez-Lopez"), + as.person("Antoine Stevens"), + as.person("Raphael Viscarra Rossel"), + as.person("Craig Lobsey"), + as.person("Alex Wadoux"), + as.person("Timo Breure")), + publication = "R package Vignette", year = year, note = vers, - - textVersion = - paste("Leornardo Ramirez-Lopez and Antoine Stevens (", - year, - "). resemble: Regression and similarity evaluation for memory-based learning in spectral chemometrics", - vers, ".", sep="")) + url = "https://CRAN.R-project.org/package=resemble", + textVersion = paste0("Ramirez-Lopez, L., ", + "and Stevens, A., ", + "and Viscarra Rossel, R., ", + "and Lobsey, C., ", + "and Wadoux, A., ", + "and Breure, T. ", + "(", year, "). ", + "resemble: Regression and similarity evaluation for memory-based learning in spectral chemometrics. R package Vignette ", + vers, + ".") +) diff --git a/inst/changes.md b/inst/changes.md index 2e10107..5a610c8 100644 --- a/inst/changes.md +++ b/inst/changes.md @@ -1,13 +1,12 @@ # Changes implemented for version 2.0 -In general, the code has been reformatted for compliying with Google and -tidyverse style guideleines. Here the major changes are listed. -### Main new fucntions -- search_neighbors -- dissimilarity +### New features +- search_neighbors() function +- dissimilarity() function +## Breaking changes ### orthoProjection, pcProjection, plsProjection (renamed to ortho_projection, pc_projection, pls_projection respectively): diff --git a/man/cor_diss.Rd b/man/cor_diss.Rd index 83d5934..0329511 100644 --- a/man/cor_diss.Rd +++ b/man/cor_diss.Rd @@ -30,18 +30,17 @@ a matrix of the computed dissimilarities. } \description{ \loadmathjax - -\lifecycle{stable} +\ifelse{html}{\out{Stable lifecycle}}{\strong{Stable}} Computes correlation and moving correlation dissimilarity matrices. } \details{ The correlation dissimilarity \mjeqn{d}{d} between two observations -\mjeqn{x_i}{x_i} and \mjeqn{x_j}{x_j} is based on the Perason's -correlation coefficient (\mjeqn{\rho}{\rho}) and it can be computed as +\mjeqn{x_i}{x_i} and \mjeqn{x_j}{x_j} is based on Pearson's +correlation coefficient (\mjeqn{\rho}{\rho}) and it can be computed as follows: -\mjdeqn{d(x_i, x_j) = \frac{1}{2}(1 - \rho(x_i, x_j))}{d(x_i, x_j) = 1/2 (1 - \rho(x_i, x_j))} +\mjdeqn{d(x_i, x_j) = \frac{1}{2}((1 - \rho(x_i, x_j)))}{d(x_i, x_j) = 1/2 (1 - \rho(x_i, x_j))} The above formula is used when \code{ws = NULL}. On the other hand (when \code{ws != NULL}) the moving correlation @@ -57,7 +56,7 @@ variables of the observations. The function does not accept input data containing missing values.
} \examples{ -\dontrun{ +\donttest{ library(prospectr) data(NIRsoil) diff --git a/man/dissimilarity.Rd b/man/dissimilarity.Rd index c45ac44..74290f3 100644 --- a/man/dissimilarity.Rd +++ b/man/dissimilarity.Rd @@ -81,30 +81,34 @@ method). The methods available are: (of a set of observations) is the one for which its distance matrix minimizes the differences between the \code{Yr} value of each observation and the \code{Yr} value of its closest observation. In this - case \code{value} must be a value (larger than 0 and - below \code{min(nrow(Xr), nrow(Xu), ncol(Xr))}) indicating the maximum + case \code{value} must be a value (larger than 0 and + below the minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} + combined) indicating the maximum number of principal components to be tested. See the \code{\link{ortho_projection}} function for more details.} \item{\code{"cumvar"}:}{ selection of the principal components based on a given cumulative amount of explained variance. In this case, \code{value} must be a value (larger than 0 and below or equal to 1) - indicating the minimum amount of cumulative variance that the + indicating the minimum amount of cumulative variance that the combination of retained components should explain.} \item{\code{"var"}:}{ selection of the principal components based on a given amount of explained variance. In this case, \code{value} must be a value (larger than 0 and below or equal to 1) - indicating the minimum amount of variance that a single component + indicating the minimum amount of variance that a single component should explain in order to be retained.} \item{\code{"manual"}:}{ for manually specifying a fixed number of principal components. In this case, \code{value} must be a value - (larger than 0 and \code{min(nrow(Xr), nrow(Xu), ncol(Xr))}). + (larger than 0 and + below the minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} + combined) indicating the number of principal components to be retained.} } -The default list passed is \code{list(method = "var", value = 0.01)}. +The default is \code{list(method = "var", value = 0.01)}. + Optionally, the \code{pc_selection} argument admits \code{"opc"} or \code{"cumvar"} or \code{"var"} or \code{"manual"} as a single character string. In such a case the default \code{"value"} when either \code{"opc"} or @@ -155,6 +159,7 @@ A list with the following components: \item{\code{projection}:}{ an \code{ortho_projection} object. Only output if \code{return_projection = TRUE} and if \code{diss_method = "pca"}, \code{diss_method = "pca.nipals"} or \code{diss_method = "pls"}. + This object contains the projection used to compute the dissimilarity matrix. In case of local dissimilarity matrices, the projection corresponds to the global projection used to select the @@ -162,14 +167,12 @@ details).} \item{\code{gh}:}{ a list containing the GH distances as well as the - pls projection used (\code{ortho_projection} object) to compute the GH.} + pls projection used to compute the GH.} } } \description{ \loadmathjax -\lifecycle{maturing} - This is a wrapper that integrates the different dissimilarity functions offered by the package. It computes the dissimilarities between observations in numerical matrices by using a specified dissimilarity measure. @@ -180,12 +183,12 @@ This function is a wrapper for \code{\link{ortho_diss}}, \code{\link{cor_diss}}, \code{\link{f_diss}} and \code{\link{sid}}. Check those functions for further details.
} \examples{ -\dontrun{ library(prospectr) data(NIRsoil) -# Filter the data using the first derivative with Savitzky and Golay smoothing -filter and a window size of 11 spectral variables and a polynomial order of 4 +# Filter the data using the first derivative with Savitzky and Golay +# smoothing filter and a window size of 15 spectral variables and a +# polynomial order of 4 sg <- savitzkyGolay(NIRsoil$spc, m = 1, p = 4, w = 15) # Replace the original spectra with the filtered ones @@ -211,7 +214,6 @@ dsm_pca <- dissimilarity( return_projection = TRUE ) } -} \seealso{ \code{\link{ortho_diss}} \code{\link{cor_diss}} \code{\link{f_diss}} \code{\link{sid}}. diff --git a/man/f_diss.Rd b/man/f_diss.Rd index 26d0243..2cd69a2 100644 --- a/man/f_diss.Rd +++ b/man/f_diss.Rd @@ -13,7 +13,8 @@ f_diss(Xr, Xu = NULL, diss_method = "euclid", \item{Xu}{an optional matrix containing data of a second set of observations (samples).} -\item{diss_method}{the method for computing the dissimilarity matrix. +\item{diss_method}{the method for computing the dissimilarity between +observations. Options are \code{"euclid"} (Euclidean distance), \code{"mahalanobis"} (Mahalanobis distance) and \code{"cosine"} (cosine distance, a.k.a spectral angle mapper). See details.} @@ -31,8 +32,7 @@ a matrix of the computed dissimilarities. } \description{ \loadmathjax - -\lifecycle{stable} +\ifelse{html}{\out{Stable lifecycle}}{\strong{Stable}} This function is used to compute the dissimilarity between observations based on Euclidean or Mahalanobis distance measures or on cosine @@ -85,7 +85,7 @@ NOTE: The computed distances are divided by the number of variables/columns in \code{Xr}. } \examples{ -\dontrun{ +\donttest{ library(prospectr) data(NIRsoil) diff --git a/man/fastDistVVL.Rd b/man/fast_diss_vector.Rd similarity index 50% rename from man/fastDistVVL.Rd rename to man/fast_diss_vector.Rd index 0872fb3..d493510 100644 --- a/man/fastDistVVL.Rd +++ b/man/fast_diss_vector.Rd @@ -1,24 +1,24 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/RcppExports.R -\name{fastDistVVL} -\alias{fastDistVVL} -\title{A fast (serial) algorithm of Euclidean (non-squared) cross-distance for vectors written in C++} \usage{ -fastDistVVL(X) +fast_diss_vector(X) } \arguments{ -\item{X}{a vector} +\item{X}{a vector.} } \value{ a vector of distances (lower triangle of the distance matrix, stored by column) } \description{ -A fast algorithm of (squared) Euclidean cross-distance for vectors written in C++ +A fast (parallel for Linux) algorithm of (squared) Euclidean cross-distance for vectors written in C++ } \details{ used internally in ortho_projection } \author{ -Leo Ramirez-Lopez +Antoine Stevens } \keyword{internal} diff --git a/man/figures/logo.png b/man/figures/logo.png index cc4a2b4..921c3fe 100644 Binary files a/man/figures/logo.png and b/man/figures/logo.png differ diff --git a/vignettes/logo.png b/man/figures/logo_large.png similarity index 100% rename from vignettes/logo.png rename to man/figures/logo_large.png diff --git a/man/get_predictions.Rd b/man/get_predictions.Rd index a08634f..4bc05a8 100644 --- a/man/get_predictions.Rd +++ b/man/get_predictions.Rd @@ -13,7 +13,7 @@ get_predictions(object) a data.table of predicted values according to either \code{k} or \code{k_dist} } \description{ -\lifecycle{stable} +\ifelse{html}{\out{Stable lifecycle}}{\strong{Stable}}
Extract predictions from an object of class \code{mbl} } diff --git a/man/local_fit.Rd b/man/local_fit.Rd index 99db434..5178b58 100644 --- a/man/local_fit.Rd +++ b/man/local_fit.Rd @@ -34,8 +34,6 @@ for Gaussian process local regressions (\code{local_fit_gpr}). Default is An object of class \code{local_fit} mirroring the input arguments. } \description{ -\lifecycle{maturing} - \loadmathjax These functions define the way in which each local fit/prediction is done within each iteration in the \code{\link{mbl}} function. @@ -91,10 +89,8 @@ There are three possible options for performing these regressions: } } \examples{ -\dontrun{ local_fit_wapls(min_pls_c = 3, max_pls_c = 12) } -} \references{ Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared diff --git a/man/mbl.Rd b/man/mbl.Rd index 1217774..1707ed8 100644 --- a/man/mbl.Rd +++ b/man/mbl.Rd @@ -5,13 +5,12 @@ \title{A function for memory-based learning (mbl)} \usage{ mbl(Xr, Yr, Xu, Yu = NULL, k, k_diss, k_range, spike = NULL, - method = local_fit_wapls(min_pls_c = 3, - max_pls_c = min(dim(Xr), 15)), + method = local_fit_wapls(min_pls_c = 3, max_pls_c = min(dim(Xr), 15)), diss_method = "pca", diss_usage = "predictors", - gh = TRUE, pc_selection = list(method = "opc", - value = min(dim(Xr), 40)), + gh = TRUE, pc_selection = list(method = "opc", value = min(dim(Xr), 40)), control = mbl_control(), group = NULL, - center = TRUE, scale = FALSE, documentation = character(), ...) + center = TRUE, scale = FALSE, verbose = TRUE, + documentation = character(), ...) } \arguments{ \item{Xr}{a matrix of predictor variables of the reference data @@ -45,8 +44,11 @@ retained when the \code{k_diss} is given.} \item{spike}{an integer vector indicating the indices of observations in \code{Xr} that must be forced into the neighborhoods of every \code{Xu} -observation. Default is \code{NULL} (i.e. no observations are forced). -See details.} +observation. Default is \code{NULL} (i.e. no observations are forced). Note +that this argument is not intended for increasing the neighborhood size which +is only controlled by \code{k} or \code{k_diss} and \code{k_range}. By +forcing observations into the neighborhood, some observations will be forced +out of the neighborhood. See details.} \item{method}{an object of class \code{\link{local_fit}} which indicates the type of regression to conduct at each local segment as well as additional @@ -70,8 +72,8 @@ observation. Options are: \item{\code{"pls"}}{ Mahalanobis distance computed on the matrix of scores of a partial least squares projection - of \code{Xr} and \code{Xu}. In this case, \code{Yr} is always required. - See \code{\link{ortho_diss}} function.} + of \code{Xr} and \code{Xu}. In this case, \code{Yr} is always + required. See \code{\link{ortho_diss}} function.} \item{\code{"cor"}}{ correlation coefficient between observations. See \code{\link{cor_diss}} function.} @@ -121,35 +123,38 @@ This list must contain two elements in the following order: components) and \code{value} (a numerical value that complements the selected method). The methods available are: \itemize{ - \item{\code{"opc"}:} { optimized principal component selection based on - Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components - (of set of observations) is the one for which its distance matrix - minimizes the differences between the \code{Yr} value of each - observation and the \code{Yr} value of its closest observation. 
In this - case \code{value} must be a value (larger than 0 and - below \code{min(nrow(Xr), nrow(X2), ncol(Xr))}) indicating the maximum + \item{\code{"opc"}:} { optimized principal component selection based + on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of + components (of a set of observations) is the one for which its distance + matrix minimizes the differences between the \code{Yr} value of each + observation and the \code{Yr} value of its closest observation. In + this case \code{value} must be a value (larger than 0 and + below the minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} + combined) indicating the maximum number of principal components to be tested. See the \code{\link{ortho_projection}} function for more details.} \item{\code{"cumvar"}:}{ selection of the principal components based on a given cumulative amount of explained variance. In this case, \code{value} must be a value (larger than 0 and below or equal to 1) - indicating the minimum amount of cumulative variance that the + indicating the minimum amount of cumulative variance that the combination of retained components should explain.} \item{\code{"var"}:}{ selection of the principal components based on a given amount of explained variance. In this case, \code{value} must be a value (larger than 0 and below or equal to 1) - indicating the minimum amount of variance that a single component + indicating the minimum amount of variance that a single component should explain in order to be retained.} \item{\code{"manual"}:}{ for manually specifying a fixed number of principal components. In this case, \code{value} must be a value - (larger than 0 and \code{min(nrow(Xr), nrow(X2), ncol(Xr))}). + (larger than 0 and below the minimum dimension of \code{Xr} or + \code{Xr} and \code{Xu} combined) indicating the number of principal components to be retained.} } -The default list passed is \code{list(method = "opc", value = min(dim(Xr), 40))}. +The list +\code{list(method = "opc", value = min(dim(Xr), 40))} is the default. Optionally, the \code{pc_selection} argument admits \code{"opc"} or \code{"cumvar"} or \code{"var"} or \code{"manual"} as a single character string. In such a case the default \code{"value"} when either \code{"opc"} or @@ -167,8 +172,8 @@ See the \code{\link{mbl_control}} function for more details.} that can be coerced to \code{\link[base]{factor}} by \code{as.factor}) that assigns a group/class label to each observation in \code{Xr} (e.g. groups can be given by spectra collected from the same batch of -measurements, from the same observation, from observations with very similar origin, -etc). This is taken into account for internal leave-group-out cross +measurements, from the same observation, from observations with very similar +origin, etc). This is taken into account for internal leave-group-out cross validation for pls tuning (factor optimization) to avoid pseudo-replication. When one observation is selected for cross-validation, all observations of the same group are removed together and assigned to validation. The length @@ -184,6 +189,10 @@ to unit variance at each local segment (before regression). In addition, if \code{TRUE}, \code{Xr} and \code{Xu} will be scaled for dissimilarity computations.} +\item{verbose}{a logical indicating whether or not to print a progress bar +for each observation to be predicted. Default is \code{TRUE}.
Note: In case +parallel processing is used, these progress bars will not be printed.} + +\item{documentation}{an optional character string that can be used to describe anything related to the \code{mbl} call (e.g. description of the input data). Default: \code{character()}. NOTE: this is an experimental @@ -215,11 +224,11 @@ a \code{list} of class \code{mbl} with the following components as well as the results of the global pls projection object used to obtain the GH values.} \item{\code{validation_results}:}{ a list of validation results for - \code{local_cross_validation} (returned if the \code{validation_type} in \code{control} - list was set to \code{"local_cv"}), - \code{nearest_neighbor_validation} (returned if the \code{validation_type} in \code{control} - list was set to \code{"NNv"}) and - \code{Yu_prediction_statistics} (returned if \code{Yu} was supplied).} + "local cross validation" (returned if the \code{validation_type} in + \code{control} list was set to \code{"local_cv"}), + "nearest neighbor validation" (returned if the \code{validation_type} + in \code{control} list was set to \code{"NNv"}) and + "Yu prediction statistics" (returned if \code{Yu} was supplied).} \item{\code{results}:}{ a list of data tables containing the results of the predictions for each \code{k} or \code{k_diss}. Each data table contains the following columns:} @@ -231,9 +240,11 @@ \item{\code{k_original}:}{ This column is only output if the \code{k_diss} argument is used. It indicates the number of neighbors that were originally found when the given dissimilarity threshold is used.} - \item{\code{k}:}{ This column indicates the final number of neighbors used.} - \item{\code{npls}:}{ This column is only output if the \code{pls} regression - method was used. It indicates the final number of pls components used.} + \item{\code{k}:}{ This column indicates the final number of neighbors + used.} + \item{\code{npls}:}{ This column is only output if the \code{pls} + regression method was used. It indicates the final number of pls + components used.} \item{\code{min_pls}:}{ This column is only output if \code{wapls} regression method was used. It indicates the final number of minimum pls components used. If no optimization was set, it retrieves the original @@ -252,33 +263,36 @@ a \code{list} of class \code{mbl} with the following components variable) in the neighborhood.} \item{\code{index_nearest_in_Xr}}{ The index of the nearest neighbor found in \code{Xr}.} - \item{\code{index_farthest_in_Xr}}{ The index of the farthest neighbor found - in \code{Xr}.} + \item{\code{index_farthest_in_Xr}}{ The index of the farthest neighbor + found in \code{Xr}.} \item{\code{y_nearest}:}{ The reference value (\code{Yr}) corresponding to the nearest neighbor found in \code{Xr}.} - \item{\code{y_nearest_pred}:}{ This column is only output if the validation - method in the object passed to \code{control} was set to \code{"NNv"}. - It represents the predicted value of the nearest neighbor observation found - in \code{Xr}. This prediction come from model fitted with the remaining - observations in the neighborhood of the target observation in \code{Xu}.} + \item{\code{y_nearest_pred}:}{ This column is only output if the + validation method in the object passed to \code{control} was set to + \code{"NNv"}. It represents the predicted value of the nearest neighbor + observation found in \code{Xr}.
This prediction comes from a model fitted + with the remaining observations in the neighborhood of the target + observation in \code{Xu}.} \item{\code{loc_rmse_cv}:}{ This column is only output if the validation method in the object passed to \code{control} was set to \code{'local_cv'}. It represents the RMSE of the cross-validation computed for the neighborhood of the target observation in \code{Xu}.} - \item{\code{loc_st_rmse_cv}:}{ This column is only output if the validation - method in the object passed to \code{control} was set to + \item{\code{loc_st_rmse_cv}:}{ This column is only output if the + validation method in the object passed to \code{control} was set to \code{'local_cv'}. It represents the standardized RMSE of the - cross-validation computed for the neighborhood of the target observation in - \code{Xu}.} + cross-validation computed for the neighborhood of the target observation + in \code{Xu}.} \item{\code{dist_nearest}:}{ The distance to the nearest neighbor.} \item{\code{dist_farthest}:}{ The distance to the farthest neighbor.} \item{\code{loc_n_components}:}{ This column is only output if the dissimilarity method used is one of \code{"pca"}, \code{"pca.nipals"} or \code{"pls"} and in addition the dissimilarities are requested to be - computed locally by passing \code{.local = TRUE} to the \code{mbl} function. + computed locally by passing \code{.local = TRUE} to the \code{mbl} + function. See \code{.local} argument in the \code{\link{ortho_diss}} function.} } - \item{\code{documentation}}{ A character string with the documentation added.} + \item{\code{documentation}}{ A character string with the documentation + added.} } When the \code{k_diss} argument is used, the printed results show a table with a column named \code{p_bounded}. It represents the percentage of @@ -287,8 +301,6 @@ threshold were outside the boundaries specified in the \code{k_range} argument. } \description{ -\lifecycle{maturing} - \loadmathjax This function is implemented for memory-based learning (a.k.a. instance-based learning or local regression) which is a non-linear lazy @@ -303,7 +315,10 @@ regression model. \details{ The argument \code{spike} can be used to indicate what reference observations in \code{Xr} must be kept in the neighborhood of every single \code{Xu} -observation. Spiking might be useful in cases where +observation. If a vector of length \mjeqn{m}{m} is passed to this argument, +this means that the \mjeqn{m}{m} original neighbors with the largest +dissimilarities to the target observations will be forced out of the +neighborhood. Spiking might be useful in cases where some reference observations are known to be somehow related to the ones in \code{Xu} and therefore might be relevant for fitting the local models. See Guerrero et al. (2010) for an example on the benefits of spiking. @@ -324,14 +339,14 @@ source of additional predictors (i.e. the columns of this local matrix are treated as predictor variables). In some cases this results in an improvement of the prediction performance (Ramirez-Lopez et al., 2013a). If \code{diss_usage = "weights"}, the neighbors of the query point -(\mjeqn{xu_{j}}{xu_j}) are weighted according to their dissimilarity to \mjeqn{xu_{j}}{xu_j} -before carrying out each local regression.
The following tricubic function -(Cleveland and Delvin, 1988; Naes et al., 1990) is used for computing the -final weights based on the measured dissimilarities: +(\mjeqn{xu_{j}}{xu_j}) are weighted according to their dissimilarity to +\mjeqn{xu_{j}}{xu_j} before carrying out each local regression. The following +tricubic function (Cleveland and Devlin, 1988; Naes et al., 1990) is used for +computing the final weights based on the measured dissimilarities: \mjdeqn{W_{j} = (1 - v^{3})^{3}}{W_j = (1 - v^3)^3} -where if \mjeqn{{xr_{i} \in}}{xr_i in} neighbors of \mjeqn{xu_{j}}{xu_j}: +where if \mjeqn{{xr_{i} \in }}{xr_i in} neighbors of \mjeqn{xu_{j}}{xu_j}: \mjdeqn{v_{j}(xu_{j}) = d(xr_{i}, xu_{j})}{v_j(xu_j) = d(xr_i, xu_j)} @@ -358,18 +373,18 @@ parameter tuning, what extra objects to return, permission for parallel execution, prediction limits, etc., can be specified by using the \code{\link{mbl_control}} function. -By using the \code{group} argument one can specify groups of observations that -have something in common (e.g. observations with very similar origin). +By using the \code{group} argument one can specify groups of observations +that have something in common (e.g. observations with very similar origin). The purpose of \code{group} is to avoid biased cross-validation results due to pseudo-replication. This argument allows selecting calibration points that are independent from the validation ones. In this regard, when -\code{validation_type = "local_cv"} (used in \code{\link{mbl_control}} function), -then the \code{p} argument refers to the percentage of groups of observations -(rather than single observations) to be retained in each sampling iteration at -each local segment. +\code{validation_type = "local_cv"} (used in \code{\link{mbl_control}} +function), then the \code{p} argument refers to the percentage of groups of +observations (rather than single observations) to be retained in each +sampling iteration at each local segment. } \examples{ -\dontrun{ +\donttest{ library(prospectr) data(NIRsoil) @@ -377,7 +392,8 @@ data(NIRsoil) # Golay smoothing filter sg_det <- savitzkyGolay( detrend(NIRsoil$spc, - wav = as.numeric(colnames(NIRsoil$spc))), + wav = as.numeric(colnames(NIRsoil$spc)) + ), m = 1, p = 1, w = 7 @@ -386,11 +402,11 @@ sg_det <- savitzkyGolay( NIRsoil$spc_pr <- sg_det # split into training and testing sets -test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC),] +test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ] test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)] train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)] -train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC),] +train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ] # Example 1 # A mbl implemented in Ramirez-Lopez et al. (2013, @@ -466,11 +482,19 @@ plot(local_algorithm_2) # Example 4 # Running the mbl function in parallel with example 2 + +n_cores <- 2 + +if (parallel::detectCores() < 2) { + n_cores <- 1 } +# Alternatively: +# n_cores <- parallel::detectCores() - 1 +# if (n_cores == 0) { +# n_cores <- 1 +# } + -n_cores <- parallel::detectCores() - 1 -if (n_cores == 0) { - n_cores <- 1 + library(doParallel) clust <- makeCluster(n_cores) registerDoParallel(clust) @@ -522,8 +546,8 @@ approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, 596-610. Guerrero, C., Zornoza, R., Gómez, I., Mataix-Beneyto, J. 2010.
Spiking of -NIR regional models using observations from target sites: Effect of model size on -prediction accuracy. Geoderma, 158(1-2), 66-77. +NIR regional models using observations from target sites: Effect of model +size on prediction accuracy. Geoderma, 158(1-2), 66-77. Naes, T., Isaksson, T., Kowalski, B. 1990. Locally weighted regression and scatter correction for near-infrared reflectance data. Analytical Chemistry @@ -531,7 +555,8 @@ scatter correction for near-infrared reflectance data. Analytical Chemistry Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for -modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279. +modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, +268-279. Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for @@ -550,5 +575,6 @@ Infrared Spectroscopy, 5, 223-232. \code{\link{search_neighbors}} } \author{ -\href{https://orcid.org/0000-0002-5369-5120}{Leonardo Ramirez-Lopez} and Antoine Stevens +\href{https://orcid.org/0000-0002-5369-5120}{Leonardo Ramirez-Lopez} +and Antoine Stevens } diff --git a/man/mbl_control.Rd b/man/mbl_control.Rd index c89c202..494f02b 100644 --- a/man/mbl_control.Rd +++ b/man/mbl_control.Rd @@ -11,7 +11,6 @@ mbl_control(return_dissimilarity = FALSE, number = 10, p = 0.75, range_prediction_limits = TRUE, - progress = TRUE, allow_parallel = TRUE) } \arguments{ @@ -34,7 +33,7 @@ for the "wapls" method) are optimized. Default is \code{TRUE}.} each local segment when \code{"local_cv"} is selected in the \code{validation_type} argument. Default is 10.} -\item{p}{a numeric value indicating the percentage of calibration observations +\item{p}{a numeric value indicating the percentage of calibration observations to be retained at each sampling iteration at each local segment when \code{"local_cv"} is selected in the \code{validation_type} argument. Default is 0.75 (i.e. 75\%).} @@ -45,23 +44,17 @@ this range, it will be automatically replaced with the value of the nearest range value. If \code{FALSE}, no prediction limits are imposed. Default is \code{TRUE}.} -\item{progress}{a logical indicating whether or not to print a progress bar -for each observation to be predicted. Default is \code{TRUE}. Note: In case -parallel processing is used, these progress bars will not be printed.} - \item{allow_parallel}{a logical indicating if parallel execution is allowed. If \code{TRUE}, this parallelism is applied to the loop in \code{\link{mbl}} in which each iteration takes care of a single observation in \code{Xu}. The parallelization of this for loop is implemented using the -\link[foreach]{foreach} function of the \code{\link{foreach}} package. +\code{\link[foreach]{foreach}} function of the package \code{foreach}. Default is \code{TRUE}.} } \value{ a \code{list} mirroring the specified parameters } \description{ -\lifecycle{maturing} - \loadmathjax This function is used to further control some aspects of the memory-based learning process in the \code{mbl} function. @@ -81,20 +74,20 @@ the memory-based learning method used are described as follows: \item{Local leave-group-out cross-validation (\code{"local_cv"}):}{ The group of neighbors of each observation to be predicted is partitioned into different equal size subsets.
Each partition is selected based on a - stratified random sampling that uses the the distribution of - the response variable in the corresponding set of neighbors. When - \code{p} \mjeqn{>=}{\geqslant} 0.5 (i.e. the number of calibration - observations to retain is larger than 50% of the total samples in the neighborhood), - the sampling is conducted for selecting the validation samples, and when - \code{p} < 0.5 the sampling is conducted for selecting the calibration - samples (samples used for model fitting). The model fitted with the selected - calibration samples is used to predict the response values of the local - validation samples and the local root mean square error is computed. - This process is repeated \mjeqn{m}{m} times and the final local - error is computed as the average of the local root mean square errors - obtained for all the \mjeqn{m}{m} iterations. In the \code{mbl_control} function - \mjeqn{m}{m} is controlled by the \code{number} argument and the size of the - subsets is controlled by the \code{p} argument which indicates the + stratified random sampling that uses the distribution of + the response variable in the corresponding set of neighbors. When + \code{p} \mjeqn{\geqslant}{>=} 0.5 (i.e. the number of calibration + observations to retain is larger than 50% of the total samples in the neighborhood), + the sampling is conducted for selecting the validation samples, and when + \code{p} < 0.5 the sampling is conducted for selecting the calibration + samples (samples used for model fitting). The model fitted with the selected + calibration samples is used to predict the response values of the local + validation samples and the local root mean square error is computed. + This process is repeated \mjeqn{m}{m} times and the final local + error is computed as the average of the local root mean square errors + obtained for all the \mjeqn{m}{m} iterations. In the \code{mbl_control} function + \mjeqn{m}{m} is controlled by the \code{number} argument and the size of the + subsets is controlled by the \code{p} argument which indicates the percentage of observations to be selected from the subset of nearest neighbors. The global error of the predictions is computed as the average of the local root mean square errors.} diff --git a/man/ortho_diss.Rd b/man/ortho_diss.Rd index 1c8eea4..ddb1eb7 100644 --- a/man/ortho_diss.Rd +++ b/man/ortho_diss.Rd @@ -18,8 +18,8 @@ ortho_diss(Xr, Xu = NULL, allow_parallel = TRUE, ...) } \arguments{ -\item{Xr}{a matrix containing \code{n} reference observations/rows and -\code{p} variables/columns.} +\item{Xr}{a matrix containing \code{n} reference observations/rows and +\code{p} variables/columns.} \item{Xu}{an optional matrix containing data of a second set of observations with \code{p} variables/columns.} @@ -60,22 +60,25 @@ method). The methods available are: \item{\code{"cumvar"}:}{ selection of the principal components based on a given cumulative amount of explained variance. In this case, \code{value} must be a value (larger than 0 and below or equal to 1) - indicating the minimum amount of cumulative variance that the + indicating the minimum amount of cumulative variance that the combination of retained components should explain.} \item{\code{"var"}:}{ selection of the principal components based on a given amount of explained variance.
In this case, \code{value} must be a value (larger than 0 and below or equal to 1) - indicating the minimum amount of variance that a single component + indicating the minimum amount of variance that a single component should explain in order to be retained.} \item{\code{"manual"}:}{ for manually specifying a fixed number of principal components. In this case, \code{value} must be a value - (larger than 0 and below \code{min(nrow(Xr)} \code{+ nrow(Xu),} \code{ncol(Xr))}). + (larger than 0 and + below the minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} + combined) indicating the number of principal components to be retained.} } -The default list passed is \code{list(method = "var", value = 0.01)}. +Default is \code{list(method = "var", value = 0.01)}. + Optionally, the \code{pc_selection} argument admits \code{"opc"} or \code{"cumvar"} or \code{"var"} or \code{"manual"} as a single character string. In such case, the default \code{"value"} when either \code{"opc"} or @@ -156,8 +159,6 @@ a \code{list} of class \code{ortho_diss} with the following elements: } } \description{ -\lifecycle{maturing} - \loadmathjax This function computes dissimilarities (in an orthogonal space) between either observations in a given set or between observations in two different @@ -188,13 +189,12 @@ if \code{.local = TRUE}, care must be taken as some neighborhoods might not have enough observations with non-missing \code{"Yr"} values, which might retrieve unreliable dissimilarity computations. -If \code{.local = TRUE} and \code{pc_selection$method} is \code{"opc"} or -\code{"manual"}, the minimum number of observations with non-missing \code{"Yr"} -values at each neighborhood is determined by \code{pc_selection$value} -(i.e. the maximum number of components to compute). +If \code{"opc"} or \code{"manual"} are used in \code{pc_selection$method} +and \code{.local = TRUE}, the minimum number of observations with non-missing +\code{"Yr"} values at each neighborhood is determined by +\code{pc_selection$value} (i.e. the maximum number of components to compute). } \examples{ -\dontrun{ library(prospectr) data(NIRsoil) @@ -231,7 +231,6 @@ pls_diss_optim <- ortho_diss( diss_method = "pls" ) } -} \references{ Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for diff --git a/man/ortho_projection.Rd b/man/ortho_projection.Rd index bc8b259..48ecf55 100644 --- a/man/ortho_projection.Rd +++ b/man/ortho_projection.Rd @@ -43,10 +43,14 @@ and their corresponding most similar observations in terms of the side information provided. A single discrete variable of class factor can also be passed. In that case, the kappa index is used. See \code{\link{sim_eval}} function for more details.} -\item{method}{the method for projecting the data. Options are: "pca" (principal -component analysis using the singular value decomposition algorithm), -"pca.nipals" (principal component analysis using the non-linear iterative -partial least squares algorithm) and "pls" (partial least squares).} +\item{method}{the method for projecting the data.
Options are: +\itemize{ +\item{\code{"pca"}:}{ principal component analysis using the singular value +decomposition algorithm.} +\item{\code{"pca.nipals"}:}{ principal component analysis using the +non-linear iterative partial least squares algorithm.} +\item{\code{"pls"}:}{ partial least squares.} +}} \item{pc_selection}{a list of length 2 which specifies the method to be used for optimizing the number of components (principal components or pls factors) @@ -61,13 +65,13 @@ method). The methods available are: minimizes the differences between the \code{Yr} value of each observation and the \code{Yr} value of its closest observation. In this case \code{value} must be a value (larger than 0 and - below \code{min(nrow(Xr)} \code{+ nrow(Xu),} \code{ncol(Xr))} indicating + below \code{min(nrow(Xr)} \code{+ nrow(Xu),} \code{ncol(Xr))} indicating the maximum number of principal components to be tested. See details.} \item{\code{"cumvar"}:}{ selection of the principal components based on a given cumulative amount of explained variance. In this case, \code{value} must be a value (larger than 0 and below or equal to 1) - indicating the minimum amount of cumulative variance that the + indicating the minimum amount of cumulative variance that the combination of retained components should explain.} \item{\code{"var"}:}{ selection of the principal components based @@ -78,11 +82,13 @@ method). The methods available are: \item{\code{"manual"}:}{ for manually specifying a fixed number of principal components. In this case, \code{value} must be a value - (larger than 0 and below \code{min(nrow(Xr)} \code{+ nrow(Xu),} \code{ncol(Xr))}). + (larger than 0 and + below the minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} + combined) indicating the number of principal components to be retained.} } -The default list passed is \code{list(method = "var", value = 0.01)}. +The list \code{list(method = "var", value = 0.01)} is the default. Optionally, the \code{pc_selection} argument admits \code{"opc"} or \code{"cumvar"} or \code{"var"} or \code{"manual"} as a single character string. In such a case the default \code{"value"} when either \code{"opc"} or @@ -100,7 +106,7 @@ centered.} must be scaled. If \code{Xu} is specified the data is scaled on the basis of \mjeqn{Xr \cup Xu}{Xr U Xu}.} -\item{...}{additional arguments to be passed from \code{ortho_projection} +\item{...}{additional arguments to be passed to \code{pc_projection} or \code{pls_projection}.} \item{tol}{tolerance limit for convergence of the algorithm in the nipals @@ -111,16 +117,14 @@ more than one variable.} \code{method = "pls"} this applies only to \code{Yr} matrices with more than one variable.} -\item{object}{object of class "ortho_projection" (as returned by -\code{ortho_projection}, \code{pc_projection} or \code{pls_projection}).} +\item{object}{object of class \code{"ortho_projection"}.} \item{newdata}{an optional data frame or matrix in which to look for variables with which to predict. If omitted, the scores are used. It must contain the same number of columns, to be used in the same order.} } \value{ -\code{ortho_projection}, \code{pc_projection}, \code{pls_projection}, -return a \code{list} of class \code{ortho_projection} with the following +a \code{list} of class \code{ortho_projection} with the following components: \itemize{ \item{\code{scores}}{ a matrix of scores corresponding to the observations in @@ -139,8 +143,8 @@ components: onto a "pls" space.
This object is only returned if the "pls" algorithm was used.} \item{\code{variance}}{ a matrix indicating the standard deviation of each - component (sd), the variance explained by each single component - (explained_var) and the cumulative explained variance + component (sd), the variance explained by each single component + (explained_var) and the cumulative explained variance (cumulative_explained_var). These values are computed based on the data used to create the projection matrices. For example if the "pls" method was used, then these values are computed @@ -176,9 +180,9 @@ using principal component analysis (pca) and partial least squares (pls). \details{ In the case of \code{method = "pca"}, the algorithm used is the singular value decomposition in which a given data matrix (\mjeqn{X}{X}) is factorized as follows: - + \mjdeqn{X = UDV^{T}}{X = UDV^{\mathrm{T}}} - + where \mjeqn{U}{U} and \mjeqn{V}{V} are orthogonal matrices, being the left and right singular vectors of \mjeqn{X}{X} respectively, \mjeqn{D}{D} is a diagonal matrix containing the singular values of \mjeqn{X}{X} and \mjeqn{V}{V} is a matrix of @@ -208,12 +212,9 @@ variables, or maximizes the kappa index in the case of categorical variables. In this process, the \code{\link{sim_eval}} function is used. Note that for the \code{"opc"} method \code{Yr} is required (i.e. the side information of the observations). - -This function supports multi-threading for the computation of dissimilarities -via OpenMP in Rcpp. } \examples{ -\dontrun{ +\donttest{ library(prospectr) data(NIRsoil) @@ -221,7 +222,8 @@ data(NIRsoil) # Golay smoothing filter sg_det <- savitzkyGolay( detrend(NIRsoil$spc, - wav = as.numeric(colnames(NIRsoil$spc))), + wav = as.numeric(colnames(NIRsoil$spc)) + ), m = 1, p = 1, w = 7 @@ -229,11 +231,11 @@ sg_det <- savitzkyGolay( NIRsoil$spc_pr <- sg_det # split into training and testing sets -test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC),] +test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ] test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)] train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)] -train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC),] +train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ] # A principal component analysis using 5 components pca_projected <- ortho_projection(train_x, pc_selection = list("manual", 5)) pca_projected # A principal components projection using the "opc" method # for the selection of the optimal number of components pca_projected_2 <- ortho_projection( - Xr = train_x, Xu = test_x, Yr, + Xr = train_x, Xu = test_x, Yr = train_y, method = "pca", pc_selection = list("opc", 40) ) plot(pca_projected_2) # A partial least squares projection using the "opc" method # for the selection of the optimal number of components pls_projected <- ortho_projection( - Xr = train_x, Xu = test_x, Yr, + Xr = train_x, Xu = test_x, Yr = train_y, method = "pls", pc_selection = list("opc", 40) ) plot(pls_projected) # A partial least squares projection using the "cumvar" method # for the selection of the optimal number of components pls_projected_2 <- ortho_projection( - Xr = train_x, Yr = train_y, Xu = test_x, + Xr = train_x, Xu = test_x, Yr = train_y, method = "pls", pc_selection = list("cumvar", 0.99) ) diff --git a/man/plot.mbl.Rd b/man/plot.mbl.Rd index 92c6abc..8f70bd3 100644 --- a/man/plot.mbl.Rd +++ b/man/plot.mbl.Rd @@ -11,12 +11,12 @@
\item{g}{a character vector indicating what results shall be plotted. Options are: \code{"validation"} (for plotting the validation results) and/or -\code{"gh"} (for plotting the pls scores used to compute the GH distance. +\code{"gh"} (for plotting the pls scores used to compute the GH distance. See details).} -\item{param}{a character string indicating what validation statistics shall be -plotted. The following options are available: \code{"rmse"}, \code{"st_rmse"} -or \code{"r2"}. These options only available if the \code{mbl} object contains +\item{param}{a character string indicating what validation statistics shall be +plotted. The following options are available: \code{"rmse"}, \code{"st_rmse"} +or \code{"r2"}. These options are only available if the \code{mbl} object contains validation results.} \item{pls_c}{a numeric vector of length one or two indicating the pls factors to be @@ -26,8 +26,6 @@ specified in the \code{g} argument.} \item{...}{some arguments to be passed to the plot methods.} } \description{ -\lifecycle{maturing} - Plots the content of an object of class \code{mbl} } \details{ @@ -38,7 +36,7 @@ its covariance matrix. The square root of this matrix is estimated using a singular value decomposition. } \examples{ -\dontrun{ +\donttest{ library(prospectr) data(NIRsoil) diff --git a/man/plot.ortho_projection.Rd b/man/plot.ortho_projection.Rd index dcba34d..1c6ae92 100644 --- a/man/plot.ortho_projection.Rd +++ b/man/plot.ortho_projection.Rd @@ -14,8 +14,6 @@ \item{...}{arguments to be passed to methods.} } \description{ -\lifecycle{maturing} - Plots objects of class \code{ortho_projection} } \seealso{ diff --git a/man/resemble-package.Rd b/man/resemble-package.Rd index d092840..0edad83 100644 --- a/man/resemble-package.Rd +++ b/man/resemble-package.Rd @@ -6,9 +6,10 @@ \alias{resemble} \title{Overview of the functions in the resemble package} \description{ -\lifecycle{maturing} +\ifelse{html}{\out{Maturing lifecycle}}{\strong{Maturing}} Functions for memory-based learning + \if{html}{\figure{logo.png}{options: align='right' alt='logo' width='120'}} } \details{ @@ -17,9 +18,10 @@ implements a number of \code{R} functions useful for modeling complex spectral data (e.g. NIR, IR). The package includes functions for dimensionality reduction, computing spectral dissimilarity matrices, nearest neighbor search, -and modeling spectral data using memory-based learning. +and modeling spectral data using memory-based learning. This package builds +upon the methods presented in Ramirez-Lopez et al. (2013). -Development versions can be found in the github repository of the package +Development versions can be found in the github repository of the package at \href{https://github.com/l-ramirez-lopez/resemble}{https://github.com/l-ramirez-lopez/resemble}. The functions available for dimensionality reduction are: @@ -56,16 +58,34 @@ Other supplementary functions: \item{\code{\link{plot.ortho_projection}}} } } +\references{ +Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., +Scholten, T. 2013a. The spectrum-based learner: A new local approach for +modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, +268-279.
+} +\seealso{ +Useful links: +\itemize{ +\item \url{https://github.com/l-ramirez-lopez/resemble} +\item Report bugs at \url{https://github.com/l-ramirez-lopez/resemble/issues} +} +} \author{ -\href{https://orcid.org/0000-0002-5369-5120}{Leonardo Ramirez-Lopez} [aut, cre] +\strong{Maintainer / Creator}: Leonardo Ramirez-Lopez \email{ramirez.lopez.leo@gmail.com} + +Authors: +\itemize{ +\item Leonardo Ramirez-Lopez (\href{https://orcid.org/0000-0002-5369-5120}{ORCID}) -\href{https://orcid.org/0000-0002-1588-7519}{Antoine Stevens} [ctb] +\item Antoine Stevens (\href{https://orcid.org/0000-0002-1588-7519}{ORCID}) -\href{https://orcid.org/0000-0003-1540-4748}{Raphael Viscarra Rossel} [ctb] +\item Raphael Viscarra Rossel (\href{https://orcid.org/0000-0003-1540-4748}{ORCID}) -\href{https://orcid.org/0000-0001-5416-8640}{Craig Lobsey} [ctb] +\item Craig Lobsey (\href{https://orcid.org/0000-0001-5416-8640}{ORCID}) -\href{https://orcid.org/0000-0001-7325-9716}{Alex Wadoux} [ctb] +\item Alex Wadoux (\href{https://orcid.org/0000-0001-7325-9716}{ORCID}) -\href{https://orcid.org/0000-0001-5695-8064}{Timo Breure} [ctb] +\item Timo Breure (\href{https://orcid.org/0000-0001-5695-8064}{ORCID}) +} } diff --git a/man/search_neighbors.Rd b/man/search_neighbors.Rd index 38b49f0..83f55b7 100644 --- a/man/search_neighbors.Rd +++ b/man/search_neighbors.Rd @@ -60,7 +60,7 @@ to be used in the selection of the nearest neighbors of each observation. \itemize{ \item{\code{diss_method = "pls"}} \item{\code{diss_method = "pca"} with \code{"opc"} used as the method - in the \code{pc_selection} argument. See \code{\link{ortho_diss}.}} + in the \code{pc_selection} argument. See [ortho_diss()].} }} \item{k}{an integer value indicating the k-nearest neighbors of each @@ -95,30 +95,32 @@ method). The methods available are: (of a set of observations) is the one for which its distance matrix minimizes the differences between the \code{Yr} value of each observation and the \code{Yr} value of its closest observation. In this - case \code{value} must be a value (larger than 0 and - below \code{min(nrow(Xr), nrow(X2), ncol(Xr))}) indicating the maximum - number of principal components to be tested. See the - \code{\link{ortho_projection}} function for more details.} + case \code{value} must be a value (larger than 0 and below the + minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} combined) + indicating the maximum number of principal components to be tested. + See the \code{\link{ortho_projection}} function for more details.} \item{\code{"cumvar"}:}{ selection of the principal components based on a given cumulative amount of explained variance. In this case, \code{value} must be a value (larger than 0 and below or equal to 1) - indicating the minimum amount of cumulative variance that the + indicating the minimum amount of cumulative variance that the combination of retained components should explain.} \item{\code{"var"}:}{ selection of the principal components based on a given amount of explained variance. In this case, \code{value} must be a value (larger than 0 and below or equal to 1) - indicating the minimum amount of variance that a single component + indicating the minimum amount of variance that a single component should explain in order to be retained.} \item{\code{"manual"}:}{ for manually specifying a fixed number of principal components. In this case, \code{value} must be a value - (larger than 0 and \code{min(nrow(Xr), nrow(X2), ncol(Xr))}).
+ (larger than 0 and below the + minimum dimension of \code{Xr} or \code{Xr} and \code{Xu} combined) indicating the number of principal components to be retained.} } -The default list passed is \code{list(method = "var", value = 0.01)}. +The default is \code{list(method = "var", value = 0.01)}. + Optionally, the \code{pc_selection} argument admits \code{"opc"} or \code{"cumvar"} or \code{"var"} or \code{"manual"} as a single character string. In such a case the default \code{"value"} when either \code{"opc"} or @@ -185,19 +187,18 @@ a \code{list} containing the following elements: \item{\code{dissimilarity}}{ If \code{return_dissimilarity = TRUE} the dissimilarity object used (as computed by the \code{\link{dissimilarity}} function).} - \item{\code{projection}}{ an \code{ortho_projection} object. Only output if + \item{\code{projection}}{ an \code{ortho_projection} object. Only output if \code{return_projection = TRUE} and if \code{diss_method = "pca"}, \code{diss_method = "pca.nipals"} or \code{diss_method = "pls"}. + This object contains the projection used to compute the dissimilarity matrix. In case of local dissimilarity matrices, the projection corresponds to the global projection used to select the - neighborhoods (see \code{\link{ortho_diss}} function for further + neighborhoods (see \code{\link{ortho_diss}} function for further details).} } } \description{ -\lifecycle{maturing} - \loadmathjax This function searches in a reference set the neighbors of the observations provided in another set. @@ -218,7 +219,7 @@ used inside \code{\link{dissimilarity}} (i.e. \code{\link{ortho_diss}} those functions as additional arguments (i.e. \code{...}). } \examples{ -\dontrun{ +\donttest{ library(prospectr) data(NIRsoil) diff --git a/man/sid.Rd b/man/sid.Rd index c1cf5a1..38e0f5a 100644 --- a/man/sid.Rd +++ b/man/sid.Rd @@ -83,9 +83,9 @@ a \code{list} with the following components: } } \description{ -\lifecycle{experimental} - \loadmathjax +\ifelse{html}{\out{Experimental lifecycle}}{\strong{Experimental}} + This function computes the spectral information divergence/dissimilarity between spectra based on the Kullback-Leibler divergence algorithm (see details). } @@ -144,7 +144,7 @@ the \code{sid} function will accept negative values and matrix centering will be possible. } \examples{ -\dontrun{ +\donttest{ library(prospectr) data(NIRsoil) diff --git a/man/sim_eval.Rd b/man/sim_eval.Rd index 3e05671..0d3193e 100644 --- a/man/sim_eval.Rd +++ b/man/sim_eval.Rd @@ -33,7 +33,9 @@ variable, then the kappa index is used. See details.} } } \description{ -\lifecycle{stable} +\loadmathjax + +\ifelse{html}{\out{Stable lifecycle}}{\strong{Stable}} This function searches for the most similar observation (closest neighbor) of each observation in a given data set based on a dissimilarity (e.g. distance @@ -43,7 +45,6 @@ square of differences and the correlation coefficient are used for continuous variables and for discrete variables the kappa index is used. } \details{ -\loadmathjax For the evaluation of dissimilarity matrices this function uses side information (information about one variable which is available for a group of observations, Ramirez-Lopez et al., 2013). It is assumed that there @@ -57,12 +58,12 @@ provided.
If \code{side_info} is numeric, the root mean square of differences (RMSD) is used. It is computed as follows: \mjdeqn{j(i) = NN(xr_i, Xr^{\{-i\}})}{j(i) = NN(xr_i, Xr^{\{-i\}})} \mjdeqn{RMSD = \sqrt{\frac{1}{m} \sum_{i=1}^m {(y_i - y_{j(i)})^2}}}{RMSD = \sqrt{1/m sum_{i=1}^m (y_i - y_{j(i)})^2}} -where \mjeqn{NN(xr_i, Xr^{-i})}{NN(xr_i, Xr^{-i})} represents a function to -obtain the index of the nearest neighbor observation found in \mjeqn{Xr}{Xr} -(excluding the \mjeqn{i}{i}th observation) for \mjeqn{xr_i}{xr_i}, -\mjeqn{y_{i}}{y_i} is the value of the side variable of the \mjeqn{i}{i}th -observation, \mjeqn{y_{j(i)}}{y_{j(i)}} is the value of the side variable of -the nearest neighbor of the \mjeqn{i}{i}th observation and \mjeqn{m}{m} is +where \mjeqn{NN(xr_i, Xr^{-i})}{NN(xr_i, Xr^{-i})} represents a function to +obtain the index of the nearest neighbor observation found in \mjeqn{Xr}{Xr} +(excluding the \mjeqn{i}{i}th observation) for \mjeqn{xr_i}{xr_i}, +\mjeqn{y_{i}}{y_i} is the value of the side variable of the \mjeqn{i}{i}th +observation, +\mjeqn{y_{j(i)}}{y_{j(i)}} is the value of the side variable of +the nearest neighbor of the \mjeqn{i}{i}th observation and \mjeqn{m}{m} is the total number of observations. If \code{side_info} is a factor the kappa index (\mjeqn{\kappa}{kappa}) is @@ -79,12 +80,9 @@ observations). While \mjeqn{p_o}{p_o} is the relative agreement This function accepts vectors to be passed to argument \code{d}; in this case, the vector must represent the lower triangle of a dissimilarity matrix (e.g. as returned by the [stats::dist()] function of \code{stats}). - -This function supports multi-threading based on OpenMP for retrieving the -closest observations. } \examples{ -\dontrun{ +\donttest{ library(prospectr) data(NIRsoil) diff --git a/my-comments.md b/my-comments.md new file mode 100644 index 0000000..20ac0ad --- /dev/null +++ b/my-comments.md @@ -0,0 +1,84 @@ +# resemble + +# Rhub checks for release of `resemble 2.0.0` (`gordillo`) + +29.10.2020 +As requested by CRAN: +- The length of the title is now below + 65 characters +- A reference has been added in the description field of DESCRIPTION +- \donttest{} is now used (instead of \dontrun{}) for those examples + taking more than 5 seconds +- a verbose argument has been added to the functions to easily suppress any + messages other than errors and warnings +- on.exit() is now called properly to restore user + parameters when the functions exit +- Users' options() are reset in the examples in the vignette + that require changes in those options() +- Examples now explicitly use a maximum of two cores (if + available). + +The package has been checked on multiple platforms using rhub::check(), as well +as on the win-builders provided by CRAN. + + + +21.10.2020 + +The checks were conducted on the following platforms through rhub: + +- "debian-clang-devel" + +- "debian-gcc-devel" + +- "fedora-gcc-devel" + +- "debian-gcc-devel-nold" + +- "debian-gcc-patched" + +- "debian-gcc-release" + +- "linux-x86_64-rocker-gcc-san" + +- "macos-highsierra-release-cran" + +- "solaris-x86-patched-ods" + +- "ubuntu-gcc-release" + +- "windows-x86_64-devel" + +For example, for the checks on "fedora-gcc-devel", the following code was used: +``` +rhub::check("./resemble_2.0.0.tar.gz", + platform = c("fedora-gcc-devel"), + email = "ramirez.lopez.leo@gmail.com") +``` + +Some of the unit tests for `pls_projection` and `pc_projection` were failing +on three platforms, apparently due to numeric accuracy and the use of OpenMP in +Rcpp. These platforms were: + + - "debian-gcc-devel" + + - "debian-gcc-devel-nold" + + - "linux-x86_64-rocker-gcc-san" + +The remaining platforms were passing all the tests successfully. For the above +three platforms, the solution was to disable OpenMP.
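+
+To double-check the fix locally, the two affected test files can be re-run in
+isolation. A minimal sketch (assuming the package source tree is the working
+directory and that resemble, testthat and prospectr are installed):
+
+```
+library(testthat)
+library(prospectr)
+library(resemble)
+
+# re-run only the projection tests that failed on those platforms
+test_file("tests/testthat/test-pc_projection.R")
+test_file("tests/testthat/test-pls_projection.R")
+```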
+ +All platforms now pass the checks successfully for the release. + +## Size of the package +To reduce the size of the package, Makevars was modified and Makevars.win was +added. + +## NOTE for compiled code +A strange NOTE was thrown when the check was run locally on Windows with R `4.0.3`. +It is apparently a problem in R core and not related to the package or Rcpp. +The issue was reported here: +https://stackoverflow.com/questions/64402688/information-on-o-files-for-x64-is-not-available-note-on-r-package-checks-using/64419033#64419033 + + diff --git a/src/Makevars b/src/Makevars index 5323a4e..7503ac8 100644 --- a/src/Makevars +++ b/src/Makevars @@ -1,3 +1,6 @@ ## Use the R_HOME indirection to support installations of multiple R version PKG_LIBS = `$(R_HOME)/bin/Rscript -e "Rcpp:::LdFlags()"` $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CXXFLAGS) -PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) \ No newline at end of file +PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) +strippedLib: $(SHLIB) + if test -e "/usr/bin/strip" & test -e "/bin/uname" & [[ `uname` == "Linux" ]]; then /usr/bin/strip --strip-debug $(SHLIB); fi +.phony: strippedLib diff --git a/src/Makevars.win b/src/Makevars.win new file mode 100644 index 0000000..b8212ab --- /dev/null +++ b/src/Makevars.win @@ -0,0 +1,6 @@ +## This assumes that we can call Rscript to ask Rcpp about its locations ## Use the R_HOME indirection to support installations of multiple R versions +PKG_LIBS = $(shell "${R_HOME}/bin${R_ARCH_BIN}/Rscript.exe" -e "Rcpp:::LdFlags()") + +PKG_CPPFLAGS = -I../inst/include -I. 
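+## Append R's LAPACK, BLAS and Fortran linker flags (needed by RcppArmadillo's linear algebra)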
+PKG_LIBS += $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) diff --git a/src/RcppExports.cpp b/src/RcppExports.cpp index 502f743..b63b93e 100644 --- a/src/RcppExports.cpp +++ b/src/RcppExports.cpp @@ -30,38 +30,38 @@ BEGIN_RCPP return rcpp_result_gen; END_RCPP } -// fastDistVVL -NumericVector fastDistVVL(NumericVector X); -RcppExport SEXP _resemble_fastDistVVL(SEXP XSEXP) { +// moving_cor_diss +NumericMatrix moving_cor_diss(arma::mat X, arma::mat Y, int w); +RcppExport SEXP _resemble_moving_cor_diss(SEXP XSEXP, SEXP YSEXP, SEXP wSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; - Rcpp::traits::input_parameter< NumericVector >::type X(XSEXP); - rcpp_result_gen = Rcpp::wrap(fastDistVVL(X)); + Rcpp::traits::input_parameter< arma::mat >::type X(XSEXP); + Rcpp::traits::input_parameter< arma::mat >::type Y(YSEXP); + Rcpp::traits::input_parameter< int >::type w(wSEXP); + rcpp_result_gen = Rcpp::wrap(moving_cor_diss(X, Y, w)); return rcpp_result_gen; END_RCPP } -// minDissV -NumericVector minDissV(NumericVector X); -RcppExport SEXP _resemble_minDissV(SEXP XSEXP) { +// which_min +NumericVector which_min(NumericMatrix X); +RcppExport SEXP _resemble_which_min(SEXP XSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; - Rcpp::traits::input_parameter< NumericVector >::type X(XSEXP); - rcpp_result_gen = Rcpp::wrap(minDissV(X)); + Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); + rcpp_result_gen = Rcpp::wrap(which_min(X)); return rcpp_result_gen; END_RCPP } -// moving_cor_diss -NumericMatrix moving_cor_diss(NumericMatrix X, NumericMatrix Y, int w); -RcppExport SEXP _resemble_moving_cor_diss(SEXP XSEXP, SEXP YSEXP, SEXP wSEXP) { +// which_min_vector +NumericVector which_min_vector(NumericVector X); +RcppExport SEXP _resemble_which_min_vector(SEXP XSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; - Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); - Rcpp::traits::input_parameter< NumericMatrix >::type Y(YSEXP); - Rcpp::traits::input_parameter< int >::type w(wSEXP); - rcpp_result_gen = Rcpp::wrap(moving_cor_diss(X, Y, w)); + Rcpp::traits::input_parameter< NumericVector >::type X(XSEXP); + rcpp_result_gen = Rcpp::wrap(which_min_vector(X)); return rcpp_result_gen; END_RCPP } @@ -208,15 +208,18 @@ BEGIN_RCPP END_RCPP } // reconstruction_error -Rcpp::NumericMatrix reconstruction_error(arma::mat x, arma::mat projection_mat, arma::mat xloadings); -RcppExport SEXP _resemble_reconstruction_error(SEXP xSEXP, SEXP projection_matSEXP, SEXP xloadingsSEXP) { +Rcpp::NumericMatrix reconstruction_error(arma::mat x, arma::mat projection_mat, arma::mat xloadings, bool scale, arma::mat Xcenter, arma::mat Xscale); +RcppExport SEXP _resemble_reconstruction_error(SEXP xSEXP, SEXP projection_matSEXP, SEXP xloadingsSEXP, SEXP scaleSEXP, SEXP XcenterSEXP, SEXP XscaleSEXP) { BEGIN_RCPP Rcpp::RObject rcpp_result_gen; Rcpp::RNGScope rcpp_rngScope_gen; Rcpp::traits::input_parameter< arma::mat >::type x(xSEXP); Rcpp::traits::input_parameter< arma::mat >::type projection_mat(projection_matSEXP); Rcpp::traits::input_parameter< arma::mat >::type xloadings(xloadingsSEXP); - rcpp_result_gen = Rcpp::wrap(reconstruction_error(x, projection_mat, xloadings)); + Rcpp::traits::input_parameter< bool >::type scale(scaleSEXP); + Rcpp::traits::input_parameter< arma::mat >::type Xcenter(XcenterSEXP); + Rcpp::traits::input_parameter< arma::mat >::type Xscale(XscaleSEXP); + rcpp_result_gen = Rcpp::wrap(reconstruction_error(x, 
projection_mat, xloadings, scale, Xcenter, Xscale)); return rcpp_result_gen; END_RCPP } @@ -327,35 +330,13 @@ BEGIN_RCPP return rcpp_result_gen; END_RCPP } -// which_min -NumericVector which_min(NumericMatrix X); -RcppExport SEXP _resemble_which_min(SEXP XSEXP) { -BEGIN_RCPP - Rcpp::RObject rcpp_result_gen; - Rcpp::RNGScope rcpp_rngScope_gen; - Rcpp::traits::input_parameter< NumericMatrix >::type X(XSEXP); - rcpp_result_gen = Rcpp::wrap(which_min(X)); - return rcpp_result_gen; -END_RCPP -} -// which_min_vector -NumericVector which_min_vector(NumericVector X); -RcppExport SEXP _resemble_which_min_vector(SEXP XSEXP) { -BEGIN_RCPP - Rcpp::RObject rcpp_result_gen; - Rcpp::RNGScope rcpp_rngScope_gen; - Rcpp::traits::input_parameter< NumericVector >::type X(XSEXP); - rcpp_result_gen = Rcpp::wrap(which_min_vector(X)); - return rcpp_result_gen; -END_RCPP -} static const R_CallMethodDef CallEntries[] = { {"_resemble_fast_diss", (DL_FUNC) &_resemble_fast_diss, 3}, {"_resemble_fast_diss_vector", (DL_FUNC) &_resemble_fast_diss_vector, 1}, - {"_resemble_fastDistVVL", (DL_FUNC) &_resemble_fastDistVVL, 1}, - {"_resemble_minDissV", (DL_FUNC) &_resemble_minDissV, 1}, {"_resemble_moving_cor_diss", (DL_FUNC) &_resemble_moving_cor_diss, 3}, + {"_resemble_which_min", (DL_FUNC) &_resemble_which_min, 1}, + {"_resemble_which_min_vector", (DL_FUNC) &_resemble_which_min_vector, 1}, {"_resemble_get_col_largest_sd", (DL_FUNC) &_resemble_get_col_largest_sd, 1}, {"_resemble_get_column_sds", (DL_FUNC) &_resemble_get_column_sds, 1}, {"_resemble_get_column_means", (DL_FUNC) &_resemble_get_column_means, 1}, @@ -366,15 +347,13 @@ static const R_CallMethodDef CallEntries[] = { {"_resemble_opls_get_basics", (DL_FUNC) &_resemble_opls_get_basics, 6}, {"_resemble_predict_opls", (DL_FUNC) &_resemble_predict_opls, 6}, {"_resemble_project_opls", (DL_FUNC) &_resemble_project_opls, 6}, - {"_resemble_reconstruction_error", (DL_FUNC) &_resemble_reconstruction_error, 3}, + {"_resemble_reconstruction_error", (DL_FUNC) &_resemble_reconstruction_error, 6}, {"_resemble_get_pls_weights", (DL_FUNC) &_resemble_get_pls_weights, 9}, {"_resemble_opls_cv_cpp", (DL_FUNC) &_resemble_opls_cv_cpp, 12}, {"_resemble_gaussian_process", (DL_FUNC) &_resemble_gaussian_process, 4}, {"_resemble_predict_gaussian_process", (DL_FUNC) &_resemble_predict_gaussian_process, 8}, {"_resemble_gaussian_process_cv", (DL_FUNC) &_resemble_gaussian_process_cv, 6}, {"_resemble_pca_nipals", (DL_FUNC) &_resemble_pca_nipals, 8}, {NULL, NULL, 0} }; diff --git a/src/diss_helpers.cpp b/src/diss_helpers.cpp new file mode 100644 index 0000000..bbf25a9 --- /dev/null +++ b/src/diss_helpers.cpp @@ -0,0 +1,285 @@ +#include <RcppArmadillo.h> +// [[Rcpp::depends(RcppArmadillo)]] + +using namespace Rcpp; +using namespace RcppArmadillo; + +//' @title A fast distance algorithm for two matrices written in C++ +//' @description Computes distances between two data matrices using +//' "euclid", "cor", "cosine" +//' @usage +//' fast_diss(X, Y, method) +//' @param X a matrix +//' @param Y a matrix +//' @param method a \code{string} with possible values "euclid", "cor", "cosine" +//' @return a distance matrix +//' @keywords internal +//' @useDynLib resemble +//' @author Antoine Stevens and Leonardo Ramirez-Lopez +// [[Rcpp::export]] +arma::mat fast_diss(NumericMatrix X, NumericMatrix Y, String method){ + + //double eps = sqrt(DOUBLE_EPS); + //FIXME check numerical precision in 
Rcpp + //in some cases it returns 0s as -1e-14 + //perhaps due to reuse memory? + // Option: make the inputs arma::mat X arma::mat Y + + int nX = X.nrow(), kX = X.ncol(), nY = Y.nrow(), kY = Y.ncol(); + arma::mat XX(X.begin(), nX, kX, false); // reuses memory and avoids extra copy + arma::mat YY(Y.begin(), nY, kY, false); // reuses memory and avoids extra copy + if(method == "euclid"){ + arma::mat output = arma::ones(nY,1) * arma::sum(arma::square(XX),1).t() + arma::sum(arma::square(YY),1) * arma::ones(1,nX) - 2 * YY * XX.t(); + return output; + } + if(method=="cor"){ + arma::mat output = (1 - arma::cor(XX.t(), YY.t()))/2; + return output.t(); + } + else{ // cosine + arma::mat numerator = XX * YY.t(); + arma::mat dvsr = arma::sqrt(arma::sum(arma::square(XX),1)) * arma::sqrt(arma::sum(arma::square(YY),1)).t(); + arma::mat output = arma::acos(numerator/dvsr); + return output.t(); + } +} + + + +//' @title A fast algorithm of (squared) Euclidean cross-distance for vectors written in C++ +//' @description A fast (parallel for Linux) algorithm of (squared) Euclidean cross-distance for vectors written in C++ +//' @usage +//' fast_diss_vector(X) +//' @param X a vector. +//' @return a vector of distance (lower triangle of the distance matrix, stored by column) +//' @details used internally in ortho_projection +//' @author Antoine Stevens +//' @keywords internal +//' @useDynLib resemble +// [[Rcpp::export]] +NumericVector fast_diss_vector(NumericVector X){ + int nX = X.size(); + int n = ((nX*nX)-nX)/2; + NumericVector output(n); + // #if defined(_OPENMP) + // #pragma omp parallel for schedule(dynamic) + // #endif + for(int i = 0; i < nX-1; i++) + for(int j = i+1; j < nX; j++){ + double x = X(j)-X(i); + output(nX*i - (i * (i + 3) / 2) + j - 1) = x * x; + } + return output; +} + +// //' @title A fast (serial) algorithm of Euclidean (non-squared) cross-distance for vectors written in C++ +// //' @description A fast algorithm of (squared) Euclidean cross-distance for vectors written in C++ +// //' @usage +// //' fastDistVVL(X) +// //' @param X a vector +// //' @return a vector of distance (lower triangle of the distance matrix, stored by column) +// //' @details used internally in ortho_projection +// //' @author Leo Ramirez-Lopez +// //' @keywords internal +// //' @useDynLib resemble +// // [[Rcpp::export]] +// NumericVector fastDistVVL(NumericVector X){ +// int nX = X.size(); +// // Compute the length of the output vector +// // that will contain the lower triangle +// // of the distance matrix... +// int n = ((nX * nX) - nX) / 2; +// // ... 
and create the vector +// NumericVector output(n); +// +// int starti = 0; +// int endi = 0; +// for(int i = 0; i < (nX - 1); i++){ +// int ii = i + 1; +// int length_vector; +// starti = (i * (nX - 1)) - (((i * i) - i) / 2); +// endi = ((ii * (nX - 1)) - (((ii * ii) - ii) / 2)) - 1; +// // create a vector of indices +// Range indices_1(starti, endi); +// Range indices_2(ii, nX - 1); +// length_vector = indices_1.size(); +// // Create a vector of length length_vector with +// // repeated X[i] values +// NumericVector ith_point(length_vector, X[i]); +// output[indices_1] = ith_point - X[indices_2]; +// } +// +// // compute power of 2 +// // convert to absolute differences (equivalent to euclidean) +// output = pow(output, 2); +// return output; +// } + + +// //' @title A function to compute indices of minimum values of a distance vector +// //' @description For internal use only +// //' @usage +// //' minDissV(X) +// //' @param X a vector of distance (as computed in \code{resemble:::fastDistVV} or \code{base::dist}). +// //' @return a vector of the indices of the nearest neighbors +// //' @details +// //' Used internally to find the nearest neighbors. +// //' It searches in lower (or upper) triangular matrix. Therefore this must be the format of the +// //' input data. The piece of code int \code{len = (sqrt(X.size()*8+1)+1)/2} generated an error in CRAN +// //' since \code{sqrt} cannot be applied to integers. +// //' @keywords internal +// //' @useDynLib resemble +// //' @author Antoine Stevens +// // [[Rcpp::plugins(openmp)]] +// // [[Rcpp::export]] +// NumericVector minDissV(NumericVector X){ +// // For the distances +// int nX = X.size(); +// int n = ((nX * nX) - nX) / 2; +// NumericVector doutput(n); +// +// // For the indices +// arma::uword index; +// double vct = (sqrt(((double)doutput.size()) * 8.0 + 1.0) + 1.0) / 2.0; +// int len = (int)vct; +// // int len = (sqrt(Ds.size()*8+1)+1)/2; +// arma::uvec vindex(len); +// int i,j; +// #if defined(_OPENMP) +// #pragma omp parallel for private(i,j) schedule(dynamic) +// #endif +// for(int i = 0; i < nX - 1; i++) +// for(int j = i + 1; j < nX; j++){ +// double x = X(j) - X(i); +// doutput(nX*i - (i * (i + 3) / 2) + j - 1) = x * x; +// } +// +// for(i = 0; i < len; i++){ +// arma::vec x(len); +// for(j = 0; j < i; j++){ +// // triangular sequence +// int k = j * len - (j * (j + 3) / 2) + i - 1; +// x[j] = doutput(k); +// } +// for(j = i + 1; j < len; j++){ +// // triangular sequence +// int k2 = i * len - (i * (i + 3) / 2) + j - 1; +// x[j] = doutput(k2); +// } +// x[i] = arma::datum::nan; // remove diag +// x.min(index); // don't assign result to a value since we are interested only in the index +// vindex[i] = index; +// } +// return wrap(vindex + 1); +// } + + +//' @title Moving/rolling correlation distance of two matrices +//' @description Computes a moving window correlation distance between two data matrices +//' @usage +//' moving_cor_diss(X,Y,w) +//' @param X a matrix +//' @param Y a matrix +//' @param w window size (must be odd) +//' @return a matrix of correlation distance +//' @keywords internal +//' @useDynLib resemble +//' @author Leonardo Ramirez-Lopez and Antoine Stevens +// [[Rcpp::export]] +NumericMatrix moving_cor_diss(arma::mat X, arma::mat Y, int w){ + arma::mat rmwF = arma::zeros(X.n_rows, Y.n_rows); + if ((w % 2) == 0) { // note: '!w%2' parses as '(!w) % 2' and never fires; test evenness explicitly + throw std::invalid_argument("w must be odd"); + } + int gap = (w - 1) / 2; + if((Y.n_cols - w ) < 1){ + // assign to the outer rmwF; a fresh local declaration here would be discarded + rmwF = arma::cor(trans(X), trans(Y)); + } else { + int ny = Y.n_cols; + for(int i = gap; i < 
ny - gap; i++){ + // sum of the correlations + rmwF += arma::cor(X.cols(i-gap, i + gap).t(), Y.cols(i - gap, i + gap).t()); + } + // get the average + rmwF = rmwF / (Y.n_cols - (2*gap)); + } + // get the distance + arma::mat scmw = (1 - rmwF)/2; + return wrap(scmw.t()); +} + +//' @title A function to compute row-wise index of minimum values of a square distance matrix +//' @description For internal use only +//' @usage +//' which_min(X) +//' @param X a square matrix of distances +//' @return a vector of the indices of the minimum value in each row of the input matrix +//' @details Used internally to find the nearest neighbors +//' @keywords internal +//' @useDynLib resemble +//' @author Antoine Stevens +// [[Rcpp::export]] +NumericVector which_min(NumericMatrix X){ + int nX = X.nrow(), kX = X.ncol(); + arma::mat XX(X.begin(), nX, kX, false); + arma::uword index; + arma::uvec vindex(nX); + // #if defined(_OPENMP) + // #pragma omp parallel for schedule(static) + // #endif + for(int i = 0; i < nX; i++){ + arma::rowvec x = XX.row(i); + x(i) = arma::datum::nan; // remove diag + x.min(index); // don't assign result to a value since we are interested only in the index + vindex[i] = index; + } + return wrap(vindex +1); +} + + +//' @title A function to compute indices of minimum values of a distance vector +//' @description For internal use only +//' @usage +//' which_min_vector(X) +//' @param X a vector of distances +//' @return a vector of the indices of the nearest neighbors +//' @details +//' Used internally to find the nearest neighbors. +//' It searches in lower (or upper) triangular matrix. Therefore this must be the format of the +//' input data. The piece of code int \code{len = (sqrt(X.size()*8+1)+1)/2} generated an error in CRAN +//' since \code{sqrt} cannot be applied to integers. 
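+//' Instead, len is recovered in double precision: for a lower-triangle vector
+//' of length n built from len points, n = len * (len - 1) / 2, which inverts to
+//' len = (sqrt(8 * n + 1) + 1) / 2.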
+//' @keywords internal +//' @useDynLib resemble +//' @author Antoine Stevens +// [[Rcpp::export]] +NumericVector which_min_vector(NumericVector X){ + arma::uword index; + double vct = (sqrt(((double)X.size()) * 8.0 + 1.0) + 1.0) / 2.0; + int len = (int)vct; + // int len = (sqrt(X.size()*8+1)+1)/2; + arma::uvec vindex(len); + int i; + int j; + // #if defined(_OPENMP) + // #pragma omp parallel for private(i,j) schedule(dynamic) + // #endif + + for(i = 0; i < len; i++){ + arma::vec x(len); + for(j = 0; j < i; j++){ + // triangular sequence + int k = j * len - (j * (j + 3) / 2) + i - 1; + x[j] = X(k); + } + for(j = i+1; j < len; j++){ + // triangular sequence + int k2 = i * len - (i * (i + 3) / 2) + j - 1; + x[j] = X(k2); + } + x[i] = arma::datum::nan; // remove diag + x.min(index); // don't assign result to a value since we are interested only in the index + vindex[i] = index; + } + return wrap(vindex + 1); +} + diff --git a/src/fast_diss.cpp b/src/fast_diss.cpp deleted file mode 100644 index f9a3109..0000000 --- a/src/fast_diss.cpp +++ /dev/null @@ -1,177 +0,0 @@ -#include -#ifdef _OPENMP -#include // OpenMP -#endif -// [[Rcpp::depends(RcppArmadillo)]] -// [[Rcpp::plugins(openmp)]] - -using namespace Rcpp; - -//' @title A fast distance algorithm for two matrices written in C++ -//' @description Computes distances between two data matrices using -//' "euclid", "cor", "cosine" -//' @usage -//' fast_diss(X, Y, method) -//' @param X a matrix -//' @param Y a matrix -//' @param method a \code{string} with possible values "euclid", "cor", "cosine" -//' @return a distance matrix -//' @keywords internal -//' @useDynLib resemble -//' @author Antoine Stevens and Leonardo Ramirez-Lopez -// [[Rcpp::export]] -arma::mat fast_diss(NumericMatrix X, NumericMatrix Y, String method){ - - //double eps = sqrt(DOUBLE_EPS); - //FIXME check numerical precision in Rcpp - //in some cases it returns 0s as -1e-14 - //perhaps due to reuse memory? - - int nX = X.nrow(), kX = X.ncol(), nY = Y.nrow(), kY = Y.ncol(); - arma::mat XX(X.begin(), nX, kX, false); // reuses memory and avoids extra copy - arma::mat YY(Y.begin(), nY, kY, false); // reuses memory and avoids extra copy - if(method == "euclid"){ - arma::mat output = arma::ones(nY,1) * arma::sum(arma::square(XX),1).t() + arma::sum(arma::square(YY),1) * arma::ones(1,nX) - 2 * YY * XX.t(); - return output; - } - if(method=="cor"){ - arma::mat output = (1 - arma::cor(XX.t(), YY.t()))/2; - return output.t(); - } - else{ // cosine - arma::mat numerator = XX * YY.t(); - arma::mat dvsr = arma::sqrt(arma::sum(arma::square(XX),1)) * arma::sqrt(arma::sum(arma::square(YY),1)).t(); - arma::mat output = arma::acos(numerator/dvsr); - return output.t(); - } -} - - - -//' @title A fast algorithm of (squared) Euclidean cross-distance for vectors written in C++ -//' @description A fast (parallel for linux) algorithm of (squared) Euclidean cross-distance for vectors written in C++ -//' @usage -//' fast_diss_vector(X) -//' @param X a vector. 
-//' @return a vector of distance (lower triangle of the distance matrix, stored by column) -//' @details used internally in ortho_projection -//' @author Antoine Stevens -//' @keywords internal -//' @useDynLib resemble -// [[Rcpp::plugins(openmp)]] -// [[Rcpp::export]] -NumericVector fast_diss_vector(NumericVector X){ - int nX = X.size(); - int n = ((nX*nX)-nX)/2; - NumericVector output(n); -#if defined(_OPENMP) -#pragma omp parallel for schedule(dynamic) -#endif - for(int i = 0; i < nX-1; i++) - for(int j = i+1; j < nX; j++){ - double x = X(j)-X(i); - output(nX*i - (i * (i + 3) / 2) + j - 1) = x * x; - } - return output; -} - -//' @title A fast (serial) algorithm of Euclidean (non-squared) cross-distance for vectors written in C++ -//' @description A fast algorithm of (squared) Euclidean cross-distance for vectors written in C++ -//' @usage -//' fastDistVVL(X) -//' @param X a vector -//' @return a vector of distance (lower triangle of the distance matrix, stored by column) -//' @details used internally in ortho_projection -//' @author Leo Ramirez-Lopez -//' @keywords internal -//' @useDynLib resemble -// [[Rcpp::export]] -NumericVector fastDistVVL(NumericVector X){ - int nX = X.size(); - // Compute the length of the output vector - // that will contain the lower tringle - // of the distance matrix... - int n = ((nX * nX) - nX) / 2; - // ... and create the vector - NumericVector output(n); - - int starti = 0; - int endi = 0; - for(int i = 0; i < (nX - 1); i++){ - int ii = i + 1; - int length_vector; - starti = (i * (nX - 1)) - (((i * i) - i) / 2); - endi = ((ii * (nX - 1)) - (((ii * ii) - ii) / 2)) - 1; - // create a vector or indices - Range indices_1(starti, endi); - Range indices_2(ii, nX - 1); - length_vector = indices_1.size(); - // Create a vector of length length_vector with - // repeated X[i] values - NumericVector ith_point(length_vector, X[i]); - output[indices_1] = ith_point - X[indices_2]; - } - - // compute power of 2 - // convert to absolute differences (equivalent to euclidean) - output = pow(output, 2); - return output; -} - - -//' @title A function to compute indices of minimum values of a distance vector -//' @description For internal use only -//' @usage -//' minDissV(X,omp_threads) -//' @param X a vector of distance (as computed in \code{resemble:::fastDistVV} or \code{base::dist}). -//' @return a vector of the indices of the nearest neighbors -//' @details -//' Used internally to find the nearest neighbors. -//' It searches in lower (or upper?) trianguular matrix. Therefore this must be the format of the -//' input data. The piece of code int \code{len = (sqrt(X.size()*8+1)+1)/2} generated an error in CRAN -//' since \code{sqrt} cannot be applied to integers. 
-//' @keywords internal -//' @useDynLib resemble -//' @author Antoine Stevens -// [[Rcpp::plugins(openmp)]] -// [[Rcpp::export]] -NumericVector minDissV(NumericVector X){ - // For the distances - int nX = X.size(); - int n = ((nX * nX) - nX) / 2; - NumericVector doutput(n); - - // For the indices - arma::uword index; - double vct = (sqrt(((double)doutput.size()) * 8.0 + 1.0) + 1.0) / 2.0; - int len = (int)vct; - // int len = (sqrt(Ds.size()*8+1)+1)/2; - arma::uvec vindex(len); - int i,j; -#if defined(_OPENMP) -#pragma omp parallel for private(i,j) schedule(dynamic) -#endif - for(int i = 0; i < nX - 1; i++) - for(int j = i + 1; j < nX; j++){ - double x = X(j) - X(i); - doutput(nX*i - (i * (i + 3) / 2) + j - 1) = x * x; - } - - for(i = 0; i < len; i++){ - arma::vec x(len); - for(j = 0; j < i; j++){ - // triangular sequence - int k = j * len - (j * (j + 3) / 2) + i - 1; - x[j] = doutput(k); - } - for(j = i + 1; j < len; j++){ - // triangular sequence - int k2 = i * len - (i * (i + 3) / 2) + j - 1; - x[j] = doutput(k2); - } - x[i] = arma::datum::nan; // remove diag - x.min(index); // don't assign result to a value since we are interested only in the index - vindex[i] = index; - } - return wrap(vindex + 1); -} diff --git a/src/moving_cor_diss.cpp b/src/moving_cor_diss.cpp deleted file mode 100644 index f1bcb9b..0000000 --- a/src/moving_cor_diss.cpp +++ /dev/null @@ -1,33 +0,0 @@ -#include -// [[Rcpp::depends(RcppArmadillo)]] -using namespace Rcpp; - -//' @title Moving/rolling correlation distance of two matrices -//' @description Computes a moving window correlation distance between two data matrices -//' @usage -//' moving_cor_diss(X,Y,w) -//' @param X a matrix -//' @param Y a matrix -//' @param w window size (must be odd) -//' @return a matrix of correlation distance -//' @keywords internal -//' @useDynLib resemble -//' @author Leonardo Ramirez-Lopez and Antoine Stevens -// [[Rcpp::export]] -NumericMatrix moving_cor_diss(NumericMatrix X, NumericMatrix Y, int w){ - arma::mat XX(X.begin(), X.nrow(), X.ncol(), false); - arma::mat YY(Y.begin(), Y.nrow(), Y.ncol(), false); - arma::mat rmwF = arma::zeros(X.nrow(),Y.nrow()); - if(!w%2) throw exception("'w' must be odd") ; - int gap = (w-1)/2; - if((Y.ncol() - w ) < 1){ - arma::mat rmwF = arma::cor(XX.t(), YY.t()); - } else { - for(int i = gap; i < Y.ncol() - gap; i++){ - rmwF += arma::cor(XX.cols(i-gap,i+gap).t(), YY.cols(i-gap,i+gap).t()); // sum of the correlations - } - rmwF = rmwF/(Y.ncol()- (2*gap)); // get the average - } - arma::mat scmw = (1 - rmwF)/2; // get the distance - return wrap(scmw.t()); -} diff --git a/src/regression_methods.cpp b/src/regression_methods.cpp index bc13cce..9410211 100644 --- a/src/regression_methods.cpp +++ b/src/regression_methods.cpp @@ -1,12 +1,8 @@ #include #include #include -#ifdef _OPENMP -#include // OpenMP -#endif // [[Rcpp::depends(RcppArmadillo)]] -// [[Rcpp::plugins(openmp)]] using namespace Rcpp; @@ -265,7 +261,7 @@ List opls_for_projection(arma::mat X, ncomp = ith_comp - 1; ith_comp = ith_comp - 2; if(i == 0) { - throw exception("With the current value in the 'pc_selection' argument, no components are selected. Try another value."); + throw std::invalid_argument("With the current value in the 'pc_selection' argument, no components are selected. 
Try another value."); } if (pcSelmethod == "cumvar") { ncomp = ncomp + 1; @@ -276,9 +272,9 @@ List opls_for_projection(arma::mat X, } } - - + + arma::uvec pc_indices; if (pcSelmethod != "manual") { if (pcSelmethod == "var" || pcSelmethod == "cumvar") { @@ -835,7 +831,6 @@ List opls_get_basics(arma::mat X, for (int i = 0; i < ncomp; i++) { Yplsb = Ypls; - Xpls = Xpls; // Select the Y variable with the largest standard deviation imsd = get_col_largest_sd(Ypls); iypls = Ypls.col(imsd[0]); @@ -939,14 +934,14 @@ Rcpp::NumericMatrix predict_opls(arma::mat bo, bool scale, arma::mat Xscale ){ - + if (scale) { newdata = newdata / arma::repmat(Xscale, newdata.n_rows, 1); } // Not Necessary to center since b0 is used // Xz = Xz - arma::repmat(Xcenter, newdata.n_rows, 1); - + arma::mat predicted = (newdata * b.cols(0, ncomp - 1)) + arma::repmat(bo.cols(0, ncomp - 1), newdata.n_rows, 1); return Rcpp::wrap(predicted); } @@ -1001,10 +996,29 @@ Rcpp::NumericMatrix project_opls(arma::mat projection_mat, // [[Rcpp::export]] Rcpp::NumericMatrix reconstruction_error(arma::mat x, arma::mat projection_mat, - arma::mat xloadings){ + arma::mat xloadings, + bool scale, + arma::mat Xcenter, + arma::mat Xscale){ + + if(scale){ + x = x / arma::repmat(Xscale, x.n_rows, 1); + } + + //Necessary to center + x = x - arma::repmat(Xcenter, x.n_rows, 1); + arma::mat xrec = x; arma::mat xrmse; xrec = x * projection_mat * xloadings; + + // if(scale){ + // xrec = xrec % arma::repmat(Xscale, x.n_rows, 1); + // } + // + // //Necessary to center + // xrec = xrec + arma::repmat(Xcenter, newdata.n_rows, 1); + xrmse = sqrt(arma::mean(arma::mean(pow(x - xrec, 2), 0), 1)); return Rcpp::wrap(xrmse); } @@ -1684,9 +1698,8 @@ List pca_nipals(arma::mat X, { ncomp = ith_comp - 1; ith_comp = ith_comp - 2; - if (i == 0) - { - throw exception("With the current value in the 'pc_selection' argument, no components are selected. Try another value."); + if (i == 0) { + throw std::invalid_argument("With the current value in the 'pc_selection' argument, no components are selected. 
Try another value."); } break; } diff --git a/src/which_min.cpp b/src/which_min.cpp deleted file mode 100644 index e0a8331..0000000 --- a/src/which_min.cpp +++ /dev/null @@ -1,36 +0,0 @@ -#include -#ifdef _OPENMP -#include // OpenMP -#endif - -using namespace Rcpp; -// [[Rcpp::plugins(openmp)]] -// [[Rcpp::depends(RcppArmadillo)]] - -//' @title A function to compute row-wise index of minimum values of a square distance matrix -//' @description For internal use only -//' @usage -//' which_min(X) -//' @param X a square matrix of distances -//' @return a vector of the indices of the minimum value in each row of the input matrix -//' @details Used internally to find the nearest neighbors -//' @keywords internal -//' @useDynLib resemble -//' @author Antoine Stevens -// [[Rcpp::export]] -NumericVector which_min(NumericMatrix X){ - int nX = X.nrow(), kX = X.ncol(); - arma::mat XX(X.begin(), nX, kX, false); - arma::uword index; - arma::uvec vindex(nX); -#if defined(_OPENMP) - #pragma omp parallel for schedule(static) -#endif - for(int i = 0; i < nX; i++){ - arma::rowvec x = XX.row(i); - x(i) = arma::datum::nan; // remove diag - x.min(index); // don't assign result to a value since we are interested only in the index - vindex[i] = index; - } - return wrap(vindex +1); -} diff --git a/src/which_min_vector.cpp b/src/which_min_vector.cpp deleted file mode 100644 index a5a3263..0000000 --- a/src/which_min_vector.cpp +++ /dev/null @@ -1,52 +0,0 @@ -#include -#ifdef _OPENMP -#include // OpenMP -#endif - -using namespace Rcpp; -// [[Rcpp::plugins(openmp)]] -// [[Rcpp::depends(RcppArmadillo)]] - -//' @title A function to compute indices of minimum values of a distance vector -//' @description For internal use only -//' @usage -//' which_min_vector(X) -//' @param X a vector of distances -//' @return a vector of the indices of the nearest neighbors -//' @details -//' Used internally to find the nearest neighbors. -//' It searches in lower (or upper) triangular matrix. Therefore this must be the format of the -//' input data. The piece of code int \code{len = (sqrt(X.size()*8+1)+1)/2} generated an error in CRAN -//' since \code{sqrt} cannot be applied to integers. 
-//' @keywords internal -//' @useDynLib resemble -//' @author Antoine Stevens -// [[Rcpp::export]] -NumericVector which_min_vector(NumericVector X){ - arma::uword index; - double vct = (sqrt(((double)X.size()) * 8.0 + 1.0) + 1.0) / 2.0; - int len = (int)vct; - // int len = (sqrt(X.size()*8+1)+1)/2; - arma::uvec vindex(len); - int i,j; -#if defined(_OPENMP) - #pragma omp parallel for private(i,j) schedule(dynamic) -#endif - for(i = 0; i < len; i++){ - arma::vec x(len); - for(j = 0; j < i; j++){ - // triangular sequence - int k = j * len - (j * (j + 3) / 2) + i - 1; - x[j] = X(k); - } - for(j = i+1; j < len; j++){ - // triangular sequence - int k2 = i * len - (i * (i + 3) / 2) + j - 1; - x[j] = X(k2); - } - x[i] = arma::datum::nan; // remove diag - x.min(index); // don't assign result to a value since we are interested only in the index - vindex[i] = index; - } - return wrap(vindex +1); -} diff --git a/tests/testthat.R b/tests/testthat.R index b2bdcd6..3fefd38 100644 --- a/tests/testthat.R +++ b/tests/testthat.R @@ -1,4 +1,5 @@ library(testthat) +library(prospectr) library(resemble) test_check("resemble") diff --git a/tests/testthat/test-dissimilarity.R b/tests/testthat/test-dissimilarity.R index be418d5..4f55080 100644 --- a/tests/testthat/test-dissimilarity.R +++ b/tests/testthat/test-dissimilarity.R @@ -1,7 +1,116 @@ context("test-dissimilarity") + + test_that("dissimilarity works", { nirdata <- data("NIRsoil", package = "prospectr") + + Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ] + Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)] + + Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)] + Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ] + + Xu <- Xu[!is.na(Yu), ][1:20, ] + Xr <- Xr[!is.na(Yr), ][1:40, ] + + Yu <- Yu[!is.na(Yu)][1:20] + Yr <- Yr[!is.na(Yr)][1:40] + + dsm_pca <- dissimilarity( + Xr = Xr, Xu = Xu, + diss_method = c("pca"), + Yr = Yr, gh = TRUE, pc_selection = list("opc", 15), + return_projection = TRUE, + center = TRUE, scale = TRUE + ) + expected_n_comp <- 6 + dsm_pls <- dissimilarity( + Xr = standardNormalVariate(Xr), Xu = standardNormalVariate(Xu), + diss_method = c("pls"), + Yr = Yr, gh = TRUE, pc_selection = list("opc", 15), + return_projection = TRUE, + center = TRUE, scale = TRUE + ) + expected_n_pls <- 15 + + dsm_pca_var <- dissimilarity( + Xr = Xr, Xu = Xu, + diss_method = c("pca"), + Yr = Yr, gh = TRUE, pc_selection = list("var", 0.01), + return_projection = TRUE, + center = TRUE, scale = TRUE + ) + expected_n_comp_var <- 2 + dsm_pls_var <- dissimilarity( + Xr = Xr, Xu = Xu, + diss_method = c("pls"), + Yr = Yr, gh = TRUE, pc_selection = list("var", 0.01), + return_projection = TRUE, + center = TRUE, scale = TRUE + ) + expected_n_pls_var <- 2 + + + dsm_euclid <- dissimilarity( + Xr = Xr, Xu = Xu, + diss_method = "euclid", + return_projection = TRUE, + center = TRUE, scale = TRUE + ) + + dsm_euclid_xu <- dissimilarity( + Xr = Xu[1:10, ], + diss_method = "euclid", + center = FALSE, scale = FALSE + )$dissimilarity + dsm_euclid_xu <- ((dsm_euclid_xu^2) * ncol(Xu))^0.5 + dist_euclid_xu <- as.matrix(dist(Xu[1:10, ])) + + dsm_cor <- dissimilarity( + Xr = Xr, Xu = Xu, + diss_method = "cor", + return_projection = TRUE, + ws = 11, + center = TRUE, scale = FALSE + ) + + output_names_pca <- names(dsm_pca) + output_names_pls <- names(dsm_pls) + expected_names <- c("dissimilarity", "projection", "gh", "documentation") + + expect_is(dsm_pca, "list") + expect_is(dsm_pls, "list") + expect_is(dsm_pca_var, "list") + expect_is(dsm_pls_var, "list") + expect_is(dsm_euclid, "list") + 
expect_is(dsm_euclid, "list") + expect_true(dsm_pca$projection$n_components == expected_n_comp) + expect_true(dsm_pls$projection$n_components == expected_n_pls) + expect_true(dsm_pca_var$projection$n_components == expected_n_comp_var) + expect_true(dsm_pls_var$projection$n_components == expected_n_pls_var) + expect_true(all(expected_names %in% output_names_pca)) + expect_true(all(expected_names %in% output_names_pls)) + expect_true(sum(abs(round(dist_euclid_xu - dsm_euclid_xu, 5))) == 0) +}) + + + + + + + + + + + + + + +test_that("dissimilarity large sets works", { + skip_on_cran() + skip_on_travis() + nirdata <- data("NIRsoil", package = "prospectr") Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ] Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)] @@ -24,14 +133,13 @@ test_that("dissimilarity works", { ) expected_n_comp <- 24 dsm_pls <- dissimilarity( - Xr = Xr, Xu = Xu, + Xr = standardNormalVariate(Xr), Xu = standardNormalVariate(Xu), diss_method = c("pls"), Yr = Yr, gh = TRUE, pc_selection = list("opc", 30), return_projection = TRUE, center = TRUE, scale = TRUE ) - expected_n_pls <- 14 - + expected_n_pls <- 10 dsm_pca_var <- dissimilarity( Xr = Xr, Xu = Xu, diff --git a/tests/testthat/test-mbl.R b/tests/testthat/test-mbl.R index 67280a4..7e1c4bd 100644 --- a/tests/testthat/test-mbl.R +++ b/tests/testthat/test-mbl.R @@ -2,88 +2,94 @@ context("test-mbl") test_that("mbl works", { nirdata <- data("NIRsoil", package = "prospectr") - + Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ] Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)] - + Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)] Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ] - + Xu <- Xu[!is.na(Yu), ][1:20, ] Xr <- Xr[!is.na(Yr), ][1:40, ] - + Yu <- Yu[!is.na(Yu)][1:20] Yr <- Yr[!is.na(Yr)][1:40] - + k_test <- seq(25, 35, by = 10) k_diss_test <- 0.1 k_range_test <- c(15, 30) - + ctrl_1 <- mbl_control( validation_type = c("NNv", "local_cv"), number = 4, p = 0.5, - progress = FALSE, allow_parallel = FALSE ) - + gpr <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k = k_test, method = local_fit_gpr(), control = ctrl_1, + verbose = FALSE ) - + pls <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k = k_test, method = local_fit_pls(5), control = ctrl_1, + verbose = FALSE ) - + wapls <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k = k_test, method = local_fit_wapls(3, 5), - control = ctrl_1 + control = ctrl_1, + verbose = FALSE ) - + gpr_k_diss <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k_diss = k_diss_test, k_range = k_range_test, method = local_fit_gpr(), control = ctrl_1, + verbose = FALSE ) - + pls_k_diss <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k_diss = k_diss_test, k_range = k_range_test, method = local_fit_pls(5), control = ctrl_1, + verbose = FALSE ) - + wapls_k_diss <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k_diss = k_diss_test, k_range = k_range_test, method = local_fit_wapls(3, 5), - control = ctrl_1 + control = ctrl_1, + verbose = FALSE ) - - + + group_test <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k = k_test, method = local_fit_pls(5), control = ctrl_1, - group = rep(c(1, 2), length.out = nrow(Xr)) + group = rep(c(1, 2), length.out = nrow(Xr)), + verbose = FALSE ) - + output_names <- names(gpr) expected_names <- c( "call", "cntrl_param", "dissimilarities", "Xu_neighbors", "n_predictions", "gh", "validation_results", "results", "documentation" ) - + expect_is(gpr, "list") expect_is(pls, "list") expect_is(wapls, "list") @@ -100,173 +106,180 @@ test_that("mbl delivers expeted results", { require(prospectr) nirdata <- data("NIRsoil", package = 
"prospectr") NIRsoil$spc <- prospectr::savitzkyGolay(NIRsoil$spc, p = 3, w = 11, m = 0) - + Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ] Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)] - + Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)] Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ] - + Xu <- Xu[!is.na(Yu), ] Xr <- Xr[!is.na(Yr), ] - + Yu <- Yu[!is.na(Yu)] Yr <- Yr[!is.na(Yr)] - + # tune locally ctrl_1 <- mbl_control( validation_type = c("NNv", "local_cv"), number = 4, p = 0.8, tune_locally = TRUE, - progress = FALSE, allow_parallel = FALSE ) - + k_test <- c(40, 150) k_diss_test <- 0.1 k_range_test <- c(20, 100) pls_wapls <- c(3, 15) pls_pls <- c(10) grpnoisevar <- 0.0001 - + tseed <- 141020 - + set.seed(tseed) gpr <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k = k_test, method = local_fit_gpr(grpnoisevar), control = ctrl_1, + verbose = FALSE ) - + set.seed(tseed) pls <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k = k_test, method = local_fit_pls(pls_pls), control = ctrl_1, + verbose = FALSE ) - + set.seed(tseed) wapls <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k = k_test, method = local_fit_wapls(pls_wapls[1], pls_wapls[2]), - control = ctrl_1 + control = ctrl_1, + verbose = FALSE ) - + set.seed(tseed) gpr_k_diss <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k_diss = k_diss_test, k_range = k_range_test, method = local_fit_gpr(), control = ctrl_1, + verbose = FALSE ) - + set.seed(tseed) pls_k_diss <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k_diss = k_diss_test, k_range = k_range_test, method = local_fit_pls(pls_pls), control = ctrl_1, + verbose = FALSE ) - + set.seed(tseed) wapls_k_diss <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k_diss = k_diss_test, k_range = k_range_test, method = local_fit_wapls(pls_wapls[1], pls_wapls[2]), - control = ctrl_1 + control = ctrl_1, + verbose = FALSE ) - + set.seed(tseed) - xgroup <- rep((1:(floor(nrow(Xr)/2))), each = 2) + xgroup <- rep((1:(floor(nrow(Xr) / 2))), each = 2) pls_group <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k = k_test, method = local_fit_pls(pls_pls), - control = ctrl_1, group = xgroup + control = ctrl_1, group = xgroup, + verbose = FALSE ) - - set.seed(tseed) + + set.seed(tseed) pls_group_local <- mbl( Xr = Xr, Yr = Yr, Xu = Xu, Yu = Yu, k = k_test, method = local_fit_pls(pls_pls), control = ctrl_1, group = xgroup, - .local = TRUE, pre_k = 200 + .local = TRUE, pre_k = 200, + verbose = FALSE ) - - + + cv_gpr <- c( gpr$validation_results$local_cross_validation$rmse < 1.8, gpr$validation_results$local_cross_validation$rmse > 1.5 ) - + cv_pls <- c( pls$validation_results$local_cross_validation$rmse < 1.8, pls$validation_results$local_cross_validation$rmse > 1.4 ) - + cv_wapls <- c( wapls$validation_results$local_cross_validation$rmse < 1.8, wapls$validation_results$local_cross_validation$rmse > 1.4 ) - + cv_gpr_k_diss <- c( gpr_k_diss$validation_results$local_cross_validation$rmse < 1.9, gpr_k_diss$validation_results$local_cross_validation$rmse > 1.6 ) - + cv_pls_k_diss <- c( pls_k_diss$validation_results$local_cross_validation$rmse < 1.7, pls_k_diss$validation_results$local_cross_validation$rmse > 1.4 ) - + cv_wapls_k_diss <- c( wapls_k_diss$validation_results$local_cross_validation$rmse < 2, wapls_k_diss$validation_results$local_cross_validation$rmse > 1.4 ) - + cv_group <- c( pls_group$validation_results$local_cross_validation$rmse < 2, pls_group$validation_results$local_cross_validation$rmse > 1.4 ) - + cv_group_local <- c( pls_group_local$validation_results$local_cross_validation$rmse < 2, 
pls_group_local$validation_results$local_cross_validation$rmse > 1 ) - + nnv_group_local <- pls_group_local$validation_results$nearest_neighbor_validation$r2 > 0.4 - + nnv_group <- pls_group$validation_results$nearest_neighbor_validation$r2 > 0.7 - + nnv_gpr <- gpr$validation_results$nearest_neighbor_validation$r2 > 0.81 - + nnv_pls <- pls$validation_results$nearest_neighbor_validation$r2 > 0.74 - + nnv_wapls <- wapls$validation_results$nearest_neighbor_validation$r2 > 0.80 - + nnv_gpr_k_diss <- gpr_k_diss$validation_results$nearest_neighbor_validation$r2 > 0.81 - + nnv_pls_k_diss <- pls_k_diss$validation_results$nearest_neighbor_validation$r2 > 0.81 - + nnv_wapls_k_diss <- wapls_k_diss$validation_results$nearest_neighbor_validation$r2 > 0.81 - - + + yuv_gpr <- gpr$validation_results$Yu_prediction_statistics$r2 > 0.72 - + yuv_pls <- pls$validation_results$Yu_prediction_statistics$r2 > 0.67 - + yuv_wapls <- wapls$validation_results$Yu_prediction_statistics$r2 > 0.69 - + yuv_gpr_k_diss <- gpr_k_diss$validation_results$Yu_prediction_statistics$r2 > 0.72 - + yuv_pls_k_diss <- pls_k_diss$validation_results$Yu_prediction_statistics$r2 > 0.60 - + yuv_wapls_k_diss <- wapls_k_diss$validation_results$Yu_prediction_statistics$r2 > 0.65 - + expect_true(all(cv_gpr)) expect_true(all(cv_pls)) expect_true(all(cv_wapls)) @@ -275,7 +288,7 @@ test_that("mbl delivers expeted results", { expect_true(all(cv_wapls_k_diss)) expect_true(all(cv_group)) expect_true(all(cv_group_local)) - + expect_true(all(nnv_gpr)) expect_true(all(nnv_pls)) expect_true(all(nnv_wapls)) @@ -284,7 +297,7 @@ test_that("mbl delivers expeted results", { expect_true(all(nnv_wapls_k_diss)) expect_true(all(nnv_group)) expect_true(all(nnv_group_local)) - + expect_true(all(yuv_gpr)) expect_true(all(yuv_pls)) expect_true(all(yuv_wapls)) diff --git a/tests/testthat/test-pc_projection.R b/tests/testthat/test-pc_projection.R index 517f138..ccf677f 100644 --- a/tests/testthat/test-pc_projection.R +++ b/tests/testthat/test-pc_projection.R @@ -1,7 +1,8 @@ context("test-pc_projection") + test_that("pc_projection works", { - # tolernce for results supposed to be 0s + # tolernce for results supposed to be 0s tol <- 1e-5 nirdata <- data("NIRsoil", package = "prospectr") @@ -13,15 +14,22 @@ test_that("pc_projection works", { Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ] Xu <- Xu[!is.na(Yu), ] - Xr <- Xr[!is.na(Yr), ] + y_sel <- !is.na(Yr) & !is.na(Yr_2) + Xr <- Xr[y_sel, ] Yu <- Yu[!is.na(Yu)] - Yr_2 <- Yr[!is.na(Yr)] - Yr <- Yr[!is.na(Yr)] + Yr_2 <- Yr_2[y_sel] + Yr <- Yr[y_sel] + Xu <- Xu[1:20,] + Yu <- Yu[1:20] + Xr <- Xr[1:40,] + Yr <- Yr[1:40] + Yr_2 <- Yr_2[1:40] + cumvar_value <- 0.999 - one_input_matrix <- pc_projection(Xr, + one_input_matrix <- pc_projection(Xr, pc_selection = list(method = "cumvar", value = cumvar_value), center = TRUE, scale = FALSE, method = "pca") @@ -48,35 +56,38 @@ test_that("pc_projection works", { preds <- sum(abs(predict(two_input_matrices)[1:nrow(Xr), ] - predict(two_input_matrices, Xr))) expect_true(preds < tol) - opc_method <- pc_projection(Xr, Xu, Yr = Yr, - pc_selection = list(method = "opc", value = 30), - center = TRUE, scale = FALSE, + opc_method <- pc_projection(Xr, Xu, + Yr = Yr, + pc_selection = list(method = "opc", value = 15), + center = TRUE, scale = TRUE, method = "pca") - opc_method_nipals <- pc_projection(Xr, Xu, Yr = Yr, + opc_method_nipals <- pc_projection(Xr, Xu, + Yr = Yr, pc_selection = list(method = "opc", value = 30), - center = TRUE, scale = FALSE, + center = TRUE, scale = TRUE, method = 
"pca.nipals") - expect_true(opc_method$n_components == which.min(opc_method$opc_evaluation[,2])) - expect_true(opc_method$n_components == 20) + expect_true(opc_method$n_components == which.min(opc_method$opc_evaluation[, 2])) + expect_true(opc_method$n_components == 7) # check that nipals is equivalent to svd method expect_true(opc_method$n_components == opc_method_nipals$n_components) - cor_equiv <- sapply(1:opc_method$n_components, - FUN = function(x, y, i) abs(cor(x[,i], y[,i])), + cor_equiv <- sapply(1:opc_method$n_components, + FUN = function(x, y, i) abs(cor(x[, i], y[, i])), x = opc_method_nipals$scores, - y = opc_method$scores) + y = opc_method$scores + ) expect_true(sum(1 - cor_equiv) < tol) - # check that the number of components for method = "cumvar" is properly - # obtained, this can be done with the results of opc_method as it selects more + # check that the number of components for method = "cumvar" is properly + # obtained, this can be done with the results of opc_method as it selects more # components than in the "cumvar" test - expect_true(sum(opc_method$variance[3, ] < cumvar_value) == two_input_matrices$n_components - 1) + expect_true(sum(opc_method$variance[3, ] < cumvar_value) == two_input_matrices$n_components - 1) # do the same for method = "var" - expect_true(sum(opc_method$variance[2,] > (1 - cumvar_value)) == two_input_matrices_var$n_components) + expect_true(sum(opc_method$variance[2, ] > (1 - cumvar_value)) == two_input_matrices_var$n_components) expect_true(ncol(two_input_matrices$scores) == two_input_matrices$n_components) @@ -86,11 +97,119 @@ test_that("pc_projection works", { bb <- cbind(name_test_yr = Yr, Yr_2) - opc_method_nipals <- pc_projection(Xr, Xu, Yr = bb, + opc_method_nipals <- pc_projection(Xr, Xu, + Yr = bb, pc_selection = list(method = "opc", value = 30), center = TRUE, scale = FALSE, - method = "pca.nipals") + method = "pca.nipals" + ) expect_true("rmsd_name_test_yr" %in% colnames(opc_method_nipals$opc_evaluation)) - +}) + + +test_that("pc_projection large sets works", { + skip_on_cran() + skip_on_travis() + # tolernce for results supposed to be 0s + tol <- 1e-5 + nirdata <- data("NIRsoil", package = "prospectr") + + Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ] + Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)] + + Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)] + Yr_2 <- NIRsoil$Ciso[as.logical(NIRsoil$train)] + Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ] + + Xu <- Xu[!is.na(Yu), ] + Xr <- Xr[!is.na(Yr), ] + + Yu <- Yu[!is.na(Yu)] + Yr_2 <- Yr[!is.na(Yr)] + Yr <- Yr[!is.na(Yr)] + + + cumvar_value <- 0.999 + one_input_matrix <- pc_projection(Xr, + pc_selection = list(method = "cumvar", value = cumvar_value), + center = TRUE, scale = FALSE, + method = "pca" + ) + + expect_true(ncol(one_input_matrix$scores) == one_input_matrix$n_components) + test_ncomp <- one_input_matrix$n_components - 1 + expect_true(all(one_input_matrix$variance[3, 1:test_ncomp] < cumvar_value)) + + two_input_matrices <- pc_projection(Xr, Xu, + pc_selection = list(method = "cumvar", value = cumvar_value), + center = TRUE, scale = FALSE, + method = "pca" + ) + + two_input_matrices_var <- pc_projection(Xr, Xu, + pc_selection = list(method = "var", value = 1 - cumvar_value), + center = TRUE, scale = FALSE, + method = "pca" + ) + + + expect_true(ncol(two_input_matrices$scores) == two_input_matrices$n_components) + two_test_ncomp <- two_input_matrices$n_components - 1 + expect_true(all(two_input_matrices$variance[3, 1:two_test_ncomp] < cumvar_value)) + + preds <- 
sum(abs(predict(two_input_matrices)[1:nrow(Xr), ] - predict(two_input_matrices, Xr))) + expect_true(preds < tol) + + opc_method <- pc_projection(Xr, Xu, + Yr = Yr, + pc_selection = list(method = "opc", value = 30), + center = TRUE, scale = FALSE, + method = "pca" + ) + + opc_method_nipals <- pc_projection(Xr, Xu, + Yr = Yr, + pc_selection = list(method = "opc", value = 30), + center = TRUE, scale = FALSE, + method = "pca.nipals" + ) + + expect_true(opc_method$n_components == which.min(opc_method$opc_evaluation[, 2])) + expect_true(opc_method$n_components == 20) + + # check that nipals is equivalent to svd method + expect_true(opc_method$n_components == opc_method_nipals$n_components) + + cor_equiv <- sapply(1:opc_method$n_components, + FUN = function(x, y, i) abs(cor(x[, i], y[, i])), + x = opc_method_nipals$scores, + y = opc_method$scores + ) + + expect_true(sum(1 - cor_equiv) < tol) + + # check that the number of components for method = "cumvar" is properly + # obtained, this can be done with the results of opc_method as it selects more + # components than in the "cumvar" test + expect_true(sum(opc_method$variance[3, ] < cumvar_value) == two_input_matrices$n_components - 1) + # do the same for method = "var" + expect_true(sum(opc_method$variance[2, ] > (1 - cumvar_value)) == two_input_matrices_var$n_components) + + + expect_true(ncol(two_input_matrices$scores) == two_input_matrices$n_components) + test_ncomp <- two_input_matrices$n_components - 1 + expect_true(all(two_input_matrices$variance[3, 1:test_ncomp] < cumvar_value)) + + + bb <- cbind(name_test_yr = Yr, Yr_2) + + opc_method_nipals <- pc_projection(Xr, Xu, + Yr = bb, + pc_selection = list(method = "opc", value = 30), + center = TRUE, scale = FALSE, + method = "pca.nipals" + ) + + expect_true("rmsd_name_test_yr" %in% colnames(opc_method_nipals$opc_evaluation)) }) diff --git a/tests/testthat/test-pls_projection.R b/tests/testthat/test-pls_projection.R index ac2f6a5..f88447c 100644 --- a/tests/testthat/test-pls_projection.R +++ b/tests/testthat/test-pls_projection.R @@ -1,27 +1,35 @@ context("test-pls_projection") test_that("pls_projection works", { - # tolernce for results supposed to be 0s + # tolernce for results supposed to be 0s tol <- 1e-5 nirdata <- data("NIRsoil", package = "prospectr") - + Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ] Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)] - + Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)] Yr_2 <- NIRsoil$Ciso[as.logical(NIRsoil$train)] Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ] - + Xu <- Xu[!is.na(Yu), ] y_sel <- !is.na(Yr) & !is.na(Yr_2) Xr <- Xr[y_sel, ] - + Yu <- Yu[!is.na(Yu)] Yr_2 <- Yr_2[y_sel] Yr <- Yr[y_sel] - - cumvar_value <- 0.991 - one_input_matrix <- pls_projection(Xr, Yr = Yr, + + Xu <- Xu[1:20,] + Yu <- Yu[1:20] + + Xr <- Xr[1:40,] + Yr <- Yr[1:40] + Yr_2 <- Yr_2[1:40] + + cumvar_value <- 0.74 + one_input_matrix <- pls_projection(Xr, + Yr = Yr, pc_selection = list(method = "cumvar", value = cumvar_value), center = TRUE, scale = FALSE) @@ -30,12 +38,12 @@ test_that("pls_projection works", { expect_true(all(one_input_matrix$variance$x_var[3, 1:test_ncomp] < cumvar_value)) two_input_matrices <- pls_projection(Xr, Xu, Yr, - pc_selection = list(method = "cumvar", value = cumvar_value), - center = TRUE, scale = FALSE) + pc_selection = list(method = "cumvar", value = cumvar_value), + center = TRUE, scale = FALSE) two_input_matrices_var <- pls_projection(Xr, Xu, Yr, - pc_selection = list(method = "var", value = 1 - cumvar_value), - center = TRUE, scale = FALSE) + 
pc_selection = list(method = "var", value = 1 - cumvar_value), + center = TRUE, scale = FALSE) two_test_ncomp <- two_input_matrices$n_components - 1 expect_true(ncol(two_input_matrices$scores) == two_input_matrices$n_components) @@ -43,39 +51,145 @@ test_that("pls_projection works", { preds <- sum(abs(predict(two_input_matrices)[1:nrow(Xr), ] - predict(two_input_matrices, Xr))) expect_true(preds < tol) - - opc_method <- pls_projection(Xr, Xu, Yr = Yr, - pc_selection = list(method = "opc", value = 20), - center = TRUE, scale = FALSE) - expect_true(opc_method$n_components == which.min(opc_method$opc_evaluation[,2])) - expect_true(opc_method$n_components == 12) + opc_method <- pls_projection(Xr, Xu, + Yr = Yr, + pc_selection = list(method = "opc", value = 15), + center = TRUE, scale = TRUE + ) - # check that the number of components for method = "cumvar" is properly - # obtained, this can be done with the results of opc_method as it selects more + expect_true(opc_method$n_components == which.min(opc_method$opc_evaluation[, 2])) + expect_true(opc_method$n_components == 8) + + # check that the number of components for method = "cumvar" is properly + # obtained, this can be done with the results of opc_method as it selects more # components than in the "cumvar" test - expect_true(sum(opc_method$variance$x_var[3,] < cumvar_value) == two_input_matrices$n_components - 1) + expect_true(sum(opc_method$variance$x_var[3, ] < cumvar_value) == two_input_matrices$n_components - 1) # do the same for method = "var" - expect_true(sum(opc_method$variance$x_var[2,] > (1 - cumvar_value)) == two_input_matrices_var$n_components) + expect_true(sum(opc_method$variance$x_var[2, ] > (1 - cumvar_value)) == two_input_matrices_var$n_components) expect_true(ncol(two_input_matrices$scores) == two_input_matrices$n_components) two_test_ncnomp <- two_input_matrices$n_components - 1 expect_true(all(two_input_matrices$variance$x_var[3, 1:two_test_ncnomp] < cumvar_value)) - opc_method <- pls_projection(Xr, Xu, Yr = Yr, - pc_selection = list(method = "opc", value = 20), - center = TRUE, scale = FALSE) + opc_method <- pls_projection(Xr, Xu, + Yr = Yr, + pc_selection = list(method = "opc", value = 15), + center = TRUE, scale = TRUE) ## pls2 - pls2_opc_method <- pls_projection(Xr, Xu, Yr = cbind(Yr, Yr_2), - pc_selection = list(method = "opc", value = 20), - center = TRUE, scale = FALSE) + pls2_opc_method <- pls_projection(Xr, Xu, + Yr = cbind(Yr, Yr_2), + pc_selection = list(method = "opc", value = 15), + center = TRUE, scale = TRUE) - expect_true(pls2_opc_method$n_components == which.min(pls2_opc_method$opc_evaluation[,4])) - distm <- as.matrix(dist(scale(pls2_opc_method$scores[1:nrow(Xr),], TRUE, TRUE))) + expect_true(pls2_opc_method$n_components == which.min(pls2_opc_method$opc_evaluation[, 4])) + distm <- as.matrix(dist(scale(pls2_opc_method$scores[1:nrow(Xr), ], TRUE, TRUE))) + distm2 <- f_diss(pls2_opc_method$scores[1:nrow(Xr), ], scale = TRUE, center = TRUE) nn <- apply(distm, MARGIN = 2, FUN = function(x) order(x)[2]) - result_rmsd <- colMeans((cbind(Yr, Yr_2) - cbind(Yr, Yr_2)[nn,])^2)^0.5 - expect_true(all(pls2_opc_method$opc_evaluation[pls2_opc_method$n_components,2:3] == result_rmsd)) + distm + + result_rmsd <- as.vector(round(colMeans((cbind(Yr, Yr_2) - cbind(Yr, Yr_2)[nn, ])^2)^0.5, 4)) + + resemble_rmsds <- as.vector(round(pls2_opc_method$opc_evaluation[pls2_opc_method$n_components, 2:3], 4)) + + expect_true(all(resemble_rmsds == result_rmsd)) +}) + + + +test_that("pls_projection large sets works", { + 
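+ # Heavier, full-data variant of the previous test; the skips below keep it
+ # off CRAN and Travis, where check time is limited.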
skip_on_cran() + skip_on_travis() + # tolerance for results supposed to be 0s + tol <- 1e-5 + nirdata <- data("NIRsoil", package = "prospectr") + + Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ] + Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)] + + Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)] + Yr_2 <- NIRsoil$Ciso[as.logical(NIRsoil$train)] + Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ] + + Xu <- Xu[!is.na(Yu), ] + y_sel <- !is.na(Yr) & !is.na(Yr_2) + Xr <- Xr[y_sel, ] + + Yu <- Yu[!is.na(Yu)] + Yr_2 <- Yr_2[y_sel] + Yr <- Yr[y_sel] + + cumvar_value <- 0.991 + one_input_matrix <- pls_projection(Xr, + Yr = Yr, + pc_selection = list(method = "cumvar", value = cumvar_value), + center = TRUE, scale = FALSE + ) + + test_ncomp <- one_input_matrix$n_components - 1 + expect_true(ncol(one_input_matrix$scores) == one_input_matrix$n_components) + expect_true(all(one_input_matrix$variance$x_var[3, 1:test_ncomp] < cumvar_value)) + + two_input_matrices <- pls_projection(Xr, Xu, Yr, + pc_selection = list(method = "cumvar", value = cumvar_value), + center = TRUE, scale = FALSE + ) + + two_input_matrices_var <- pls_projection(Xr, Xu, Yr, + pc_selection = list(method = "var", value = 1 - cumvar_value), + center = TRUE, scale = FALSE + ) + + two_test_ncomp <- two_input_matrices$n_components - 1 + expect_true(ncol(two_input_matrices$scores) == two_input_matrices$n_components) + expect_true(all(two_input_matrices$variance$x_var[3, 1:two_test_ncomp] < cumvar_value)) + + preds <- sum(abs(predict(two_input_matrices)[1:nrow(Xr), ] - predict(two_input_matrices, Xr))) + expect_true(preds < tol) + + opc_method <- pls_projection(Xr, Xu, + Yr = Yr, + pc_selection = list(method = "opc", value = 20), + center = TRUE, scale = FALSE + ) + + expect_true(opc_method$n_components == which.min(opc_method$opc_evaluation[, 2])) + expect_true(opc_method$n_components == 12) + + # check that the number of components for method = "cumvar" is properly + # obtained, this can be done with the results of opc_method as it selects more + # components than in the "cumvar" test + expect_true(sum(opc_method$variance$x_var[3, ] < cumvar_value) == two_input_matrices$n_components - 1) + # do the same for method = "var" + expect_true(sum(opc_method$variance$x_var[2, ] > (1 - cumvar_value)) == two_input_matrices_var$n_components) + + + expect_true(ncol(two_input_matrices$scores) == two_input_matrices$n_components) + two_test_ncomp <- two_input_matrices$n_components - 1 + expect_true(all(two_input_matrices$variance$x_var[3, 1:two_test_ncomp] < cumvar_value)) + + opc_method <- pls_projection(Xr, Xu, + Yr = Yr, + pc_selection = list(method = "opc", value = 20), + center = TRUE, scale = FALSE + ) + + ## pls2 + pls2_opc_method <- pls_projection(Xr, Xu, + Yr = cbind(Yr, Yr_2), + pc_selection = list(method = "opc", value = 20), + center = TRUE, scale = FALSE + ) + + expect_true(pls2_opc_method$n_components == which.min(pls2_opc_method$opc_evaluation[, 4])) + distm <- as.matrix(dist(scale(pls2_opc_method$scores[1:nrow(Xr), ], TRUE, TRUE))) + nn <- apply(distm, MARGIN = 2, FUN = function(x) order(x)[2]) + + result_rmsd <- as.vector(round(colMeans((cbind(Yr, Yr_2) - cbind(Yr, Yr_2)[nn, ])^2)^0.5, 4)) + + resemble_rmsds <- as.vector(round(pls2_opc_method$opc_evaluation[pls2_opc_method$n_components, 2:3], 4)) + expect_true(all(resemble_rmsds == result_rmsd)) }) diff --git a/vignettes/MBL.gif b/vignettes/MBL.gif index 520755a..5b117ee 100644 Binary files a/vignettes/MBL.gif and b/vignettes/MBL.gif differ diff --git a/vignettes/logo.jpg b/vignettes/logo.jpg 
new file mode 100644 index 0000000..397753b Binary files /dev/null and b/vignettes/logo.jpg differ diff --git a/vignettes/resemble.Rmd b/vignettes/resemble.Rmd index ea1b576..0651359 100644 --- a/vignettes/resemble.Rmd +++ b/vignettes/resemble.Rmd @@ -1,15 +1,17 @@ --- title: "Modelling complex spectral data with the resemble package" author: - - name: Leonardo Ramirez-Lopez, Alexandre M.J.-C. Wadoux + - name: Leonardo Ramirez-Lopez and Alexandre M.J.-C. Wadoux email: ramirez.lopez.leo@gmail.com date: "`r Sys.Date()`" +clean: true bibliography: ["resemble.bib"] biblio-style: "apalike" link-citations: true -output: +output: bookdown::html_document2: theme: united + highlight: breezedark toc: true toc_depth: 3 toc_float: @@ -33,7 +35,7 @@ knitr::opts_chunk$set(

Think Globally, Fit Locally [@saul2003think]

```{r classdiagram, echo = FALSE, out.width = '20%', fig.align = 'right'} -knitr::include_graphics("logo.png") +knitr::include_graphics("logo.jpg") ``` # Introduction @@ -46,19 +48,18 @@ compositional characteristics (chemical, physical or biological) of the matrix. Therefore, it is possible to develop empirical models that can accurately quantify properties of different matrices. In this respect, quantitative spectroscopy techniques are usually fast, non-destructive and cost-efficient in -comparison to conventional laboratory methods used in the analyses of these +comparison to conventional laboratory methods used in the analyses of such matrices. This has resulted in the development of comprehensive spectral databases for several agricultural products comprising large amounts of observations. The size of such databases increases *de facto* their -complexity. To analyze large and complex spectral data, one must then resort -numerical and statistical tools such as dimensionality reduction, and local -spectroscopic modelling based on spectral dissimilarity concepts. +complexity. To analyze large and complex spectral data, one must then resort to +numerical and statistical tools and methods such as dimensionality reduction, +and local spectroscopic modeling based on spectral dissimilarity concepts. The aim of the `resemble` package is to provide tools to efficiently and accurately extract meaningful quantitative information from large and complex -spectral databases. The package contains functions for dimensionality reduction, -spectral dissimilarity measurements, neighbour search, and local modeling. -The core functionalities of the package include: +spectral databases. The core functionalities of the package include: + * dimensionality reduction * computation of dissimilarity measures @@ -66,33 +67,47 @@ The core functionalities of the package include: * spectral neighbour search * fitting and predicting local spectroscopic models +# Citing the package +Simply type the following and you will get the citation info you need: +```{r eval = TRUE} +citation(package = "resemble") +``` + + # Example dataset -This vignette uses a soil spectroscopic dataset provided in the +This vignette uses the soil Near-Infrared (NIR) spectral dataset provided in the [package `prospectr`](https://CRAN.R-project.org/package=prospectr) -package [@stevens2020introduction]. It is a soil spectral library used in the -\sQuote{Chimiometrie 2006} challenge by @pierna2008soil. The library contains -absorbance spectra of dried and sieved 825 soil observations/samples. These -samples originate from agricultural fields collected from all over the Walloon -region in Belgium. The dataset is in a data frame which is organized as follows: +package [@stevens2020introduction]. We use this dataset because soils are among +the most complex matrices analyzed with NIR spectroscopy. This +spectral dataset/library was used in the \sQuote{Chimiometrie 2006} challenge by +@pierna2008soil. The library contains NIR absorbance spectra of 825 +dried and sieved soil observations/samples. These samples originate from agricultural fields +collected from all over the Walloon region in Belgium. The data are in an `R` +`data.frame` object which is organized as follows: * __Response variables__: - * ___Nt___ (Total Nitrogen in g/Kg of dry soil): A numerical variable (values + + * ___Nt___ (Total Nitrogen in g/kg of dry soil): a numerical variable (values are available for 645 samples and missing for 180 samples). 
- * ___Ciso___ (Carbon in g/100 g of dry soil): A numerical + + * ___Ciso___ (Carbon in g/100 g of dry soil): a numerical variable (values are available for 732 and missing for 93 samples). + * ___CEC___ (Cation Exchange Capacity in meq/100 g of dry soil): A numerical variable (values are available for 447 and missing for 378 samples). -* __Predictor variables__: The predictor variables are in a matrix embedded in the -data frame, which can be accessed via `NIRSoil$spc`. These variables contain the -absorbance Near-Infrared (NIR) spectra of the samples recorded between the -1100 nm and 2498 nm of the electromagnetic spectrum at 2 nm interval. Each -column name in the matrix of spectra represent a specific wavelength (in nm). -* __Set__: this is a "binary" variable that indicates the 618 samples belong to the -training subset (represented by 1) and the 207 samples that belong to the test -subset (represented by 0). - -Load the necessary packages and data. + +* __Predictor variables__: the predictor variables are in a matrix embedded in + the data frame, which can be accessed via `NIRsoil$spc`. These variables + contain the NIR absorbance spectra of the samples recorded between + 1100 nm and 2498 nm of the electromagnetic spectrum at 2 nm intervals. Each + column name in the matrix of spectra represents a specific wavelength (in nm). + +* __Set__: a binary variable that indicates whether the samples belong to the + training subset (represented by 1, 618 samples) or to the test + subset (represented by 0, 207 samples). + +Load the necessary packages and data: ```{r libraries, tidy = TRUE, message = FALSE} library(resemble) library(prospectr) @@ -109,7 +124,7 @@ str(NIRsoil) # Spectra pre-processing This step aims at improving the signal quality of the spectra for quantitative analysis. In this respect, the following standard methods are applied using the -package `prospectr` [@stevens2020introduction]: +[package `prospectr`](https://CRAN.R-project.org/package=prospectr) [@stevens2020introduction]: 1. Resampling from a resolution of 2 nm to a resolution of 5 nm. 2. First derivative using Savitzky-Golay filtering [@Savitzky1964]. @@ -130,14 +145,9 @@ NIRsoil$spc_p <- NIRsoil$spc %>% resample(wav = wavs, new.wav = seq(min(wavs), max(wavs), by = new_res)) %>% savitzkyGolay(p = poly_order, w = window, m = diff_order) ``` - -Both the raw absorbance spectra and the first derivative spectra are shown in -Figure \ref(fig:plotspectra). The first derivative spectra represents the -explanatory variables that will be used for all the examples throughout this -document. 
-```{r plotspectra, fig.cap = "Raw spectral absorbance data (top) and first derivative of the absorbance spectra (bottom).", fig.cap.style = "Image Caption", fig.align = "center", fig.width = 8, fig.height = 8, echo = FALSE} -op <- par() -par(mfrow = c(2, 1), mar = c(4,4,1,4)) +```{r plotspectra, fig.cap = "Raw spectral absorbance data (top) and first derivative of the absorbance spectra (bottom).", fig.cap.style = "Image Caption", fig.align = "center", fig.width = 7, fig.height = 7, echo = FALSE, fig.retina = 0.85} +old_par <- par("mfrow", "mar") +par(mfrow = c(2, 1), mar = c(4, 4, 1, 4)) new_wavs <- as.matrix(as.numeric(colnames(NIRsoil$spc_p))) plot(range(wavs), range(NIRsoil$spc), col = NA, @@ -155,6 +165,7 @@ rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = "#EFBF478 grid(lty = 1, col = "#E47C4E80") matlines(x = new_wavs, y = t(NIRsoil$spc_p), lty = 1, col = "#5177A133") +par(old_par) ``` ```{r eval = FALSE} @@ -171,17 +182,22 @@ matplot(x = new_wavs, y = t(NIRsoil$spc_p), type = "l", lty = 1, col = "#5177A133") ``` -For more explicit examples, the `NIRSoil` data is split into training and +Both the raw absorbance spectra and the first derivative spectra are shown in +Figure \@ref(fig:plotspectra). The first derivative spectra represent the +explanatory variables that will be used for all the examples throughout this +document. + +For more explicit examples, the `NIRsoil` data is split into training and testing subsets: ```{r} # training dataset training <- NIRsoil[NIRsoil$train == 1, ] +# testing dataset testing <- NIRsoil[NIRsoil$train == 0, ] ``` -Note that in the resemble package we follow the notation provided by -@ramirez2013spectrum, i.e.: +In the resemble package we use the following notation [@ramirez2013spectrum]: * __Training observations__: @@ -189,49 +205,48 @@ Note that in the resemble package we follow the notation provided by set (spectral data for calibration). * `Yr` stands for the response variable(s) in the reference/training set -(dependent variable for calibration).In the context of this package, `Yr` is +(dependent variable for calibration). In the context of this package, `Yr` is also referred to as "___side information___", which is a variable or set of variables that are associated to the training observations that can also be used -to support or guide optimization during modeling but that not necessarily are +to support or guide optimization during modeling, but that are not necessarily part of the input of such models. For example, we will see in later sections -that `Yr` can be used in Principal Component Analysis to help to decide how +that `Yr` can be used in Principal Component Analysis to help decide how many components are optimal. * __Testing observations__: * `Xu` stands for the matrix of predictor variables in the unknown/test -set (spectral data for validation/esting). +set (spectral data for validation/testing). * `Yu` stands for the response variable(s) in the unknown/test set (dependent -variable for calibration). +variable for testing). # Dimensionality reduction -When conducting exploratory analysis of spectral data, we are immediately -burdened with the issue of high dimensionality. It is such that we may be -dealing with (using NIR spectra data as an example) hundreds to thousands of -individual wavelengths for each spectrum. 
When one wants to investigate patterns -in the data, spectral similarities and differences, or detect spectral outliers, -it is necessary to reduce the dimension of the spectra while retaining important -information. - -Principal component (PC) analysis and Partial Least Squares (PLS) decomposition +When conducting exploratory analysis of spectral data, we face the curse of +dimensionality: we may be dealing with (using NIR spectra data +as an example) hundreds to thousands of individual wavelengths for each +spectrum. When one wants to find patterns in the data, spectral similarities and +differences, or detect spectral outliers, it is necessary to reduce the +dimension of the spectra while retaining important information. + +Principal Component (PC) analysis and Partial Least Squares (PLS) decomposition methods assume that the meaningful structure of the data intrinsically lies on a lower dimensional space. Both methods attempt to find a projection matrix -that projects or converts the original variables onto a new and less complex -space represented by few variables. These new variables mimic the original +that projects the original variables onto a less complex +space represented by a few new variables. These new variables mimic the original variability across observations. These two methods can be considered as the standard ones for dimensionality reduction in many fields of spectroscopic analysis. -The difference between PC and PLS is that in the first the objective is to +The difference between PC and PLS is that in PC the objective is to find few new variables (which are orthogonal) that capture as much of the original data variance as possible, while in the latter the objective is to find few new variables that maximize their variance with respect to a set of one or more external variables (e.g. response variables or side information variables). A schematic formulation of both criteria is given after the list of algorithms below. ## Methods -In the `resemble` package PC analysis and PLS decomposition are available +In the `resemble` package, PC analysis and PLS decomposition are available through the `ortho_projection()` function which offers the following algorithms: * `"pca"`: the standard method for PC analysis based on the singular value @@ -242,7 +257,7 @@ least squares algorithm [NIPALS, @wold1975soft] for the purpose of PC analysis. * `"pls"`: Here, PLS decomposition also uses the NIPALS algorithm, but in this case it makes use of ___side information___, which can be a variable or set of -variables that are associated to the training observations and that are used to +variables that are __associated to the training observations__ and that are used to project the data. In this case, the variance between the projected variables and the ___side information___ variable(s) is maximized. 
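+Schematically, the two objectives described above can be contrasted as
+follows (a simplified formulation of the standard criteria, not necessarily
+the exact computations implemented by the package). For a centered matrix
+$X$ and a side information variable $y$, the first projection vector $w$ of
+each method solves:
+
+$$w_{pc} = \underset{\|w\| = 1}{\arg\max} \; var(Xw)$$
+
+$$w_{pls} = \underset{\|w\| = 1}{\arg\max} \; cov(Xw, y)^2$$
+
+Subsequent vectors are obtained under orthogonality constraints (e.g. via
+deflation in the NIPALS algorithm).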
@@ -255,9 +270,9 @@ pca_tr <- ortho_projection(Xr = training$spc_p, method = "pca") pca_tr ``` -plot the `ortho_projection` object: +Plot the `ortho_projection` object: -```{r plotpcsvariance, fig.cap = "Cumulative explained variance of the principal components (left) and individual contribution the the explained variance for each of the components (right).", fig.cap.style = "Image Caption", fig.align = "center", fig.width = 7, fig.height = 3} +```{r plotpcsvariance, fig.cap = "Individual contribution to the explained variance for each component (left) and cumulative variance explained by the principal components (right).", fig.cap.style = "Image Caption", fig.align = "center", fig.width = 7, fig.height = 3, fig.retina = 0.85} plot(pca_tr, col = "#D42B08CC") ``` @@ -265,10 +280,10 @@ The code above shows that in this dataset, `r pca_tr$n_components` components are required to explain around `r round(100 * sum(pca_tr$variance[2,]), 0)`% of the original variance found in the spectra (Figure \@ref(fig:plotpcsvariance)). -Equivalent results can be obtained by using the NIPALS algorithm: +Equivalent results can be obtained with the NIPALS algorithm: ```{r, results = 'hide'} # principal component (pc) analysis with the -# method (singular value decomposition) +# NIPALS algorithm pca_nipals_tr <- ortho_projection(Xr = training$spc_p, method = "pca.nipals") @@ -285,12 +300,17 @@ information used is the Total Carbon (`Ciso`): # Partial Least Squares decomposition using # Total carbon as side information # (this might take some seconds) -pls_tr <- ortho_projection(Xr = training$spc_p[!is.na(training$Ciso),], - Yr = training$Ciso[!is.na(training$Ciso)], +pls_tr <- ortho_projection(Xr = training$spc_p, + Yr = training$Ciso, method = "pls") pls_tr ``` +Note that in the previous code, for PLS projection the observations with missing +`training$Ciso` are held out, and then the projection takes place. The +missing observations are projected with the resulting projection matrix and +pooled together with the initial results. + By default the `ortho_projection()` function retains all the first components that, alone, account for at least 1% of the original variance of data. In the following section we will see that the function also offers additional @@ -307,7 +327,7 @@ Those components that alone explain more than a given amount of the original spectral variance are retained. Example: ```{r, results = 'hide', eval = FALSE} -# This retains components that anlone explain at more than 5% of the original +# This retains components that alone explain at least 5% of the original # variation in training$spc_p var_sel <- list(method = "var", value = 0.05) pca_tr_minvar5 <- ortho_projection(Xr = training$spc_p, @@ -322,21 +342,21 @@ pca_tr_minvar5 Only the first components that together explain at least a given amount of the original variance are retained. 
Example: ```{r, results = 'hide', eval = FALSE} -# This retains components that together explain at least 95% of the original +# This retains components that together explain at least 90% of the original # variation in training$spc_p cumvar_sel <- list(method = "cumvar", value = 0.90) -pca_tr_cumvar95 <- ortho_projection(Xr = training$spc_p, +pca_tr_cumvar90 <- ortho_projection(Xr = training$spc_p, method = "pca", pc_selection = cumvar_sel) -pca_tr_cumvar95 +pca_tr_cumvar90 ``` ### _Optimal component selection_ `"opc"`: This is a more sophisticated method in which the selection of the components is based on the side information concept presented in @ramirez2013spectrum. First -let be $P$ a sequence of retained components (so that $P = 1, 2, ...,k$). At +let $P$ be a sequence of retained components (so that $P = 1, 2, ..., k$). At each iteration, the function computes a dissimilarity matrix retaining $p_i$ components. The values in this side information variable are compared against the side information values of their most spectrally similar observations. @@ -364,7 +384,7 @@ on how the RMSD and kappa are calculated in the function. The rationale behind the `"opc"` method is based on the assumption that the closer two observations are in terms of their explanatory variables (`Xr`), the -closer they may be in terms of their side information (`Yr`). +closer they may be in terms of their side information (`Yr`). ```{r, results = 'hide'} @@ -382,12 +402,12 @@ represent the space in which the overall Total Carbon difference between each sample and its corresponding nearest neighbor is minimized. The following graph shows how the RMSD varies as a function of the number of components (Figure \@ref(fig:pcrmsd)): -```{r pcrmsd, fig.cap = "Root mean squared difference between the samples and their corresponding nearest neighbors (for Total Carbon as side finormation) found by using dissimilarity matrices computed with different number of PCs.", fig.id = "plot_pcs_opc", fig.cap.style = "Image Caption", fig.align = "center", fig.width = 5, fig.height = 4} +```{r pcrmsd, fig.cap = "Root mean squared difference between the samples and their corresponding nearest neighbors (for Total Carbon as side information) found by using dissimilarity matrices computed with different numbers of PCs.", fig.id = "plot_pcs_opc", fig.cap.style = "Image Caption", fig.align = "center", fig.width = 5, fig.height = 4, fig.retina = 0.85} plot(pca_tr_opc, col = "#FF1A00CC") ``` The following code exemplifies how the RMSD is calculated (only for the `r pca_tr_opc$n_components`th component, Figure \@ref(fig:rmsdscatter)): -```{r rmsdscatter, fig.cap = paste("Comparison between each sample and its corresponding nearest neighbor (in terms of Total Carbon) when ", pca_tr_opc$n_components, "are used for dissimilarity matrix computations."), fig.id = "plot_pcs_opc2", fig.cap.style = "Image Caption", fig.align = "center", fig.width = 4, fig.height = 4} +```{r rmsdscatter, fig.cap = paste("Comparison between each sample and its corresponding nearest neighbor (in terms of Total Carbon) when ", pca_tr_opc$n_components, "components are used for dissimilarity matrix computations."), fig.id = "plot_pcs_opc2", fig.cap.style = "Image Caption", fig.align = "center", fig.width = 4, fig.height = 4, fig.retina = 0.85} # compute the dissimilarity matrix using all the retained scores pc_diss <- f_diss(pca_tr_opc$scores, diss_method = "mahalanobis") # get the nearest neighbor for each sample @@ -411,11 +431,13 @@ The user explicitly defines how many components 
to retrieve. Example: ```{r, results = 'hide'} # This uses manual component selection manual_sel <- list(method = "manual", value = 9) +# PC pca_tr_manual <- ortho_projection(Xr = training$spc_p, method = "pca", pc_selection = manual_sel) pca_tr_manual +# PLS pls_tr_manual <- ortho_projection(Xr = training$spc_p, Yr = training$Ciso, method = "pls", @@ -434,6 +456,7 @@ used to project new data: ```{r, results = 'hide'} optimal_sel <- list(method = "opc", value = 40) +# PLS pls_tr_opc <- ortho_projection(Xr = training$spc_p, Yr = training$Ciso, method = "pls", @@ -444,6 +467,7 @@ pls_tr_opc$projection_mat pls_projected <- predict(pls_tr_opc, newdata = testing$spc_p) +# PC pca_tr_opc <- ortho_projection(Xr = training$spc_p, Yr = training$Ciso, method = "pca", @@ -457,7 +481,7 @@ pca_projected <- predict(pca_tr_opc, newdata = testing$spc_p) ## Projecting two separate datasets in one single run -The `ortho_projection` function allows to project two separate datasets in one +The `ortho_projection()` function allows projecting two separate datasets in one run. For example, training and testing data can be passed to the function as follows: ```{r, results = 'hide', eval = FALSE} @@ -471,13 +495,14 @@ pca_tr_ts <- ortho_projection(Xr = training$spc_p, plot(pca_tr_ts) ``` In the above code for PC analysis, `training` and `testing` datasets are pooled -together and then projected. For the `opc` selection method, the dissimilarity +together and then projected and split back to present the final results. +For the `opc` selection method, the dissimilarity matrices are built only for the `training` data and for the observations with available side information (Total Carbon). These dissimilarity matrices are used only to find the optimal number of PCs. Note that `Xr` and `Yr` refer to the same observations. Also note that the optimal number of PCs might not be the same as when `testing` is not passed to the `Xu` argument since the PC projection -model is built from different number of observations. +model is built from a different pool of observations. In the case of PLS, the observations used for projection necessarily have to have side information available, therefore the missing values in `Yr` are held @@ -493,7 +518,7 @@ pls_tr_ts <- ortho_projection(Xr = training$spc_p, pc_selection = optimal_sel, scale = TRUE) -# the same pls projection model can be obtained with: +# the same PLS projection model can be obtained with: pls_tr_ts2 <- ortho_projection(Xr = training$spc_p[!is.na(training$Ciso),], Yr = training$Ciso[!is.na(training$Ciso)], method = "pls", @@ -504,7 +529,7 @@ identical(pls_tr_ts$projection_mat, pls_tr_ts2$projection_mat) ``` ## Using more than one variable as side information -The `ortho_projection` function allows to pass more than one variable to `Yr` +The `ortho_projection()` function allows passing more than one variable to `Yr` (side information): ```{r, results = 'hide', eval = FALSE} @@ -519,10 +544,10 @@ plot(pls_multi_yr) ``` In the above code for PLS projections using multivariate side information, the -PLS2 method (based on the NIPALS algorithm) is used [See wold1983multivariate]. +PLS2 method (based on the NIPALS algorithm) is used [see @wold1983multivariate]. The optimal component selection (`opc`) also uses the multiple variables passed to `Yr`: the RMSD is computed for each of the variables. Each RMSD is then -standardized and the final RMSD used for optimization is an average of them. +standardized and the final RMSD used for optimization is their average. 
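+This aggregation can be sketched as follows (a schematic illustration only,
+in which `rmsd_mat` is a hypothetical matrix with one row per number of
+components tested and one column per `Yr` variable; the exact
+standardization applied internally by the package may differ):
+
+```{r, eval = FALSE}
+# standardize each RMSD column so that all Yr variables weigh equally
+std_rmsd_mat <- scale(rmsd_mat)
+
+# aggregate the standardized RMSDs across the Yr variables
+mean_std_rmsd <- rowMeans(std_rmsd_mat)
+
+# the number of components minimizing the aggregated RMSD is selected
+which.min(mean_std_rmsd)
+```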
For the example above, this data can be accessed as follows: ```{r, results = 'hide', eval = FALSE} @@ -538,12 +563,12 @@ RMSD. Similarity/dissimilarity measures between objects are often estimated by means of distance measurements, the closer two objects are to one another, the higher the similarity between them. Dissimilarity or distance measures are useful for -a number of applications, for example for outlier detection or nearest neighbors +a number of applications, for example for outlier detection and nearest neighbors search. The `dissimilarity()` function is the main function for measuring dissimilarities between observations. It is basically a wrapper to other existing dissimilarity -functions wthin the package (see `fdiss()`, `cor_diss()`, `sid()` and +functions within the package (see `f_diss()`, `cor_diss()`, `sid()` and `ortho_diss()`). It allows to compute dissimilarities between: * all the observations in a single matrix. @@ -553,10 +578,10 @@ functions wthin the package (see `fdiss()`, `cor_diss()`, `sid()` and The dissimilarity methods available in `dissimilarity()` are as follows (see `diss_method` argument): -* `"pca"`: Mahalanobis distance computed on the matrix of scores of a Principal -Component (PC) projection of `Xr` (and `Xu` if provided). PC projection is done -using the singular value decomposition (SVD) algorithm. Type `help(ortho_diss)` -for more details on the function called by this method. +* `"pca"`: Mahalanobis distance computed on the matrix of scores of a PC projection of +`Xr` (and `Xu` if provided). PC projection is done using the singular value +decomposition (SVD) algorithm. Type `help(ortho_diss)` for more details on the +function called by this method. * `"pca.nipals"`: Mahalanobis distance computed on the matrix of scores of a PC projection of `Xr` (and `Xu` if provided). PC projection is done @@ -608,12 +633,12 @@ $\mathrm T$ [@de2000mahalanobis]. 
To compute orthogonal dissimilarities in the `resemble` package, the `dissimilarity()` function can be used as follows: ```{r, results = 'hide', eval = FALSE} -# For PC dissimilarity using the default settings +# for PC dissimilarity using the default settings pcd <- dissimilarity(Xr = training$spc_p, diss_method = "pca") dim(pcd$dissimilarity) -# For PC dissimilarity using the optimized component selection method +# for PC dissimilarity using the optimized component selection method pcd2 <- dissimilarity(Xr = training$spc_p, diss_method = "pca.nipals", Yr = training$Ciso, @@ -621,9 +646,9 @@ pcd2 <- dissimilarity(Xr = training$spc_p, return_projection = TRUE) dim(pcd2$dissimilarity) pcd2$dissimilarity -pcd2$projection # The projection used to compute the dissimilarity matrix +pcd2$projection # the projection used to compute the dissimilarity matrix -# For PLS dissimilarity +# for PLS dissimilarity plsd <- dissimilarity(Xr = training$spc_p, diss_method = "pls", Yr = training$Ciso, @@ -631,7 +656,7 @@ plsd <- dissimilarity(Xr = training$spc_p, return_projection = TRUE) dim(plsd$dissimilarity) plsd$dissimilarity -plsd$projection # The projection used to compute the dissimilarity matrix +plsd$projection # the projection used to compute the dissimilarity matrix ``` To compute the correlation dissimilarity between training and testing @@ -653,11 +678,11 @@ plsd_tr_ts <- dissimilarity(Xr = training$spc_p, pc_selection = list("opc", 20)) dim(plsd_tr_ts$dissimilarity) ``` -In this last two examples, matrices of `r nrow(plsd_tr_ts$dissimilarity)` rows -and `r ncol(plsd_tr_ts$dissimilarity)` columns are retrieved. The rows are the -same as the rows in the training dataset while the columns are the same as the -rows in the testing dataset. The dissimilarity between the $i$th observation in -the training dataset and the $j$th observation in the tetsing dataset is stored +In the last two examples, matrices of `r nrow(plsd_tr_ts$dissimilarity)` rows +and `r ncol(plsd_tr_ts$dissimilarity)` columns are retrieved. The number of rows +is the same as in the training dataset while the number of columns is the same as +in the testing dataset. The dissimilarity between the $i$th observation in +the training dataset and the $j$th observation in the testing dataset is stored in the $i$th row and the $j$th column of the resulting dissimilarity matrices. @@ -681,14 +706,14 @@ $d(x_i, x_j) \neq d(x_j, x_i)$). For computing this type of localized dissimilarity matrix, two arguments need to be passed to the `dissimilarity()` function: `.local` and `pre_k`. These are not formal arguments of the function, however, they are passed to -the `ortho_diss` function which is used by the `dissimilarity()` function for +the `ortho_diss()` function, which is used by the `dissimilarity()` function for computing the dissimilarities in the orthogonal spaces. -Here are some two examples on how to perform localized dissimilarity +Here are two examples of how to perform localized dissimilarity computations: ```{r localdiss, eval = TRUE} -# For localized PC dissimilarity using the optimized component selection method +# for localized PC dissimilarity using the optimized component selection method # set the number of neighbors to retain knn <- 200 local_pcd_tr_ts <- dissimilarity(Xr = training$spc_p, @@ -720,22 +745,22 @@ local_plsd_tr_ts$dissimilarity[1:10, 1:2] ## Correlation dissimilarity Correlation dissimilarity is based on the Pearson's $\rho$ correlation coefficient between observations. 
The value of Pearson's $\rho$ varies between -1 and 1. A -correlation of 1 indicates between two observations would indicate that they -might have identical characteristics (i.e. they are can be considered as highly -similar). A value of -1, conversely, would indicate that the two observations -are strongly negatively correlated (i.e. the two observations are highly -dissimilar). The correlation dissimilarity implemented in the package -scales the values between 0 (highest dissimilarity) and 1 (highest similarity). -To measure $d$ between two observations $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ +correlation of 1 between two observations would indicate that they are perfectly +correlated and might have identical characteristics (i.e. they can be +considered as highly similar). A value of -1, conversely, would indicate that +the two observations are perfectly negatively correlated (i.e. the two +observations are highly dissimilar). The correlation dissimilarity implemented +in the package scales the values between 0 (highest similarity) and 1 +(highest dissimilarity). To measure $d$ between two observations $x_i$ and $x_j$ based on the correlation dissimilarity the following equation is applied: -$$d(x_i, x_j) = \frac{1}{2} 1 - \rho(x_i, x_j)$$ +$$d(x_i, x_j) = \frac{1}{2} (1 - \rho(x_i, x_j))$$ Note that $d$ cannot be considered as a distance metric since it does not satisfy the axiom of identity of indiscernibles. Therefore we prefer the use of the term dissimilarity. -The following code demonstrates how to compute the coerrelation dissimilarity +The following code demonstrates how to compute the correlation dissimilarity between all observations in the training dataset: ```{r, results = 'hide', eval = FALSE} cd_tr <- dissimilarity(Xr = training$spc_p, diss_method = "cor") @@ -755,7 +780,7 @@ cd_tr_ts$dissimilarity Alternatively, the correlation dissimilarity can be computed using a moving window. In this respect, a window size term $w$ is introduced to the original -$d$ equation: +equation: $$d(x_i, x_j; w) = \frac{1}{2 w}\sum_{k=1}^{p-w}\left(1 - \rho(x_{i,\{k:k+w\}}, x_{j,\{k:k+w\}})\right)$$ @@ -782,7 +807,7 @@ cd_mw$dissimilarity In the computation of the Euclidean dissimilarity, each feature has equal significance. Hence, correlated variables which may represent irrelevant features, may have a disproportional influence on the final dissimilarity -measurement [brereton2003chemometrics]. Therefore, it is not recommended to +measurement [@brereton2003chemometrics]. Therefore, it is not recommended to use this measure directly on the raw data. To compute the dissimilarity between two observations/vectors $x_i$ and $x_j$ the package uses the following equation: @@ -797,7 +822,7 @@ ed <- dissimilarity(Xr = training$spc_p, diss_method = "euclid") ed$dissimilarity ``` -The `dist()` function in the `R``stats` package can also be used to compute +The `dist()` function in the `R` package `stats` can also be used to compute Euclidean distances, however the `resemble` implementation tends to be faster (especially for very large matrices): ```{r, results = 'hide', eval = FALSE} @@ -821,8 +846,9 @@ ed_resemble$dissimilarity[1:2, 1:3] ``` In the above code it can be seen that the results of the `dist()` require scaling based on the number of input variables. This means that, by default, -the results of the `dist()` tend to increase with the number of input +the values output by `dist()` increase with the number of input variables. 
This is an effect that is already accounted for in the implementation +of the Euclidean (and also Mahalanobis) dissimilarities +of `resemble`. Another advantage of the Euclidean dissimilarity in `resemble` over the one in @@ -843,8 +869,8 @@ been extensively applied in remote sensing as a tool for unsupervised classification and spectral similarity analysis. The cosine dissimilarity between two observations ($x_i$ and $x_j$) is calculated as: -$$d (x_i, x_j) = cos^{-1} \tfrac{\sum_{k=1}^{d} x_{i,k} x_{j,k} } {\sqrt{\sum_{k=1}^{d} x_{i,k}^2} \sqrt{\sum_{k=1}^{d} x_{j,k}^2}}$$ -where $d$ is the number of variables. +$$d (x_i, x_j) = cos^{-1} \tfrac{\sum_{k=1}^{p} x_{i,k} x_{j,k} } {\sqrt{\sum_{k=1}^{p} x_{i,k}^2} \sqrt{\sum_{k=1}^{p} x_{j,k}^2}}$$ +where $p$ is the number of variables. With the `dissimilarity()` function the cosine dissimilarity can be computed as follows: @@ -867,7 +893,7 @@ $x_j$ based on this method is computed as: $$d(x_i, x_j) = kl (x_i, x_j) + kl (x_j, x_i)$$ -The following code can be sued to compute the SID between the training and +The following code can be used to compute the SID between the training and testing observations: ```{r, results = 'hide', eval = FALSE} sid_tr_ts <- dissimilarity(Xr = training$spc_p, @@ -878,14 +904,20 @@ sid_tr_ts$dissimilarity ``` See the `sid()` function in the `resemble` package for more details. -## Comparing the dissimilarity measures using the nearest neighbor method +## How to know if a dissimilarity method is reliable? -Here the different methods to measure dissimilarity between spectra are -compared in terms of their ability to retrieve observations (first nearest -neighbor or closest observation) with similar Total Carbon ("Ciso"). This -indicates how well the spectral similarity between observations reflect their -compositional similarity. +Usually, dissimilarity assessment is disregarded and the decision on what method +to use is sometimes arbitrary. However, if the similarity/dissimilarity +estimated between observations from their predictor/explanatory variables +fails to reflect the real similarity/dissimilarity, such estimations +are of little use for further analyses. +The package `resemble` offers functionality for assessing dissimilarity matrices. +These assessments are based on first nearest neighbor search (1-NN). In this +section, the different methods to measure dissimilarity between spectra are +compared in terms of their ability to retrieve 1-NN observations with +similar Total Carbon ("Ciso"). This indicates how well the spectral similarity +between observations reflects their compositional similarity. 
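+The logic behind this 1-NN evaluation can be condensed in a few lines (a
+minimal sketch only; `d` is assumed to be a precomputed symmetric
+dissimilarity matrix for the training observations and `side_inf` the
+corresponding vector of side information; the package implements this
+evaluation in `sim_eval()`):
+
+```{r, eval = FALSE}
+# index of the closest observation to each observation: the first element
+# of each ordered column is the observation itself, so its 1-NN is the
+# second element
+nn <- apply(d, MARGIN = 2, FUN = function(x) order(x)[2])
+
+# RMSD between the side information of each observation and that of its
+# 1-NN: the smaller this value, the better the dissimilarity matrix
+# reflects the similarity in terms of side information
+rmsd <- mean((side_inf - side_inf[nn])^2)^0.5
+```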
Compute a dissimilarity matrix for `training$spc_p` using the different methods: ```{r, results = 'hide', eval = TRUE} @@ -930,7 +962,7 @@ sinfd <- dissimilarity(training$spc_p, diss_method = "sid", scale = TRUE) ``` -Use the `sim_eval` function with each dissimilarity matrix to find the closest +Use the `sim_eval()` function with each dissimilarity matrix to find the closest observation to each observation and compare them in terms of the `Ciso` variable: @@ -961,13 +993,14 @@ fig_cap <- paste("Comparison between observations and their corresponding "cd: Correlation dissimilarity;", "mcd: Moving window correlation dissimilarity;", "ed: Euclidean dissimilarity;", - "sinfd: Spectral information divergence/dissimilarity") + "sinfd: Spectral information divergence/dissimilarity.") ``` Table \@ref(tab:tcomparisons) and Figure \@ref(fig:pcomparisons) show the -results of the comparisons between the Total Carbon of the observations and the -Total Carbon of their most similar samples (1-NN) according to the dissimilarity -method used. In the example, the spectral dissimilarity matrices that best reflect the -compositions similarity are those built with the +results of the comparisons (for the training dataset) between the Total Carbon +of the observations and the Total Carbon of their most similar samples (1-NN) +according to the dissimilarity method used. In the example, the spectral +dissimilarity matrices that best reflect the compositional similarity are those +built with the _pls with optimized component selection_ (`o_plsd`) and _pca with optimized component selection_ (`o_pcad`). @@ -1011,12 +1044,12 @@ knitr::kable(do.call("rbind", comparisons), "for between the observations and their", "corresponding closest observations", "retrieved with the different dissimilarity - methods"), + methods."), format = "simple", digits = 2, align = "l", padding = 2) ``` ```{r eval = FALSE} -op <- par() +old_par <- par("mfrow") par(mfrow = c(3, 3)) p <- sapply(names(ev), FUN = function(x, label, labs = c("Ciso (1-NN), %", "Ciso, %")) { @@ -1028,10 +1061,12 @@ p <- sapply(names(ev), }, x = ev) +par(old_par) ``` -```{r pcomparisons, fig.cap = paste(fig_cap), fig.cap.style = "Image Caption", fig.align = "center", fig.width = 8, fig.height = 8, echo = FALSE} -op <- par() +```{r pcomparisons, fig.cap = paste(fig_cap), fig.cap.style = "Image Caption", fig.align = "center", fig.width = 8, fig.height = 8, echo = FALSE, fig.retina = 0.85} +old_par <- par("mfrow", "mar") + par(mfrow = c(3, 3), pch = 16, mar = c(4,4,4,4)) my_cols <- c("#750E3380", "#C3BC6180", @@ -1053,25 +1088,20 @@ p <- sapply(names(ev), title(label) grid(col= "#80808080", lty = 1) abline(0, 1, col = "#FF1A0080") - }, x = ev, cols = my_cols) +par(old_par) ``` -```{r, results = 'hide', eval = FALSE} -## reset the graphical parameters -par(op) -``` - # k-Nearest Neighbors (k-NN) search In the package, the k-NN search aims at finding in a given reference set of -observations the most spectrally similar observations for another given set of +observations a group of spectrally similar observations for another given set of observations. For an observation, its most similar observations are known as nearest neighbors and they are usually found by using dissimilarity metrics. -In the `resemble` package, the k-nearest neighbor search is implemented in the +In `resemble`, the k-nearest neighbor search is implemented in the function `search_neighbors()`. 
This function uses the `dissimilarity()` function to compute the dissimilarity matrix that serves in the identification of the neighbors. These neighbors can be retained in two ways: _i_. by providing a @@ -1086,11 +1116,11 @@ which serves as the basis for the examples presented in this section. This means that the neighboring observations are retained regardless of their dissimilarity/distance to the target observation. Each target observation for which its neighbors are to be found ends up with the same neighborhood size -($k$). The problem with this approach is that observations that are in fact +($k$). A drawback of this approach is that observations that are in fact largely dissimilar to the target observation might end up in its neighborhood. This is because the requirement for building the neighborhood is based on its size and not on the similarity of the retained observations to the target one. -In the `dissimilarity()` function, the neighborhood size is controlled through +In the `search_neighbors()` function, the neighborhood size is controlled by the argument `k`. Here is an example that demonstrates how `search_neighbors()` can be used to @@ -1104,7 +1134,7 @@ knn_pc <- search_neighbors(Xr = training$spc_p, # matrix of neighbors knn_pc$neighbors -# matrix of neighbor distances +# matrix of neighbor distances (dissimilarity scores) knn_pc$neighbors_diss # the index (in the training set) of the first two closest neighbors found in @@ -1125,16 +1155,16 @@ neighbors found. This is a matrix of neighbor indices where every column represents an observation in the testing set while every row represents the neighbor index (in descending order). Every entry represents the index of the neighbor observation in the training set. The `knn_pc$neighbors_diss` matrix -shows the distances corresponding to the neighbors found. For example, for the -first observation in `testing` its closest observation found in `training` -corresponds to the one with index `r knn_pc$neighbors[1]` +shows the dissimilarity scores corresponding to the neighbors found. For example, +for the first observation in `testing` its closest observation found in +`training` corresponds to the one with index `r knn_pc$neighbors[1]` (`knn_pc$neighbors[1]`) which has a dissimilarity score of -`r knn_pc$neighbors_diss[1]` (`knn_pc$neighbors_diss[1]`). +`r round(knn_pc$neighbors_diss[1], 2)` (`knn_pc$neighbors_diss[1]`). Neighbor search can also be conducted with all the dissimilarity measures described in previous sections. The neighbors retrieved will then depend on the -approach used. Thus, it is recommended to evaluate carefully what dissimilarity -metric to use before neighbor search. +dissimilarity method used. Thus, it is recommended to evaluate carefully what +dissimilarity metric to use before neighbor search. Here are other examples of neighbor search based on other dissimilarity measures: @@ -1207,7 +1237,7 @@ of a very small size or even empty, which constrains any type of analysis within such neighborhoods. On the other hand, some neighborhoods might end up with large sizes which might include either redundant observations or in some other cases where $d_{th}$ is too large the complexity in the neighborhood might -be large. In the `dissimilarity()` function, $d_{th}$ is controlled through the +be large. In the `search_neighbors()` function, $d_{th}$ is controlled by the argument `k_diss`. This argument is accompanied by the argument `k_range` which is used to control the maximum and minimum neighborhood sizes. 
For example, if a neighborhood size is below the minimum size $k_{min}$ specified in `k_range`, @@ -1255,21 +1285,21 @@ dnn_pc$unique_neighbors ``` In the code above, the size of the neighborhoods is not constant, the size -variability can be easily visualize with a histogram on +variability can be easily visualized with a histogram on `dnn_pc$k_diss_info$n_k`. Figure \@ref(fig:knnhist), shows that many neighborhoods were reset to a size of `r k_min` or to a size of `r k_max`. -```{r, knnhist, fig.cap = "Histogram of the original neighborhood sizes", fig.cap.style = "Image Caption", fig.align = "center", fig.width = 8, fig.height = 5} +```{r, knnhist, fig.cap = "Histogram of the original neighborhood sizes", fig.cap.style = "Image Caption", fig.align = "center", fig.width = 5, fig.height = 4, fig.retina = 0.80} hist(dnn_pc$k_diss_info$final_n_k, breaks = k_min, xlab = "Final neighborhood size", main = "", col = "#EFBF47CC") ``` -## Spiking the neighborhoods -In the package, spiking refers to forcing specific observations to be included +## _Spiking_ the neighborhoods +In the package, _spiking_ refers to forcing specific observations to be included in the neighborhoods. For example, if we are searching in the `training` set -the neighbors of the `tetsing` set, and if we want to force certain observations +the neighbors of the `testing` set, and if we want to force certain observations in `training` to be included in the neighborhood of each observation in `testing`, we can use the `spike` argument in `search_neighbors()`. For that, in this argument we will need to pass the indices of `training` that we will @@ -1297,7 +1327,7 @@ The previous code shows that the indices specified in `forced_guests` are always selected as part of every neighborhood. Spiking might be useful when there is prior knowledge of the similarity -between certain obsevations that cannot be easily pick up by the data. +between certain observations that cannot be easily picked up from the data. # Regression @@ -1307,24 +1337,24 @@ methods designed to deal with complex spectral datasets [@ramirez2013spectrum]. In MBL, instead of deriving a general or global regression function, a specific regression model is built for each observation requiring a prediction of a response. Each model is fitted using the nearest neighbors of the target -observation found in a calibration or reference set. While a global -function may be very complex, MBL can describe the target function as a +observation found in a calibration or reference set (Figure \@ref(fig:mblgif)). While a +global function may be very complex, MBL can describe the target function as a collection of less complex local (or locally stable) approximations -[mitchell1997machine]. For example, for predicting the response variable $Y$ of +[@mitchell1997machine]. For example, for predicting the response variable $Y$ of a set of $m$ observations from their explanatory variables $X$, a set of $m$ functions are required to be fitted. This can be described as: $$\hat{y}_i = \hat{f}_i(x_i;\theta_i) + \varepsilon_{i} \; \forall \; i = \{1, ..., m\}$$ -where $\varepsilon_{i}$ represents a set particular of parameters required to -fit $\hat{f}_i$ (e.g. number of factors in a pls model), and $\varepsilon_{i}$ -the particular uncertainty for the $i$th neighborhood. Therefore, MBL in the -above example can be described broadly as: +where $\theta_{i}$ represents a set of particular parameters required to +fit $\hat{f}_i$ (e.g. 
number of factors in a PLS model), and $\varepsilon_{i}$ +is a residual error term from the fitting of $\hat{y}_{i}$ in the $i$th +neighborhood. Therefore, MBL in the above example can be described broadly as: $$\hat{f} = \{\hat{f}_1,...,\hat{f}_m\}$$ Figure \@ref(fig:mblgif) illustrates the basic steps in MBL for a set of five observations ($m = 5$). -```{r mblgif, fig.cap = "Example of the main steps in memory-based learning for predicting a response variable in five different observations based on set of p-dimesnional variables", echo = FALSE, out.width = '65%', fig.align = 'center'} +```{r mblgif, fig.cap = "Example of the main steps in memory-based learning for predicting a response variable in five different observations based on a set of p-dimensional variables.", echo = FALSE, out.width = '65%', fig.align = 'center', fig.retina = 0.85} knitr::include_graphics("MBL.gif") ``` @@ -1334,37 +1364,37 @@ must be defined for any MBL algorithm: 1. ___A dissimilarity metric___: It is required for neighbor search. The dissimilarity metric used must be capable also to reflect the dissimilarity in terms of the response variable for which models are to be built. For example, - in (Near-Infrared) spectral dissisimilarity of soil samples must be capable of - reflecting the compositional dissisimilarity between them. Dissimilarity - methods that poorly reflect this general sample dissimilarity are prone to - lead to poor predictive performance. + in soil NIR spectroscopy, the spectral dissimilarity values of soil samples + must be capable of reflecting the compositional dissimilarity between them. + Dissimilarity methods that poorly reflect this general sample dissimilarity + are prone to lead to MBL models with poor predictive performance. 2. ___How many neighbors to look at?___: It is important to optimize the neighborhood size to be used for fitting the local models. Neighborhoods which are too small might be too sensitive to noise and outliers affecting the - robustness of the models[@ramirez2013spectrum]. Small neighborhoods might + robustness of the models [@ramirez2013spectrum]. Small neighborhoods might also lack enough variance to properly capture the relationships between the predictors and the response. On the other hand, large size neighborhoods might introduce complex non-linear relationships between predictors and - response which migth decrease the accuracy of the models. + response which might decrease the accuracy of the models. - 3. ___How to use the dissimilarity information___: The dissimilarity information + 3. ___How to use the dissimilarity information?___: The dissimilarity information can be: - - Ignored, this means is only used to retrieve neighbors [e.g. the LOCAL + - _Ignored_, meaning it is only used to retrieve neighbors [e.g. the LOCAL algorithm, @shenk1997investigation]. - - Used to weight the training observations according to their + - _Used to weight the training observations_ according to their dissimilarity to the target observation [e.g. as in locally weighted PLS regression, @naes1990locally]. - - Used as source of additional predictors [@ramirez2013spectrum]. In this + - _Used as source of additional predictors_ [@ramirez2013spectrum]. In this case, the pairwise dissimilarity matrix between all the $k$ neighbors is also retrieved. This matrix of $k \times k$ dimensions is combined with the $p$ predictor variables resulting in a final matrix of predictors (for the neighborhood) of $k \times (k+p)$ dimensions. To predict the - target observation, the preditors used are the $p$ spectral variables in + 
To predict the - target observation, the preditors used are the $p$ spectral variables in + target observation, the predictors used are the $p$ spectral variables in combination to the vector of distances between the target observation and its neighbors. In some cases, this approach might lead to an increase on the predictive performance of the local models. This combined matrix of @@ -1380,9 +1410,9 @@ d_{k,1} & d_{k,2} & ... & 0_{k,k} & x_{k,k+1} & x_{k,k+2} & ...& x_{k,k+p} where $d_{i,j}$ represents the dissimilarity score between the $i$th neighbor and the $j$th neighbor. - 4. ___How to fit with the local points?___: This is given by the regression + 4. ___How to fit the local points?___: This is given by the regression method used which is usually a linear one, as the relationships between - the explanatory variables and the response are usually assumed as linear + the explanatory variables and the response are usually assumed linear within the neighborhood. In the literature MBL is sometimes referred to as local modeling, nevertheless @@ -1425,7 +1455,7 @@ of the MBL methods (which are described few paragraphs above): is passed, the dissimilarity information is used as source of additional predictors. - 4. _How to fit with the local points?_: This is controlled by the `method` + 4. _How to fit the local points?_: This is controlled by the `method` argument. For this, a `local_fit` object (which carries the information of the regression method and its parameters) is used. There are three methods available: partial least squares (PLS) regression, weighted average partial @@ -1472,7 +1502,7 @@ validation and some optimization aspects of the `mbl()` function. In to be selected from the subset of nearest neighbors. The global error of the predictions is computed as the average of the local root mean square errors. -Let's see some examples on who to build objects for controlling the validation +Let's see some examples on how to build objects for controlling the validation in `mbl`. ```{r, eval = TRUE} @@ -1526,7 +1556,7 @@ local_ciso <- mbl( Now let's explore the `local_ciso` object: -```{r localresultsciso, fig.cap = "MBL results for Total Carbon predictions using the LOCAL algorithm. NNv: nearest-neighbor cross-validation.", fig.align = "center", fig.width = 8, fig.height = 4.5} +```{r localresultsciso, fig.cap = "MBL results for Total Carbon predictions using the LOCAL algorithm. 
NNv: nearest-neighbor cross-validation.", fig.align = "center", fig.width = 7.5, fig.height = 4, fig.retina = 0.85} plot(local_ciso, main = "") local_ciso ``` @@ -1551,7 +1581,9 @@ local_ciso$results # the get_predictions function makes it easier to retrieve the # predictions from the previous object ciso_hat <- as.matrix(get_predictions(local_ciso))[, bki] +``` +```{r, eval = F} # Plot predicted vs reference plot(ciso_hat, testing$Ciso, xlim = c(0, 14), @@ -1576,7 +1608,7 @@ cor(ciso_hat, testing$Ciso, use = "complete.obs")^2 Similar results are obtained when the optimization of the neighborhoods is based on distance thresholds: ```{r, results = 'hide', eval = TRUE, fig.show = 'hide'} -# create a vector of diatnce threhsolds to evaluate +# create a vector of dissimilarity thresholds to evaluate # since the correlation dissimilarity will be used # these thresholds need to be > 0 and <= 1 dths <- seq(0.025, 0.3, by = 0.025) @@ -1597,7 +1629,9 @@ local_ciso_diss <- mbl( control = nnv_val_control, scale = TRUE ) +``` +```{r, eval = FALSE} plot(local_ciso_diss) ``` @@ -1621,7 +1655,9 @@ bd <- local_ciso_diss$validation_results$nearest_neighbor_validation$k[bdi] # predictions for the best distance ciso_diss_hat <- as.matrix(get_predictions(local_ciso_diss))[, bdi] +``` +```{r, eval = FALSE} # Plot predicted vs reference plot(ciso_diss_hat, testing$Ciso, xlim = c(0, 14), @@ -1637,7 +1673,7 @@ abline(0, 1, col = "red") Here we provide a few additional examples of some MBL configurations where we make use of another response variable available in the dataset: soil cation exchange capacity (CEC). This variable is perhaps more challenging to predict in -comparison to Total Carbon. Table \ref(tab:addexamples) provides a summary of +comparison to Total Carbon. Table \@ref(tab:addexamples) provides a summary of the configurations tested in the following code examples. ```{r addexamples, eval = TRUE, echo = FALSE} @@ -1670,7 +1706,7 @@ colnames(my_tab) <- c("Abreviation", knitr::kable(my_tab, caption = paste("Basic description of the different MBL", "configurations in the examples to predict", - "Cation Exhange Capacity (CEC)"), + "Cation Exchange Capacity (CEC)."), format = "simple", align = "l", padding = 2) ``` @@ -1776,11 +1812,12 @@ cor(testing$CEC, preds, use = "complete.obs")^2 colMeans((preds - testing$CEC)^2, na.rm = TRUE)^0.5 ``` -The scatter plots in \ref(fig:mblcomparisons) ilustrate the prediction results +The scatter plots in Figure \@ref(fig:mblcomparisons) illustrate the prediction results obtained for CEC with each of the MBL configurations tested. 
```{r, eval = FALSE, fig.show = 'hide'} -op <- par() +old_par <- par("mfrow", "mar") + par(mfrow = c(2, 2)) plot(testing$CEC, preds[, 2], xlab = "Predicted CEC, meq/100g", @@ -1796,11 +1833,12 @@ plot(testing$CEC, preds[, 4], xlab = "Predicted CEC, meq/100g", ylab = "CEC, meq/100g", main = colnames(preds)[4]) abline(0, 1, col = "red") -par(op) +par(old_par) ``` -```{r mblcomparisons, fig.cap = "CEC prediction results for the different MBL configurations tested" , fig.cap.style = "Image Caption", fig.align = "center", fig.width = 8, fig.height = 8, echo = FALSE} -op <- par() +```{r mblcomparisons, fig.cap = "CEC prediction results for the different MBL configurations tested" , fig.cap.style = "Image Caption", fig.align = "center", fig.width = 8, fig.height = 8, echo = FALSE, fig.retina = 0.85} +old_par <- par("mfrow", "mar") + par(mfrow = c(2, 2), pch = 16, mar = c(4,4,4,4)) my_cols <- c("#D42B08CC", "#750E3380", @@ -1840,7 +1878,32 @@ p <- sapply(colnames(preds), cols = my_cols, rsq = r2s, rmse = rmses) +par(old_par) +``` + +## Using the `Yu` argument +If information on the response values in the prediction set is available, +the `Yu` argument can be used to directly validate the predictions made by +`mbl()`. It is not taken into account for any optimization or modeling step. +It can be used as follows: + +```{r, eval = FALSE} +# use Yu argument to validate the predictions +pc_pred_nt_yu <- mbl( + Xr = training$spc_p[!is.na(training$Nt),], + Yr = training$Nt[!is.na(training$Nt)], + Xu = testing$spc_p, + Yu = testing$Nt, + k = seq(40, 100, by = 10), + diss_usage = "none", + control = mbl_control(validation_type = "NNv"), + scale = TRUE +) + +pc_pred_nt_yu +``` + + ## Supported parallelism The `mbl()` function uses the `foreach()` function of the [package `foreach`](https://CRAN.R-project.org/package=foreach) for iterating over every row/observation passed to the argument `Xu`. In the following example, @@ -1850,8 +1913,12 @@ following example we use parallel processing to predict Total Nitrogen: ```{r, eval = FALSE} # Running the mbl function using multiple cores - -n_cores <- parallel::detectCores() - 1 -if (n_cores == 0) { + +# Execute with two cores, if available, ... +n_cores <- 2 + +# ... if not then go with 1 core +if (parallel::detectCores() < 2) { n_cores <- 1 } @@ -1883,27 +1950,6 @@ try(stopCluster(clust)) pc_pred_nt ``` -## Using `Yu` argument -If information of the response values in the prediction set is available, then, -the `Yu` argument can be used to directly validate the predictions done by -`mbl()`. It is not taken into accound for any optimization or modeling step. -It can be used as follows: - -```{r, eval = FALSE} -# use Yu argument to validate the predictions -pc_pred_nt_yu <- mbl( - Xr = training$spc_p[!is.na(training$Nt),], - Yr = training$Nt[!is.na(training$Nt)], - Xu = testing$spc_p, - Yu = testing$Nt, - k = seq(40, 100, by = 10), - diss_usage = "none", - control = mbl_control(validation_type = "NNv"), - scale = TRUE -) - -pc_pred_nt_yu -``` # References