- Changed default feature selection method in
custom_preprocessing()
, andpreprocessing_feature_selection()
toBORUTA
, as it is the most effective one, - Changed default imputation method for
preprocessing()
insidetrain()
toknn
, as it is the most effective one, - Fixed an issue for VS plots, where the color was not assigned properly for the models,
- Updated
.Rbuildignore
,DESCRIPTION
, andNAMESPACE
. - Added
cran-comments.md
file for future CRAN submission. - Fixed use of the deprecated size parameter in plotting functions, and replaced it with linewidth.
- Fixed issues, which occured while creating plots for outputs with less than 10 models.
- Enhanced the verbose mode, by adding more information about the current stage of the training process, added execution times for different training methods, and improved the overall readability of the messages.
- Introduced parallel training for random search, and bayesian optimization, which significantly speeds up the training process. It works for
ranger
, andxgboost
only, due to the limitations of other packages. It results in an addition of theparallel
parameter intrain()
,random_search()
, andtrain_models_bayesopt()
functions. - Intorduced more verbosity options for bayesian optimization process by introudcing the
bayes_info
parameter intrain()
, andtrain_models_bayesopt()
functions. The user can now track the quality of found hyperparameters. - Introduced the
select_models()
function, which lets the user shrink the output oftrain()
function to the selected models only. It is useful when the user wants to focus on the best models only, and limit the size of the output, while maintaining the ability to generate reports, and other functionalities. - Fixed an issue where for different tasks we had incosistent parameters in the output of the
train()
function, regarding the predictions of all three sets. - Modified the tests, so they fit with the current package version, and added tests for
select_models()
function. - Added
check_correlation
parameter for bothtrain()
andcheck_Data()
functions, which lets the user decide whether to check the correlation between the features or not. In some corner cases, with highly dimensional data, it can be time-consuming, and not necessary.
- Updated
.Rbuildignore
and.gitignore
. - Updated package
DESCRIPTION
, andNAMESPACE
. - Removed:
choose_best_models.R
,forester_palette.R
,format_model_details.R
- Updated
check_data()
:- Added more checks of input parameters,
- Modified
check_y_balance()
sub-function for multiclass classification.
- Updated
custom_preprocessing()
:- Added parameter type,
- Added more checks of input parameters,
- Added more advanced verbose options.
- Added
multiclass classification
task, with all extensional functionalities included (report, explainability, plots, etc), which led to the abundance of changes in majority of functions. - Added more input checks for
train_test_balance()
. - Updated documentation.
- Added manual test for multiclass classification task.
- Moved old tests into
misc/old_tests_03_02_2023
folder. - Created a new, and more in depth tests for the package from scratch for regression, binary classification, and multiclass classification tasks.
- In the DESCRIPTION updated the description, and RoxygenNote version.
- In README updated guide for catboost installation.
- Moved
verbose_cat()
, andguess_type()
, from their own files tocheck_data()
. - Removed
choose_best_models()
function, and implemented its logic directly intrain()
. - Removed the unused
create_ranked_list()
, andformat_models_details()
functions. - Fixed
plot_classification()
function, associated with the error cannot xtfrm data frames occurred. - Suppressed meaningless warnings from
prepare_data()
, andpreprocessing_removal()
functions. - Fixed error cannot xtfrm data frames in both reports.
- Added folder
misc/manual_tests
with supplementary, manual tests of the package, focusing on report generation. - Removed
test-choose_best_model
test, and fixed two other tests.
- In the DESCRIPTION:
- Added in Imports the
patchwork
package, and movedarules
from Imports to Suggests, - Updated the authors of the package.
- Added in Imports the
- In
report()
function:- Added separate generation of binary classification and regression reports,
- Enhanced the quality of plots in the report,
- Fixed the bugs leading to lower quality of the report,
- Added detailed description of how to interpret the visualizations for the beginners.
- Created
plot_classifcation
file, which overridesplot()
function for the binary classification object returned bytrain()
. The function lets us create plots for binary classification tasks, such as metrics comparison line plot, ROC curve, confusion matrix, and train vs test plot. - Created
plot_regression
file, which overridesplot()
function for the regression object returned bytrain()
. The function lets us create plots for regression tasks, such as residuals box plot, observed vs prediction plot, and train vs test plot. - Removed
report.Rmd
file, and created two separate files for different tasks calledreport_binary.Rmd
, andreport_regression.Rmd
. - In
train()
function: added different classes for the outcomes, depending on the task type. - In
plot_metrics()
, changed color palette tocolors_discrete_forester
. - In
check_data()
, added a small fix for the report.
- In the DESCRIPTION: Added in Suggests
SurvMetrics
,randomForestSRC
, andsurvival
packages. - In
train()
function:- Added parameters
time
, andstatus
describing the survival analysis task, - Added parameters
time
, andstatus
to the functions output, - Modified inner methods, to work with survival analysis task.
- Added parameters
- In
check data()
function:- Added parameters
time
, andstatus
describing the survival analysis task, - Modified the methods:
basic_info()
,check_missing()
,check_cor()
, andcheck_y_balance()
, so they also work for the survival analysis.
- Added parameters
- In
preprocessing()
function, added the binarization of survival task target (status). - In
prepare_data()
function, added a method for the survival analysis task. - In
train_models()
function, added a method for training a survival analysis model fromrandomForestSRC
package. - In
random_search()
function, added a method for tuning a survival analysis model. - In
train_models_bayesopt()
function, added a method for tuning a survival analysis model. - In
predict_models_all()
,predict_models()
, andpredict_new
functions, added a method for predicting a survival analysis model. - In
score_models()
, added a method for evaluation of survival analysis models withBrier Score
, andConcordance Index (CIN)
. - In
guess_type()
, added method of detecting the survival analysis task. - Modified
tests
of the package, to cover the changes made in the package.
- In the DESCRIPTION: Added in Imports the
VIM
package, and in the Suggestssivs
,parallel
,rmcfs
, andvarrank
packages. - In
check_data()
function:- removed 'no', and added 'index' to id_names used for id-like columns detection,
- Fixed minor printing and code formatting issues.
- Added a
custom_preprocessing()
function, which is more advanced and customizable approach for the preprocessing pipeline. It executes other new functions implementing three major pillars of preprocessing. The functions arepreprocessing_removal()
,preprocessing_imputation()
,preprocessing_feature_selection()
:preprocessing_removal()
- This function includes 6 modules for the removal of unwanted features / observations. We can remove duplicate columns, the ID-like columns, static columns (with specified staticity threshold), sparse columns (with specified sparsity threshold), and highly correlated ones (with specified high correlation threshold). Additionally we can remove the observations that are too sparse (sparsity threshold), and have missing target value. One can turn on and off each module by setting proper logical values.preprocessing_imputation()
- Imputes missing values according to one of four prepared methods:median-other
- The numeric features are imputed with median value, whereas the categorical ones with the 'other' string,median-frequency
- The numeric features are imputed with median value, whereas the categorical ones with the most frequent value,knn
- All features are imputed with KNN algorithm,mice
- All features are imputed with MICE algorithm.
preprocessing_feature_selection()
- Conducts a feature selection process with one out of four proposed methods:VI
- The variable importance method based on random forest,MCFS
- The Monte Carlo Feature Selection,MI
- The Varrank method based on mutual information scores,BORUTA
- The BORUTA algorithm - short time.
- Added tests for the
custom_preprocessing()
,preprocessing_removal()
,preprocessing_imputation()
, andpreprocessing_feature_selection()
functions. - In the
train()
function:- Removed
advanced_preprocessing
parameter, ascustom_preprocessing()
is more advanced version of it, - Added
custom_preprocessing
parameter, which takes the output of acustom_preprocessing()
function, - Added
deleted_rows
value to the output, - Changed output parameter from
columns
todeleted_columns
.
- Removed
- In the
preprocessing()
function:- Removed
advanced
parameter, ascustom_preprocessing()
is more advanced version of it, - Changed output parameter from
columns
torm_columns
. - Removed the example from this function.
- Removed
- In the
save()
function:- Renamed to
save_forest()
, - Merged parameters
name
andpath
intofile
as it is standard approach - Added an example of loading saved data to documentation.
- Renamed to
- In the
prepare_data()
function added encoding for thectree
model so it can use columns with more than 30 levels. - In the
random_search()
function added parameterverbose
. When set to TRUE the function provides information about the training. - Removed plenty of examples from functions that are not meant for the user.
- Fixed plenty of formatting issues in various functions.
- In the
train()
function:- Added the
split_seed
parameter which enables the user to set the seed for the train-test split method, - Added the
train_inds
,test_inds
, andvalid_inds
vectors to the train output, which enable the user to recover the information which observation went to the train, test, and validation sets.
- Added the
- In the
train_models_bayesopt()
function:- Removed unwanted seeds before starting the Bayesian optimization.
- In the
check_data()
function:- Fixed typos.
- Updated the documentation or examples in:
choose_best_model()
,create_ranked_list()
,format_models_details()
,draw_scatter_plot()
,predict_models()
,predict_models_all()
,save()
,train_models()
,train_test_balance()
.
- In the
explain()
function:- Updated the documentation and the examples,
- Fixed the handling of the factors.
- In the
draw_radar_plot()
function:- Updated the documentation,
- Deleted unnecessary prints.
- In the
prepare_data()
function:- Updated the documentation,
- Fixed the process of creating an 'other' label during the preparation of train / test splits.
- In the
preprocessing()
function:- Updated the documentation,
- Removed unused parameter type,
- Added a factorization of the target for binary classification task.
- In the
random_search()
function:- Updated the documentation,
- Fixed the method for the ranger model,
- Unified the grid with the Bayesian Optimization grid.
- In the
score_models()
function:- Updated the documentation,
- Added RMSE as a main metric for the regression task,
- Added accuracy as a main metric for the binary classification task,
- Fixed the scoring method for the regression.
- In the
train()
function:- Updated the documentation,
- Enabled providing the tibble as a data parameter (with warning),
- Added a warning if the user tries to use incorrect task type,
- Added score_train and predictions train to the output of the function.
- In the
train_models_bayesopt()
function:- Updated the documentation,
- Fixed the Bayesian Optimization. Before hand it didn't work at all for models other than ranger,
- Changed tuned hyperparameters and their values for all models, which increased the performance of the method.
- Updated the majority of
tests
so they match the current version. - Added alternative catboost installation in the
README
.
- In the
report()
function:- Removed engine and tuning columns from the ranked list table,
- Explained how the model name is created, so the columns engine and tuning are not needed,
- Fixed the radar plot issues,
- In the
draw_radar_plot()
function:- Fixed radar plots, so the models are named correctly (by model names, not engines),
- Fixed issue where the plot for binary classification had an empty observation name,
- Removed unnecessary and invalid Metric names (the ones that are not used and engine, name, tuning).
- In the
check_data()
function:- Added a vector of possible outliers to the output,
- The outliers indexes are printed if there are less than 50 of them.
- In the
train()
function:- Added a vector of possible outliers to the output.
- In the
train()
function:- Added objects to the output of the function:
- Vector: valid_observed.
- Vectors: train_observed_labels, test_observed_labels, valid_observed_labels, predictions_all_labels predictions_best_best as labels for binary classification task.
- Removed catboost from default engines.
- Added check if catboost is installed.
- Removed parameters: loss, validation, tuning, keep, because they weren't used.
- Added train_test_split parameter that determines proportions of train, test and validation subsets.
- Conditioned the message about columns deleted during the advanced preprocessing.
- Added objects to the output of the function:
- In the
check_data()
function:- Fixed an issue of too much unique values for the Crammer V's correlation. If the number of combinations exceeds 2^29, the correlation is not calculated and the user is informed about it (it is caused by R limitations).
- Fixed an issue with omitting strong correlations that are negative.
- Added ticks and crosses that indicate whether something is of the high or low quality.
- Added the information which class is the dominating one in the unbalanced classification data set.
- In the
preprocessing()
function:- Added a factorization of the target at the beginning of the
binarize_target()
sub function. - Resolved issues with data frames where columns are not saved as factors.
- Added a factorization of the target at the beginning of the
- The
report()
function:- If tinytex is not installed raises an error.
- If ggradar is not installed, there will be no error, no plot, and a note about installing the ggradar package.
- Removed engine parameter from
predict_models_all()
. - Removed misspelled function
predcit_models_all()
. - The
train_models()
function no longer returns NULL objects if engine is not selected. - The
train_models_bayesopt()
function no longer returns NULL objects if engine is not selected. - For the
format_models_details()
function fixed a bug with the method not working for a classification task for an xgboost model. - In the
DESCRIPTION
catboost, ggradar and tinytex dependencies moved from Suggests to Imports and added crayon to Import. - Added installation guides for catboost, ggradar and tinytex in the
README.md
, which solves installation issues on macOS.
- Cleared typos and inconsistent notation in the
train()
function. - Added info that the Bayesian Optimization takes a lot of time.
- Fixed 'The best model details' section in the report to be seen.
- Added info about deleted columns during
preprocessing()
. - Added more details of the
train()
function output. - Added parameter 'seed' to keep the data division comparable.
- Cleared structure of the
predict_new()
function. - Kept numeric values of model$raw_data$lightgbm_data as numeric, and not string values.
- Used different parameters for the xgboost model to remove the warnings.
- Changed [1,2] target values to [0,1] for the regression task.
- Added info about the type of correlation used between columns.
- In the
save()
the file name has a right month. - The
train()
function returns table with metrics on validation subset. - The
score()
function returns tables with additional columns:engine
andtuning
. - Added a new vignette 'Knowledge Check' with extended use cases.
- Setting
bayes_iter = 0
causes thatBayestian_opt()
is not ran anymore. - Changed the parameter name from
random_iter
torandom_evals
. - Setting
verbose = FALSE
disables thecheck_data
entirely.
- Changes in a
train()
function by adding parameters to:- Choose metrics.
- Choose metric to sort model.
- Add self-created metric function. # forester 1.0.0
- General code quality and readability improvements as well as improving the quality of the documentation.
check_data()
added detection of id columns and reformatted the outputs.create_ranked_list()
able to work with missing values.- Improved data sets documentation.
explain()
function redesigned to work on single and multiple models.plot_metrics()
improved all plots and changed them intoggplot
visualizations, added afeature importance
plot.- Renamed
predcit_models_all()
intopredict_models_all()
and enabled prediction on non-fixed, larger amount of models. - Added
predict_new()
function for new observations. - Expanded
preprocessing()
function by optional, advanced preprocessing consisting of deleting correlated values, deleting id columns, selecting only the most important features via the BORUTA algorithm. - Redesigned a file generated by the 'report()` function, by enhancing the quality of the plots.
- Added
save()
function which saves the output of thetrain()
function. - Redesigned a
score_models()
function, so that the user can add their own metric function for scoring models. - Redesigned a
train()
function by:- Adding a verbose parameter for silent runs.
- Adding parameters for limiting the iterations of random search and Bayesian optimization.
- Turning on the advanced preprocessing.
- Increased the output information by adding engine, predictions of all models, predictions of the best models and raw_training data set (unprocessed by
prepare_data()
).
- Added a
verbose_cat()
function for optional messages.