naiveautoml
is a tool to find optimal machine learning pipelines for
- classification tasks (binary, multi-class, or multi-label) and
- regression tasks.
Other than most AutoML tools, naiveautoml
has no (also no implicit) definitions of timeouts. While timeouts can optionally provided, naiveautoml
will simply stop as soon as it believes that no better pipeline can be found; this can be surprisingly quick.
Install via pip install naiveautoml.
The current version is 0.1.5.
We highly recommend to check out the usage example python notebook.
Finding an optimal model for your data is then as easy as running:
import naiveautoml
import sklearn.datasets
naml = naiveautoml.NaiveAutoML()
X, y = sklearn.datasets.load_iris(return_X_y=True)
naml.fit(X, y)
print(naml.chosen_model)
The task type (here classification) is derived automatically, but it can also be specified via task_type
with values in classification
, regression
or multilabel-indicator
to be sure.
To get the history of considered pipelines, together with a (relative) timestamp and internal validation scores, you can access the history:
print(naml.history)
Want to limit the number of candidates considered during hyper-parameter tuning?
naml = naiveautoml.NaiveAutoML(max_hpo_iterations=20)
Want to put a timeout? Specify it in seconds (should be always bigger than 10s to avoid strange side effects).
naml = naiveautoml.NaiveAutoML(timeout_overall=20)
You can modify the pipeline timeout on single pipeline evaluations with
naml = naiveautoml.NaiveAutoML(timeout_candidate=20)
However, be aware that on many pipelines this time out is not enforced since this not safely possible without memory leakage or malfunction.
This can also be combined with max_hpo_iterations
.
Want to see the progress bar for the optimization process?
naml = naiveautoml.NaiveAutoML(show_progress=True)
Want logging?
# configure logger
import logging
logger = logging.getLogger('naiveautoml')
logger.setLevel(logging.INFO)
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)
By default, log-loss is used for classification (AUROC in the case of binary classification). To use a custom scoring function, pass it in the constructor:
naml = naiveautoml.NaiveAutoML(scoring="accuracy")
To additionally evaluate other scoring functions (not used to rank candidates), you can use a list of passive_scorings
:
naml = naiveautoml.NaiveAutoML(scoring="accuracy", passive_scorings=["neg_log_loss", "f1_score"])
You can also pass a custom scoring function through a tuple:
scorer = make_scorer(**{
"score_func": lambda y, y_pred: (y == y_pred).mean(),
"greater_is_better": True,
"response_method": "predict"
})
naml = naiveautoml.NaiveAutoML(scoring=("custom_accuracy", scorer))
another way is to directly pass the scoring function, and, in that case, the function name is inferred and used as a field:
def custom_accuracy(y, y_pred):
return (y == y_pred).mean()
scorer = make_scorer(**{
"score_func": custom_accuracy,
"greater_is_better": True,
"response_method": "predict"
})
naml = naiveautoml.NaiveAutoML(scoring=scorer)
Naive AutoML determines the categorical attributes automatically as far as possible.
However, sometimes even columns consisting only of numbers should be treated as categorical attributes.
To pass an explicit list of attributes that should be treated as categoricals, use the categorical_features
parameter in the fit
function:
naml.fit(X_df, y, categorical_features=["name_of_first_categorical_attribute", "name_of_second_categorical_attribute"])
alternatively (or if your data is a numpy array), you can use the index of the column:
naml.fit(X_df, y, categorical_features=[4, 9])
Please use the reference from the Machine Learning Journal to cite Naive AutoML:
https://link.springer.com/article/10.1007/s10994-022-06200-0#article-info
@article{mohr2022naive,
title={{Naive Automated Machine Learning}},
author={Mohr, Felix and Wever, Marcel},
journal={Machine Learning},
pages={1131--1170},
year={2022},
publisher={Springer},
volume={112},
issue={4}
}