AutoML Pipeline for Data Streams (AML4S)
This repository contains a loan data stream generator and a fully automated online machine learning method for data streams. It also contains visualization tools, experiments and examples of the method.
- To install AML4S from GitHub use:
git clone https://github.com/AuthEceSoftEng/automl-data-streams.git
- To create a loan dataset, use the `create_loandataset` function.
- To convert a dataset from arff to csv use the `convert_arff_to_csv` function.
- To prepare a dataset for the pipeline (if it’s not a list of dictionaries) from a CSV or from real datasets of River, use the `prepare_data` function.
- To create and use an instance of AML4S, use the `AML4S` class:
  - Create an instance of AML4S with `__init__`.
  - Create a small training data set.
  - Train AML4S for the first time with `init_train`.
  - Predict using AML4S with `predict_one`.
  - Train AML4S with a new instance with `learn_one`.
- To evaluate the created pipelines (one or more), use the `evaluation` function.
- To create plots for the evaluations, use the `create_plots` function.
- To create plots of dataset features, use the `data_plot` function.
- To create interactive diagrams from saved files of the experiments with online methods, run the file `plots_online_exp.py`.
- To create interactive diagrams from saved files of the experiments with OAML, run the file `plots_oaml_exp.py`.
- To create comparison diagrams from saved files of the experiments with online methods, run the file `comparison_with_online.py`.
- To create comparison diagrams from saved files of the experiments with OAML, run the file `comparison_with_online.py`.
A good example of how to use AML4S is included in the `AML4S_Usage` file.
Examples of how to use all the functions are included in the `Experiments` directory.
- File: `AML4S_class.py`
- Description: Contains the functions and the parameters of the AML4S object.
- Description: Creates the AML4S object (constructor).
- Function: `AML4S(target, data_drift_detector, consept_drift_detector)`
- Parameters:
  - `target` (str): The target variable for the model to predict.
  - `data_drift_detector` (boolean): True if there is a data drift detector, else False.
  - `consept_drift_detector` (boolean): True if there is a concept drift detector, else False.
  - `seed` (int | None): Random seed for reproducibility.
- Description: Trains the pipeline for the first time with a provided dataset.
- Function: `init_train(self, init_train_data)`
- Parameters:
  - `init_train_data` (list[dict]): List of dictionaries with the training data.
- Description: Predicts the target variable given the features.
- Function: `predict_one(self, x)`
- Parameters:
  - `x` (dict): Sample of data with the features.
- Returns:
  - `y` (int): Predicted target value.
- Description: Trains the pipeline sample by sample.
- Function: `learn_one(self, x, y)`
- Parameters:
  - `x` (dict): Sample of data with the features.
  - `y` (int): True target value of the sample.
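The methods above suggest a test-then-train loop: warm up with `init_train`, then call `predict_one` before `learn_one` on each new sample. The following is a minimal sketch of that loop under stated assumptions. The import path of the real class is not shown here, so `StubAML4S` is a hypothetical stand-in that only mimics the documented method names and predicts the majority class; `run_stream` and the `warmup` parameter are also illustrative names.

```python
from collections import Counter


class StubAML4S:
    """Hypothetical stand-in with the documented AML4S method names.

    It only predicts the most frequent label seen so far; it is NOT the
    real AML4S pipeline.
    """

    def __init__(self, target):
        self.target = target
        self.counts = Counter()

    def init_train(self, init_train_data):
        # Warm up on a small list of dictionaries that include the target.
        for row in init_train_data:
            self.counts[row[self.target]] += 1

    def predict_one(self, x):
        # Return the majority class, or None before any training.
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn_one(self, x, y):
        # Update the model with the true label of the sample.
        self.counts[y] += 1


def run_stream(model, stream, target, warmup=3):
    """Test-then-train loop: predict each sample before learning from it."""
    model.init_train(stream[:warmup])
    y_real, y_pred = [], []
    for row in stream[warmup:]:
        x = {k: v for k, v in row.items() if k != target}
        y = row[target]
        y_pred.append(model.predict_one(x))  # predict first
        model.learn_one(x, y)                # then learn from the true label
        y_real.append(y)
    return y_real, y_pred


# Toy stream: a loan is "approved" when the salary exceeds 30.
stream = [{"salary": s, "approved": int(s > 30)} for s in (10, 20, 40, 50, 60, 70, 15)]
y_real, y_pred = run_stream(StubAML4S("approved"), stream, "approved")
```

Replacing `StubAML4S` with the real `AML4S` object should leave the loop unchanged, provided the real methods follow the signatures listed above.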
- File: `AML4S_Usage.py`
- Description: Executes the AutoML pipeline on the provided dataset, including data drift and concept drift detection.
- Function: `use_AML4S(data, target, data_drift_detector, consept_drift_detector)`
- Parameters:
  - `data` (list): The dataset to be processed by the pipeline.
  - `target` (str): The target variable for the model to predict.
  - `data_drift_detector` (boolean): True if there is a data drift detector, else False.
  - `consept_drift_detector` (boolean): True if there is a concept drift detector, else False.
- Returns:
  - `y_real` (list): Real target values.
  - `y_pred` (list): Predicted target values.
  - `pipeline.data_drifts` (list): Detected data drifts.
  - `pipeline.concept_drifts` (list): Detected concept drifts.
- File: `Find_best_pipeline.py`
- Description: Finds the best-performing pipeline among various models and configurations, using data and concept drift detection methods.
- Function: `find_best_pipeline(x_train, y_train, data_drift_detector_method, concept_drift_detector_method)`
- Parameters:
  - `x_train` (list): Data with feature values for training.
  - `y_train` (list): Data with target values for training.
  - `data_drift_detector_method` (object): Method for detecting data drift.
  - `concept_drift_detector_method` (object): Method for detecting concept drift.
- Returns:
  - `pipeline` (object): The selected best pipeline.
  - `accuracy` (object): The accuracy of the selected pipeline.
  - `data_drift_detectors` (object): Data drift detectors used in the selected pipeline.
  - `concept_drift_detector` (object): The concept drift detector used in the selected pipeline.
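The selection idea, scoring several candidates on the training data and keeping the most accurate one, can be sketched as follows. This is an illustration only: the candidate models, the prequential scoring, and the names `MajorityClass`, `ThresholdRule`, `prequential_accuracy` and `pick_best` are assumptions, and the sketch ignores the drift detectors that the real function also returns.

```python
from collections import Counter


class MajorityClass:
    """Toy online model: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn_one(self, x, y):
        self.counts[y] += 1


class ThresholdRule:
    """Toy online model: predicts 1 when a feature exceeds a fixed threshold."""

    def __init__(self, feature, threshold):
        self.feature, self.threshold = feature, threshold

    def predict_one(self, x):
        return int(x[self.feature] > self.threshold)

    def learn_one(self, x, y):
        pass  # fixed rule, nothing to update


def prequential_accuracy(model, x_train, y_train):
    """Predict each sample before learning from it and return the accuracy."""
    correct = 0
    for x, y in zip(x_train, y_train):
        correct += int(model.predict_one(x) == y)
        model.learn_one(x, y)
    return correct / len(y_train)


def pick_best(candidates, x_train, y_train):
    """Return (name, model, accuracy) of the highest-scoring candidate."""
    scored = [(prequential_accuracy(m, x_train, y_train), name, m)
              for name, m in candidates.items()]
    best_acc, best_name, best_model = max(scored, key=lambda item: item[0])
    return best_name, best_model, best_acc


x_train = [{"salary": s} for s in (10, 40, 20, 50, 60, 15)]
y_train = [0, 1, 0, 1, 1, 0]
candidates = {"majority": MajorityClass(), "threshold": ThresholdRule("salary", 30)}
best_name, best_model, best_acc = pick_best(candidates, x_train, y_train)
```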
- File: `Change_pipeline.py`
- Description: Trains and evaluates a new AutoML pipeline, selecting it if it performs better than the current one.
- Function: `change_pipeline(pipeline_old, x_train, y_train, data_drift_detectors_old, concept_drift_detector_old, data_drift_detector_method, concept_drift_detector_method, buffer_accuracy)`
- Parameters:
  - `pipeline_old` (object): The existing classifier pipeline.
  - `x_train` (list): Data with feature values for training.
  - `y_train` (list): Data with target values for training.
  - `data_drift_detectors_old` (object): The existing pipeline's data drift detectors.
  - `concept_drift_detector_old` (object): The existing pipeline's concept drift detector.
  - `data_drift_detector_method` (object): The method for detecting data drift.
  - `concept_drift_detector_method` (object): The method for detecting concept drift.
  - `buffer_accuracy` (object): The accuracy of the current model in the buffer.
- Returns:
  - `pipeline` (object): The selected pipeline (either new or old).
  - `accuracy` (object): The accuracy of the selected pipeline.
  - `data_drift_detectors` (object): The data drift detectors used in the selected pipeline.
  - `concept_drift_detector` (object): The concept drift detector used in the selected pipeline.
- File: `Simple_pipeline_use.py`
- Description: Constructs a simple machine learning pipeline using a model, an optional preprocessor, and an optional feature selector. It then trains and evaluates the pipeline on the provided dataset.
- Function: `simple_pipeline(model, preprocessor, feature_selector, data, target)`
- Parameters:
  - `model` (object): The machine learning model to be used in the pipeline.
  - `preprocessor` (object or None): An optional preprocessing object. If `None`, no preprocessing is applied.
  - `feature_selector` (object or None): An optional feature selector object. If `None`, no feature selection is applied.
  - `data` (list): The dataset to be used for training and prediction. Each element should be a dictionary of features.
  - `target` (str): The name of the target variable in the dataset.
- Returns:
  - `y_real` (list): The actual target values from the dataset.
  - `y_pred` (list): The predicted target values from the pipeline.
  - `data_drifts` (list): A placeholder list, empty in this implementation.
  - `concept_drifts` (list): A placeholder list, empty in this implementation.
- File: `Convert_arff_to_csv`
- Description: File converter from ARFF to CSV.
- Function: `convert_arff_to_csv('arff_name.arff', 'csv_name.csv')`
- Parameters:
  - `arff_file` (string): Path of the ARFF file, e.g. `'arff_name.arff'`.
  - `csv_name` (string): Path of the new CSV file.
- Output:
  - The saved CSV file.
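For readers unfamiliar with the ARFF layout, the conversion can be sketched with the standard library alone. This is not the repository's implementation: `arff_to_csv_sketch` is a hypothetical name, and the sketch assumes a dense ARFF file with unquoted attribute names and no commas inside values.

```python
import csv
import os
import tempfile


def arff_to_csv_sketch(arff_path, csv_path):
    """Convert a simple dense ARFF file to CSV (header row = attribute names)."""
    names, rows, in_data = [], [], False
    with open(arff_path, encoding="utf-8") as src:
        for raw in src:
            line = raw.strip()
            if not line or line.startswith("%"):
                continue  # skip blank lines and ARFF comments
            lower = line.lower()
            if in_data:
                rows.append([value.strip() for value in line.split(",")])
            elif lower.startswith("@attribute"):
                names.append(line.split()[1])  # second token is the attribute name
            elif lower.startswith("@data"):
                in_data = True  # every following line is a data row
    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(names)
        writer.writerows(rows)
    return csv_path


# Small round trip on a temporary file.
tmp_dir = tempfile.mkdtemp()
arff_path = os.path.join(tmp_dir, "demo.arff")
csv_path = os.path.join(tmp_dir, "demo.csv")
with open(arff_path, "w", encoding="utf-8") as f:
    f.write("@relation loans\n"
            "@attribute salary numeric\n"
            "@attribute approved {0,1}\n"
            "@data\n"
            "1000,0\n"
            "3000,1\n")
arff_to_csv_sketch(arff_path, csv_path)
with open(csv_path, newline="", encoding="utf-8") as f:
    converted = list(csv.reader(f))
```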
- File: `Create_loandataset.py`
- Description: Creates a loan dataset with specified drifts.
- Function: `create_loandataset(class_num, datalimit, conceptdriftpoints, datadriftpoints, seed)`
- Parameters:
  - `class_num` (2, 3, 4): Number of classes in the output of the generator.
  - `datalimit` (int): Number of data samples in the dataset (e.g., 30000).
  - `conceptdriftpoints` (list[dict]): Points of drifts with function names (e.g., [4000: "crisis", 10000: "normal"]).
  - `datadriftpoints` (list[dict]): Points of drifts with function names (e.g., [2000: "crisis", 8000: "normal"]).
  - `seed` (int): Seed for dataset reproducibility (e.g., 42).
- Returns:
  - `data` (list[dict]): List of dictionaries containing the created dataset.
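One way to read the drift-point arguments is as a schedule that says which function is in force at each sample index. The sketch below illustrates that reading only. It assumes the points are given as a plain dictionary from index to function name, that the stream starts in the "normal" regime, and that a drift takes effect at the listed index itself; `active_regime` is a hypothetical helper, not part of the repository.

```python
from bisect import bisect_right


def active_regime(index, drift_points, default="normal"):
    """Return the regime name in force at a given sample index."""
    starts = sorted(drift_points)
    pos = bisect_right(starts, index)  # number of drift points at or before index
    return default if pos == 0 else drift_points[starts[pos - 1]]


concept_drift_points = {4000: "crisis", 10000: "normal"}
regimes = [active_regime(i, concept_drift_points)
           for i in (0, 3999, 4000, 9999, 10000)]
```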
- File: `Prepare_data.py`
- Description: Prepares the dataset for the pipeline.
- Function: `prepare_data(dataset)`
- Parameters:
  - `dataset` (str or River dataset): Path of a CSV file or a River dataset.
- Returns:
  - `data` (list[dict]): List of dictionaries with the dataset's data.
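The list-of-dictionaries format expected by the pipeline can be produced from CSV text with the standard library, as the sketch below shows. It is an illustration under assumptions: `csv_to_dicts` and `_parse` are hypothetical names, the function takes an open file handle rather than a path, and the numeric conversion rule (int, then float, else text) is a guess at what a pipeline would want, not the behavior of `prepare_data`.

```python
import csv
import io


def _parse(value):
    """Convert a CSV string to int or float when possible, else keep the text."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def csv_to_dicts(handle):
    """Read CSV rows from an open file handle into a list of dictionaries."""
    return [{key: _parse(val) for key, val in row.items()}
            for row in csv.DictReader(handle)]


sample = io.StringIO("salary,age,approved\n1200.5,31,1\n800,45,0\n")
data = csv_to_dicts(sample)
```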
- File: `Evaluation.py`
- Description: Evaluates the pipelines created.
- Function: `evaluation(y_real, y_predicted, metric_algorithm)`
- Parameters:
  - `y_real` (list[list]): Real target values from each pipeline.
  - `y_predicted` (list[list]): Predicted target values from each pipeline.
  - `metric_algorithm` (object): Instance of the metric for evaluation.
- Returns:
  - `results` (list[list]): Evaluation results for each pipeline.
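The nested-list shape of the inputs and outputs (one inner list per pipeline) is easiest to see with a small example. The sketch below hard-codes running accuracy as a stand-in for `metric_algorithm`; the name `running_accuracy` and the choice to record the metric after every sample are assumptions, not the repository's implementation.

```python
def running_accuracy(y_real, y_predicted):
    """Accuracy after each sample, computed separately for every pipeline."""
    results = []
    for real, pred in zip(y_real, y_predicted):
        correct, curve = 0, []
        for seen, (r, p) in enumerate(zip(real, pred), start=1):
            correct += int(r == p)
            curve.append(correct / seen)  # accuracy over the samples seen so far
        results.append(curve)
    return results


# One pipeline, four samples.
results = running_accuracy([[1, 0, 1, 1]], [[1, 1, 1, 0]])
```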
- File: `Create_Plots.py`
- Description: Creates plots for the evaluation metrics of each pipeline.
- Function: `create_plots(evaluates, data_drifts, concept_drifts)`
- Parameters:
  - `evaluates` (list[list]): Evaluation results from the `evaluation` function.
  - `data_drifts` (list[list]): Data drift points from each pipeline.
  - `concept_drifts` (list[list]): Concept drift points from each pipeline.
- Output:
  - Plot of the metric used in the evaluation, for all pipelines.
- File: `Comparison_with_OAML_basic_plot.py`
- Description: Creates plots in the same figure to compare the metric results of some methods with OAML-basic.
- Function: `compare_with_oaml(results)`
- Parameters:
  - `results` (list[list]): Evaluation results from the `evaluation` function and OAML results with step 1000 and start 6000.
- Output:
  - Figure with the metric plot of every method.
- File: `Data_plot.py`
- Description: Creates plots for dataset features.
- Function: `data_plot(data, step)`
- Parameters:
  - `data` (list[dict]): List of dictionaries containing the data.
  - `step` (int): Step of the visualization of the dataset.
- Output:
  - Plots of each feature in the dataset.
- File: `Accuracy_check.py`
- Description: Compares accuracy against a mean accuracy to decide if a model retrain is needed.
- Function: `accuracy_check(mean_accuracy, y_true_buffer, y_predicted_buffer)`
- Parameters:
  - `mean_accuracy` (float): The mean accuracy to compare against.
  - `y_true_buffer` (list): True target values of the last samples.
  - `y_predicted_buffer` (list): Predicted target values of the last samples.
- Returns:
  - `need_change` (boolean): Indicates if the accuracy difference exceeds a threshold.
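The check described above amounts to computing the accuracy on the recent buffer and comparing it with the long-run mean. A minimal sketch follows; the threshold value of 0.1, the one-sided comparison (only a drop triggers a change), and the name `accuracy_check_sketch` are assumptions, since the actual threshold is not stated here.

```python
def accuracy_check_sketch(mean_accuracy, y_true_buffer, y_predicted_buffer,
                          threshold=0.1):
    """Return True when buffer accuracy falls more than `threshold` below the mean."""
    hits = sum(t == p for t, p in zip(y_true_buffer, y_predicted_buffer))
    buffer_accuracy = hits / len(y_true_buffer)
    return (mean_accuracy - buffer_accuracy) > threshold


# Buffer accuracy 0.5 against a mean of 0.9: a retrain is suggested.
need_change = accuracy_check_sketch(0.9, [1, 1, 0, 0], [1, 0, 1, 0])
# Buffer accuracy 1.0 against a mean of 0.8: no retrain.
no_change = accuracy_check_sketch(0.8, [1, 1, 0, 0], [1, 1, 0, 0])
```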
- File: `Split_data.py`
- Description: Splits the data into features and target.
- Function: `split_data(dictionary, target_key)`
- Parameters:
  - `dictionary` (dict): Dictionary containing features and target value.
  - `target_key` (string): Name of the target variable.
- Returns:
  - `features` (dict): Features of the input sample.
  - `target`: Target value of the sample.
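A split of this kind can be written in a few lines. The version below is a sketch with a hypothetical name, `split_data_sketch`; it builds a new dictionary so the input sample is left unmodified, which may or may not match the repository's behavior.

```python
def split_data_sketch(dictionary, target_key):
    """Return (features, target) without mutating the input dictionary."""
    features = {k: v for k, v in dictionary.items() if k != target_key}
    return features, dictionary[target_key]


sample = {"salary": 1200, "age": 31, "approved": 1}
features, target = split_data_sketch(sample, "approved")
```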
The generator can produce datasets with data and concept drifts at specified points.
- File for 2 class output: `loandataset_2_class.py`
- File for 3 class output: `loandataset_3_class.py`
- File for 4 class output: `loandataset_4_class.py`
- Description: Loan dataset generator.
- Concept drift functions:
  - crisis: Tighter limits.
  - normal: Normal limits.
  - growth: Looser limits.
- Data drift functions:
  - crisis: Smaller salaries.
  - normal: Normal salaries.
  - growth: Bigger salaries.
To create a loan dataset, use the `create_loandataset` function.