MetaRecommender for Entropy Threshold based Active Learning

Introduction

This library provides utilities for meta-recommending entropy thresholds (z values) for Active Learning.

Currently, only stream-based active learning is supported.

Installation

On *nix systems, make install should suffice, as long as Python 3.7 is installed. On Windows, a Makefile interpreter such as Cygwin is required.

Usage

This library provides the following utilities:

Metadatabase generation

This library generates metadatabases for training a metarecommender of your choice. It works either with pre-specified datasets or by generating them on the fly, alternating between a set of stream generators.

The generators are taken from scikit-multiflow; the available ones are:

  • HyperplaneGenerator
  • LEDGeneratorDrift
  • MIXEDGenerator
  • RandomRBFGeneratorDrift
  • RandomTreeGenerator
  • SineGenerator
  • STAGGERGenerator

The generator can be invoked like this: meta_act.ds_gen.generate_datasets("HyperplaneGenerator", "./datasets", max_samples=100000, **gen_kwargs)

This will generate one dataset with the HyperplaneGenerator containing 100,000 samples and save it at ./datasets/; the gen_kwargs are passed on to the generator constructor.
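
For instance, a minimal sketch of a concrete call (the n_features keyword is just an illustrative generator argument, passed through gen_kwargs):

from meta_act.ds_gen import generate_datasets

# Writes one 100,000-sample Hyperplane dataset under ./datasets/;
# n_features is forwarded to the HyperplaneGenerator constructor.
generate_datasets("HyperplaneGenerator",
                  "./datasets",
                  max_samples=100000,
                  n_features=10)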

There is also an alternative that, instead of saving dataset files, returns a Python generator yielding in-memory datasets created from alternating generators with randomized hyperparameters. For example:

meta_act.ds_gen.dataset_generator([("HyperplaneGenerator",
                                   {"n_features": [5, 10, 15]}),
                                   ("LEDGeneratorDrift", {"noise_percentage": [0.0, 0.1, 0.5]}],
                                   max_samples=100000)

This will yield datasets indefinitely, alternating between the HyperplaneGenerator and the LEDGeneratorDrift and selecting a random value for each hyperparameter every time; in the case of the HyperplaneGenerator the hyperparameter would be n_features, and in the case of the LEDGeneratorDrift it would be noise_percentage. Note that as many hyperparameters as desired may be supplied; in fact, using as many as possible is encouraged. For numeric hyperparameters, the candidate values can easily be set through list(range(x, y, z)).
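
A sketch of consuming the generator (the exact form of each yielded dataset depends on the library, so the loop body is left open):

from meta_act.ds_gen import dataset_generator

datasets = dataset_generator(
    [("HyperplaneGenerator", {"n_features": list(range(5, 20, 5))}),
     ("LEDGeneratorDrift", {"noise_percentage": [0.0, 0.1, 0.5]})],
    max_samples=100000)

# The generator is infinite, so bound the iteration explicitly.
for i, dataset in enumerate(datasets):
    if i >= 3:
        break
    # ... feed `dataset` to the metadatabase creation step ...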

The actual metadatabase can be generated with the following function:

meta_act.metadb_craft.create_metadb(stream_files,
                                    z_vals_n=5,
                                    z_val_selection_margin=0.02,
                                    window_pre_train_sample_n=300,
                                    window_adwin_delta=0.0001,
                                    stop_conditions={"minority_target": 100},
                                    max_failures=100)

There are more arguments that can be passed; check the code for specifications. If stream_files is a list of strings, it is assumed to be a list of file paths, which will be loaded individually; if a generator is detected, the datasets will be taken from it. The latter is intended to be used with the Python generator described above.

Stop conditions interrupt the metadatabase creation regardless of how many stream files remain. They must be set if the dataset_generator is being used with infinite datasets; otherwise generation will loop indefinitely until the computer runs out of memory. The possible stop conditions are the following:

  • max_datasets: maximum number of datasets used;
  • max_samples: maximum number of samples in the metadatabase;
  • minority_target: maximum number of samples with the minority z value;
  • majority_target: maximum number of samples with the majority z value.

Multiple stop conditions may be set as well.
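
For example, a minimal sketch combining two stop conditions with an infinite generator (all parameter values here are illustrative, not recommendations):

from meta_act.ds_gen import dataset_generator
from meta_act.metadb_craft import create_metadb

streams = dataset_generator([("SineGenerator", {})],
                            max_samples=100000)

# Generation stops as soon as either condition is met: 50 datasets
# consumed, or 10,000 samples accumulated in the metadatabase.
create_metadb(streams,
              stop_conditions={"max_datasets": 50,
                               "max_samples": 10000},
              max_failures=100)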

The z_vals_n parameter specifies how many z values each dataset will be evaluated against. The z values are generated according to the maximum entropy of the dataset (log(n_classes, base=z_vals_base)); for example, if z_vals_n = 4 and the maximum entropy is 1.0, the z values being evaluated will be 0.2, 0.4, 0.6 and 0.8.
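
A sketch of that spacing logic (an assumption reconstructed from the example above, not the library's internal code):

import math

n_classes = 2
z_vals_base = 2
max_entropy = math.log(n_classes, z_vals_base)  # 1.0 for two classes

z_vals_n = 4
# Evenly spaced thresholds strictly inside (0, max_entropy):
z_vals = [round(max_entropy * i / (z_vals_n + 1), 2)
          for i in range(1, z_vals_n + 1)]
print(z_vals)  # [0.2, 0.4, 0.6, 0.8]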

If the output_path parameter is not set, the function returns the metadatabase in memory; otherwise it is saved as a CSV file at the specified path.

The max_failures parameter should preferably be set to a reasonably large number. It determines how many datasets may fail before the metadatabase generation is aborted, which is especially important when using an infinite generator, since in case of recurring errors the stop conditions may never be reached.

Online Window Extraction and MetaLearning

Window features may be extracted from a stream window with the function:

meta_act.windows.get_window_features(X, mfe_features, tsfel_config, summary_funcs, n_classes)

This can be facilitated with the ActiveLearner class in meta_act.act_learner.ActiveLearner. When its store_history parameter is set to True, the learner stores all data used to train its model, and calling ActiveLearner.get_last_window(mfe_features, tsfel_config, summary_funcs, n_classes) returns the features of the entire stored history.

If the n_classes attribute is set to a number, two extra features related to the classes in the stream are added: n_classes and max_possible_entropy. These two features are always added by the metadb creation function.

Either way, a single row of features is returned for use with the metalearner; X is expected to be a NumPy array.
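
A hedged sketch of the direct call (the values passed to mfe_features and summary_funcs are illustrative assumptions, forwarded to the underlying meta-feature extractors; None is assumed to fall back to defaults):

import numpy as np
from meta_act.windows import get_window_features

# One stream window: rows are samples, columns are attributes.
X = np.random.rand(300, 5)

features = get_window_features(X,
                               mfe_features=["nr_inst", "nr_attr"],
                               tsfel_config=None,
                               summary_funcs=["mean", "sd"],
                               n_classes=2)  # adds n_classes and
                                             # max_possible_entropy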

With these features, it is possible to use the MetaLearner class from meta_act.metalearn.MetaLearner(learner, *learner_args, **learner_kwargs). If the learner parameter is a string, it is assumed to be the path of an already trained model to load (joblib is required); in this case learner_args and learner_kwargs are simply ignored. Otherwise, learner is treated as a scikit-learn algorithm class, which will be initialized with the args and kwargs provided.
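
Both construction modes, as a sketch (the file path and the choice of estimator are illustrative assumptions):

from sklearn.ensemble import RandomForestRegressor
from meta_act.metalearn import MetaLearner

# Fresh model: pass the estimator class plus its constructor kwargs.
meta_model = MetaLearner(RandomForestRegressor, n_estimators=200)

# Pre-trained model: pass a path string (requires joblib); any extra
# args/kwargs would be ignored in this mode.
loaded = MetaLearner("./models/metalearner.joblib")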

meta_act.metalearn.MetaLearner.fit(X, y, oversample=True, scale=True, test_data=None) trains the model. If oversample is set to True, the training dataset is oversampled with SMOTE; if scale is set to True, the X data is scaled with scikit-learn's StandardScaler. If test_data is set, it must be a two-element tuple, the first element being a test X array and the second a test y array; the results will then include test metrics (R², MSE and MAE on the test data).
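
A sketch of training with held-out test data, assuming metadb_X and metadb_y were loaded from a previously generated metadatabase:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(metadb_X, metadb_y,
                                                    test_size=0.2)

results = meta_model.fit(X_train, y_train,
                         oversample=True,  # SMOTE on the training set
                         scale=True,       # StandardScaler on X
                         test_data=(X_test, y_test))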

After the model is trained, it can predict z values for a number of samples with meta_act.metalearn.MetaLearner.predict(X), report metrics on test data with meta_act.metalearn.MetaLearner.test(X, y), and be saved to a file with meta_act.metalearn.MetaLearner.save_model(filepath). Saving the model creates two files: a metadata file containing various data about the training environment, and the serialized model file, which can be loaded as described above. The metadata file is only loaded if it is present in the same directory as the serialized model file.
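
Putting those calls together (the file path is illustrative, and `features` is the window-feature row from the earlier sketch):

# Predict z values for freshly extracted window features, evaluate on
# held-out data, then persist the model (this also writes the metadata
# file next to it).
z_values = meta_model.predict(features)
metrics = meta_model.test(X_test, y_test)
meta_model.save_model("./models/metalearner.joblib")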
