Nodes used to preprocess data within the AutoML framework
Data preprocessing is a necessary step in any robust machine-learning application. It is particularly important when it comes to automated pipelines where, by definition, most control is removed from the user.
.automl.X.node.function
Top-level preprocessing node functions
configuration
Pass run configuration into AutoML graph
featureData
Load feature data from process/alternative data source
targetData
Load target vector from process/alternative data source
dataCheck
Build configuration dictionary and check dataset is suitable
modelGeneration
Create table of models to apply
featureDescription
Information needed for report generation or running on new data
labelEncode
Encode symbol target data
dataPreprocessing
Preprocess data prior to application of ML algorithms
featureCreation
Generate appropriate features based on problem type
featureSignificance
Apply feature-significance tests and return significant features
trainTestSplit
Split features and target into training and testing sets
Given the automated nature of the pipeline, it is important to ensure only datatypes handled by the current feature-extraction procedures are passed through the workflow. If any columns within the user-defined dataset contain inappropriate types, they will be removed and this will be communicated to the user via the console.
The following lists the restricted types for each problem type. In each case, these types are not handled gracefully within the feature-extraction workflow and are omitted.
FRESH guid, byte, list, character, time/date
Normal guid, byte, list, character
NLP guid, byte, list
Further, given the requirement that feature data produced in the feature extraction process must have a 1-to-1 mapping to the input target vector, target consistency is checked prior to the application of feature extraction. The logic behind this check varies for each problem type and is specified below.
FRESH number of unique combinations of aggregate columns must equal number of targets
Normal number of rows in the input table must equal number of target values
NLP number of rows in the input table must equal number of target values
The models applied in an individual run of AutoML are selected based on the user-defined problem type, paired with additional information about the target data. The models available within the framework for each problem type are:
binary-classification multi-classification regression
-------------------------------------------------------------------------------------
AdaBoostClassifier AdaBoostClassifier AdaBoostRegressor
RandomForestClassifier RandomForestClassifier RandomForestRegressor
GradientBoostingClassifier GradientBoostingClassifier GradientBoostingRegressor
KNeighborsClassifier KNeighborsClassifier KNeighborsRegressor
MLPClassifier MLPClassifier MLPRegressor
Keras binary-classifier Keras multi-classifier Keras regressor
LogisticRegression LinearRegression
GaussianNB Lasso
SVC
LinearSVC
These models can be augmented by modifying models.json
in the folder automl/code/customization/models/modelConfig/
.
Keras architectures
The Keras models first introduced in V0.1 are basic single-layer neural networks.
A user can define more complex Keras architectures as desired for the use case in question if an appropriate architecture is known.
Within the pipeline, a symbol column is encoded according to the number of unique symbols it has:
- < 10: one-hot encoded
- > 10: frequency encoded
For FRESH, the above is performed on an aggregation-bucket basis for frequency encoding rather than for an entire column. This ensures encoding on new data is as fair as possible, with each aggregation bucket associated with an individual target.
Further, both null and infinite values are replaced within the feature data, due to the inability of both sklearn and keras machine-learning models to handle this form of data. The table below shows the replacement for each of these values. The aim is to limit changes to the data distribution as far as possible.
value symbol replacement
-------------------------------------------------
positive infinity 0w maximum column value
negative infinity -0w minimum column value
null 0n median column value
Where nulls are present an additional column is added to the feature table denoting the location of the null prior to filling of the dataset, thus encoding the null location in the case it is an important signal for prediction.
Feature extraction is the process of building derived or aggregate features from a dataset in order to provide the most useful inputs for machine-learning algorithms. Within the AutoML framework, there are three types of feature extraction available – normal, NLP and FRESH.
The early-stage releases of this repository limit the feature-extraction procedures that are performed by default on the tabular data for two reasons:
- Naïve application of many relevant feature-extraction procedures (truncated singular-value decomposition/bulk transforms) while potentially informative can expand memory usage and computation time beyond an acceptable level.
- Procedures applied in one field of use may not be relevant in another. The framework is designed for you to complete feature extractions which are domain-specific if required, rather than assuming procedures to be applied are ubiquitous in all cases.
Over time the system will come to perform tasks in a way cognizant of the above limitations, and where general frameworks can be assumed to be informative.
Normal feature extraction can be applied to non-timeseries problems that have a 1-to-1 mapping between features and targets. The current implementation of normal feature extraction splits time/date type columns into their component parts. No further feature extraction is applied by default.
The NLP (Natural Language Processing) feature extraction within AutoML makes use of the KX NLP library in addition to the Python gensim
library for data preprocessing.
The following steps are applied independently to all columns containing text data:
- Use
.nlp.findRegex
to retrieve information surrounding the occurrences of various expressions, e.g. references to URLs, money, the presence of phone numbers, etc. - Apply named entity recognition to detect references to products, individuals, art, etc.
- Apply sentiment analysis to the dataset to extract information about the positive/negative/neutral and compound nature of the text.
- Apply the function
.nlp.newParser
to extract stop words, tokens and any references to numbers. Using this data, calculate the percentages of a sentence that are numeric values, stop words or particular parts of speech. - Using the corpus tokens extracted in 4, use the Python library
gensim
to create a word2vec encoding of the dataset, such that we have a numerical representation of the ‘meaning’ of a sentence.
If any other non-text based columns are present, normal feature extraction is applied to those remaining columns to ensure no relevant information is ignored.
The FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm is used specifically for timeseries datasets. The data passed to FRESH should contain an identifying (ID) column, which groups the timeseries into subsets, from which features can be extracted. Note that each subset must have an associated target.
The feature-extraction functions applied within the FRESH procedure are defined within the ML Toolkit. A full explanation of this algorithm is also provided there.
Default setup:
- When running
.automl.fit
for FRESH data, by default the first column of the dataset is defined as the identifying (ID) column. - By default all functions defined within
.ml.fresh.params
are applied during the feature extraction phase.
Each of the above can be modified on a per-run basis.
Once the most significant features have been selected, the data is split into training and testing sets, where the testing set is required later in the process for optimizing the best model.
To split the data, a number of train-test split procedures are implemented for the different problem types:
problem type | function | description |
---|---|---|
Normal | .ml.trainTestSplit |
Shuffle the dataset and split into training and testing set with defined percentage in each |
FRESH | .automl.utils.ttsNonShuff |
Without shuffling, the dataset is split into training and testing set with defined percentage in each to ensure no time leakage. |
NLP | .ml.trainTestSplit |
Shuffle the dataset and split into training and testing set with defined percentage in each |
For classification problems, not all the distinct target classes will necessarily appear in both the training and testing sets. This is an issue for the Keras neural-network models, which require a sample from each target class present in both splits of the data.
For this reason, the function .automl.selectModels.targetKeras
must be applied to the data split prior to model training. The function determines if all classes are present in each split of the data. If not, the Keras models will be removed from the list of models to be tested.
Below shows the expected output .automl.fit
should the same number of classes not be present in each dataset.
q).automl.fit[table;target;`normal;`class;(::)]
...
Executing node: selectModels
Test set does not contain examples of each class. Removed any multi-keras models.
...
At this stage it is possible to begin model training and validation using cross-validation procedures. To do so, the data which is currently in the training set must be further split into training and validation sets. Following the same process as before, the Keras check must be applied again if Keras models are still present.
To reduce dimensionality in the data following feature extraction, significance tests are performed by default using the FRESH feature significance function contained within the ML Toolkit.
The default procedure uses the Benjamini-Hochberg-Yekutieli (BHY) procedure to identify significant features within the dataset. If no significant columns are returned, the top 25th percentile of features is selected.
Entry point used to pass run configuration into AutoML graph
.automl.configuration.node.function config
Where config
is a dictionary with custom configuration information relevant to the present run, returns the configuration dictionary ready to be passed to the relevant nodes within the pipeline.
Add default parameters to configuration while checking dataset is suitable for AutoML
.automl.dataCheck.node.function[config;features;target]
Where
config
is a dictionary containing information related to the current run of AutoMLfeatures
is the feature data as a tabletarget
is a numerical or symbol target vector
returns modified configuration, feature and target datasets. The function will error on issues with configuration, setup, target or feature dataset.
// Non-time series (normal) regression example table
features:([]asc 100?0t;100?1f;desc 100?0b;100?1f;asc 100?1f)
// Regression target
target:asc 100?1f
// Create run configuration dictionary
config:`startDate`startTime`featureExtractionType`problemType`savedModelName!
(.z.D;.z.T;`normal;`reg;`)
// Join onto default dictionaries
config:.automl.paramDict[`general],.automl.paramDict[`normal],config
q)// Perform data checks
q).automl.dataCheck.node.function[config;features;target]
config | `startDate`startTime`featureExtractionType`problemType`saveOption`s..
features| +`x`x1`x2`x3`x4!(`s#00:06:00.139 00:12:17.160 00:43:43.460 00:45:23..
target | `s#0.009530776 0.01837959 0.0244211 0.0269967 0.03035461 0.05011425..
Preprocess data prior to application of ML algorithms
.automl.dataPreprocessing.node.function[config;features;symEncode]
Where
config
is a dictionary containing information related to the current run of AutoMLfeatures
is the feature data as a tablesymEncode
is a dictionary containing columns to symbol encode and their required encoding
returns the feature table with the data preprocessed appropriately.
q)// Non-time series (normal) example table
q)5#features:([]0n,99?1f;0.5,0n,98?1f;100?`5;100#10?`a`b`c)
x x1 x2 x3
----------------------------
0.5 kgnje b
0.9484394 fengd b
0.8146544 0.6096089 mmocm b
0.3859185 0.9320707 dmfif a
0.1676035 0.9610061 kmlcf b
q)// Configuration dictionary
q)config:`featureExtractionType`logFunc!
(`normal;.automl.utils.printFunction[`testLog;;1;1])
q)// Columns to symbol encode
q)symEncode:`freq`ohe!(`x2;`x3)
q)// Apply preprocessing
q)5#.automl.dataPreprocessing.node.function[config;features;symEncode]
Data preprocessing complete, starting feature creation
x x1 x2_freq x3_a x3_b x3_c x_null x1_null
---------------------------------------------------------
0.5359828 0.5 0.01 0 1 0 1 0
0.9484394 0.5366546 0.01 0 1 0 0 1
0.8146544 0.6096089 0.01 0 1 0 0 0
0.3859185 0.9320707 0.01 1 0 0 0 0
0.1676035 0.9610061 0.01 0 1 0 0 0
Generate appropriate features based on problem type
.automl.featureCreation.node.function[config;features]
Where
config
is a dictionary containing information related to the current run of AutoMLfeatures
is feature data as a table
returns a dictionary containing original features with any additional ones created, along with time taken and any saved models in the case of generation of a word2vec model.
q)// Non-time series (normal) example table
q)3#features:([]asc 100?0t;100?1f;100?2;100?1f;asc 100?1f)
x x1 x2 x3 x4
-----------------------------------------------
00:07:28.748 0.3852334 0 0.6497915 0.005637949
00:11:06.877 0.6474614 0 0.848094 0.009473082
00:14:38.117 0.2007083 0 0.1653044 0.009778301
q)// Configuration dictionary
q)config:`featureExtractionType`functions!
(`normal;`.automl.featureCreation.normal.default)
q)// Run feature creation
q).automl.featureCreation.node.function[config;features]
creationTime| 00:00:00.001
features | +`x1`x2`x3`x4`x_hh`x_uu`x_ss!(0.3852334 0.6474614 0.2007083 0.0..
featModel | ()
q)// NLP example table
q)3#features:([]100?1f;100?("Testing the application of nlp";"With different characters"))
x x1
------------------------------------------
0.173315 "With different characters"
0.2901047 "With different characters"
0.8082814 "Testing the application of nlp"
q)// Configuration dictionary
q)config:`featureExtractionType`functions`w2v`savedWord2Vec`seed!
(`nlp;`.automl.featureCreation.normal.default;0;0b;42)
q)// Run feature creation
q).automl.featureCreation.node.function[config;features]
creationTime| 00:00:05.978
features | +`ADJ`ADP`DET`NOUN`VERB`col0`col1`col2`col3`col4`col5`col6`col7..
featModel | {[f;x]embedPy[f;x]}[foreign]enlist
q)// FRESH example table
q)3#features:([]5000?100?0p;asc 5000?1f;5000?1f;desc 5000?10f;5000?0b)
x x1 x2 x3 x4
----------------------------------------------------------------
2001.12.07D11:54:52.182176560 0.000267104 0.3443026 9.99641 0
2000.08.06D11:21:24.919296204 0.0004898147 0.6505359 9.996256 0
2002.09.06D00:33:11.276746992 0.001093083 0.5621585 9.995282 1
q)// Configuration dictionary
q)config:`featureExtractionType`functions`aggregationColumns!`fresh`.ml.fresh.params`x
q)// Run feature creation
q).automl.featureCreation.node.function[config;features]
creationTime| 00:00:06.176
features | +`x1_absenergy`x1_abssumchange`x1_count`x1_countabovemean`x1_co..
featModel | ()
Loading of feature dataset from process/alternative data source
.automl.featureData.node.function config
Where config
is a dictionary with the location and method by which to retrieve the data, returns feature data as a table.
q)// Dictionary with information for .ml.i.loadDataset
q)config:`directory`fileName`typ`schema`separator!
("home/data";"features.csv";`csv;"FFSJ";enlist",")
q)// Load in feature dataset
q).automl.featureData.node.function config
x x1 x2 x3
---------------------------
0.831001 0.06119115 b 0
0.2386444 0.5510458 c 4
0.4626596 0.456294 a 5
0.7125073 0.6581265 b 0
0.4478417 0.9841029 b 1
0.236602 0.5767729 a 7
0.6568185 0.05316387 c 8
0.9114983 0.6573407 c 3
0.01380875 0.987491 b 2
0.4625509 0.8070207 b 1
0.6369492 0.7927667 b 8
0.5779229 0.8195301 a 7
0.1649935 0.2466953 a 6
0.9628137 0.3021382 c 0
0.8119738 0.3621913 c 7
..
Retrieve information needed for report generation and running on new data
.automl.featureDescription.node.function[config;features]
Where
config
is a dictionary containing information related to the current run of AutoMLfeatures
is feature data as a table
returns a dictionary with symbol encoding, feature data and description.
q)// FRESH regression example table
q)5#features:([]idx:100?`4;100?1f;100?`4;100?`a`b`c;100?1f)
idx x x1 x2 x3
----------------------------------
fjcj 0.1252898 ggfk c 0.6136512
pnoj 0.5655123 ghlj a 0.9441251
cphm 0.2613049 nega b 0.603996
cedn 0.004608619 dmad c 0.3888924
kihe 0.5062673 lkaj c 0.0486924
q)// Configuration with logging function and feature extraction type
q)config:`logFunc`featureExtractionType!
(.automl.utils.printFunction[`testLog;;1;1];`fresh)
q)// Run node
q)show output:.automl.featureDescription.node.function[config;features]
The following is a breakdown of information for each of the relevant columns in the dataset
| count unique mean std min max type
- | -----------------------------------------------------------------
x | 100 100 0.4860504 0.3116851 0.0035863 0.9785984 numeric
x3 | 100 100 0.5086307 0.276274 0.00969842 0.9988041 numeric
idx| 100 100 :: :: :: :: categorical
x1 | 100 100 :: :: :: :: categorical
x2 | 100 3 :: :: :: :: categorical
symEncode | `freq`ohe!(`idx`x1;,`x2)
dataDescription| `x`x3`idx`x1`x2!+`count`unique`mean`std`min`max`type!(100 10..
features | +`idx`x`x1`x2`x3!(`fjcj`pnoj`cphm`cedn`kihe`gcbf`gidn`jcen`i..
Apply feature significance logic to data and return the columns deemed to be significant
.automl.featureSignificance.node.function[config;features;target]
Where
config
is a dictionary containing information related to the current run of AutoMLfeatures
is the feature data as a tabletarget
is a numerical or symbol target vector
returns a dictionary containing a symbol list of significant features and the feature data post-feature extraction.
q)// Normal feature table
q)5#features:([]asc 100?1f;100?1f;100?1f;100?10;100?5?1f;100?10)
x x1 x2 x3 x4 x5
------------------------------------------------
0.01976295 0.831001 0.06119115 2 0.3520493 7
0.03947309 0.2386444 0.5510458 2 0.05040009 3
0.0486924 0.4626596 0.456294 8 0.5467665 1
0.06382153 0.7125073 0.6581265 9 0.7096579 4
0.06708218 0.4478417 0.9841029 0 0.05040009 0
q)// Regression target
q)target:asc 100?1f
q)// Configuration dictionary containing functions to apply (below is default)
q)config:`logFunc`significantFeatures!
(.automl.utils.printFunction[`testLog;;1;1];`.automl.featureSignificance.significance)
q)// Number of features before feature selection
q)count cols features
6
q)// Apply feature selection
q)show 5#sigFeats:.automl.featureSignificance.node.function[config;features;target][`sigFeats]#features
Total number of significant features being passed to the models = 1
x
----------
0.01976295
0.03947309
0.0486924
0.06382153
0.06708218
q)// Number of features after feature selection
q)count cols sigFeats
1
Encode non numeric target data
.automl.labelEncode.node.function[target]
Where target
is a numerical or symbol target vector, returns a dictionary containing a mapping between symbols and associated labels, if necessary, and the numerical representation of the encoded target used throughout the framework.
q)// Multi-classification target
q)show target:100?`a`b`c
`a`a`b`a`c`c`a`a`b`a`c`a`b`c`c`a`b`c`c`b`c`b`c`b`c`b`b`b`a`c`b`c`a`b`b`c`c`a`..
q)// Encode target vector
q).automl.labelEncode.node.function target
symMap| `s#`a`b`c!0 1 2
target| 0 0 1 0 2 2 0 0 1 0 2 0 1 2 2 0 1 2 2 1 2 1 2 1 2 1 1 1 0 2 1 2 0 1 1..
q)// Multi-classification non symbol target
q)show target:100?3
2 2 2 2 2 1 0 2 1 0 0 1 2 2 1 0 2 1 2 0 2 1 0 2 0 1 2 2 2 0 2 0 1 2 2 2 1 0 2..
q)// Encode target vector
q).automl.labelEncode.node.function target
symMap| ()!()
target| 2 2 2 2 2 1 0 2 1 0 0 1 2 2 1 0 2 1 2 0 2 1 0 2 0 1 2 2 2 0 2 0 1 2 2..
Create the table of models to apply based on problem type and user configuration
.automl.modelGeneration.node.function[config;target]
Where
config
is a dictionary containing information related to the current run of AutoMLtarget
is a numerical or symbol target vector
returns a table containing information needed to apply appropriate models to data.
q)// Binary-classification target
q)show target:100?0b
11110000000100001010111101101000100111101110011000000010100100111100011101011..
q)// Configuration with problem type
q)config:enlist[`problemType]!enlist`class
q)// Generate model table
q).automl.modelGeneration.node.function[config;target]
model lib fnc seed typ apply minit ..
-----------------------------------------------------------------------------..
AdaBoostClassifier sklearn ensemble `seed multi 1 {[x;y;z]..
RandomForestClassifier sklearn ensemble `seed multi 1 {[x;y;z]..
GradientBoostingClassifier sklearn ensemble `seed multi 1 {[x;y;z]..
LogisticRegression sklearn linear_model `seed binary 1 {[x;y;z]..
GaussianNB sklearn naive_bayes :: binary 1 {[x;y;z]..
KNeighborsClassifier sklearn neighbors :: multi 1 {[x;y;z]..
MLPClassifier sklearn neural_network `seed multi 1 {[x;y;z]..
SVC sklearn svm `seed binary 1 {[x;y;z]..
LinearSVC sklearn svm `seed binary 1 {[x;y;z]..
BinaryKeras keras binary `seed binary 1 {[x;y;z]..
The model table produced contains the following columns:
model name of the model to be applied
lib Python library from which the model is derived
fnc sub module within the Python library from which a model is derived
seed is the model capable of being seeded (seed for yes, null for no)
typ type of problem being solved
apply if model is to be applied within the pipeline
minit definition of the model to be applied within the workflow
Load the target dataset from process/alternative data source
.automl.targetData.node.function config
Where config
is a dictionary with the location and method by which to retrieve the data, returns a numerical or symbol target vector.
q)// Dictionary with information for .ml.i.loadDataset
q)config:`typ`directory`fileName`schema!
(`splay;system"cd";"target";"B")
q)// Load in target vector
q).automl.targetData.node.function config
11001100110011000100100001100110010011100100001010010011110100110010010110000..
Split features and target into training and testing sets
.automl.trainTestSplit.node.function[config;features;target;sigFeats]
Where
config
is a dictionary containing information related to the current run of AutoMLfeatures
is the feature data as a tabletarget
is a numerical or symbol target vectorsigFeats
is a symbol list of significant features
returns a dictionary containing data separated into training and testing sets.
q)// Non-time series (normal) multi-classification example table
q)5#features:([]asc 100?1f;desc 100?1f;100?1f;100?1f)
x x1 x2 x3
------------------------------------------
0.02242087 0.9955855 0.05310517 0.168436
0.02380346 0.9833302 0.2166502 0.9325573
0.02425821 0.9758116 0.1526076 0.7633987
0.04089465 0.9730643 0.7080534 0.5577644
0.05110413 0.9657049 0.5551035 0.05737207
q)// Multi-classification target
q)target:asc 100?5
q)// Significant features
q)sigFeats:`x`x1
q)// Configuration dictionary with TTS function and testing size
q)config:`trainTestSplit`testingSize!(`.ml.trainTestSplit;.2)
q)// Split data
q).automl.trainTestSplit.node.function[config;features;target;sigFeats]
xtrain| (0.9587189 0.08810016;0.8257742 0.2153339;0.1260167 0.8713021;0.94780..
ytrain| 4 3 0 4 0 3 1 4 3 3 4 2 2 2 3 4 3 4 2 2 4 0 3 0 0 4 4 0 2 0 3 1 2 0 2..
xtest | (0.7512502 0.3852997;0.3269671 0.6824045;0.9601101 0.05887977;0.66922..
ytest | 3 1 4 2 0 3 2 0 3 4 4 2 2 1 3 0 2 3 2 1