PipelineGridSearchCV

An optimized usage case of scikit-learn's GridSearchCV and Pipeline that avoids unnecessary calls to fit()/transform() by doing grid search in depth first search (DFS) order.

Motivation

Running GridSearchCV on a Pipeline classifier will cause a full pipeline computation for each set of parameters, despite some parameters being the same as before, causing unnecessary calls to fit/transform.

Doing 5-fold CV using GridSearchCV on a Pipeline will cause prod(param_counts)*pipe_length*5+pipe_length calls to fit(), where prod(param_counts) is the product of the number of grid search parameters for each part of the pipeline, and pipe_length is the length of the pipeline.

PipelineGridSearchCV, on the other hand, requires only the optimal (nverts-1)*5+pipe_length calls to fit(), where nverts is the number of vertices in the pipeline parameter call tree.

The idea is that the doing grid search on a pipeline estimator forms a directed tree of computation, where each grid search parameter causes the tree to split into a new subtree.

As an example, consider the following pipeline and grid search parameters:

pipe = Pipeline([
    ('pca', PCA()), 
    ('norm', Normalizer()), 
    ('svm', SVC())
    ])

cv_params = {
    'pca__n_components': [100,200],
    'norm__norm': ['l2'],
    'svm__C': [1,10,100],
    }

model = PipelineGridSearchCV(pipe, cv_params)

We have 2*1*3 = 6 parameter combinations to evaluate using grid search. Doing grid search on this pipeline will result in the following grid search tree.

Each pass from the root to a leaf corresponds to one specific selection of parameters for the estimators in the pipeline. We can see that calling model.fit() from the root of the tree for each set of parameters is unnecessary; it is sufficient to call it from nodes in depth first search (DFS) order if the parameters are updated in that order as well. For example, if we have previously called fit() from the root with the parameters {'pca__n_components': 100, 'norm__norm': 'l2', 'svm__C': 100} and then want to try the parameters {'pca__n_components': 100, 'norm__norm': 'l2', 'svm__C': 10}, then we only need to restart the pipeline computation from the SVM step; only the svm__C parameter has changed.

While GridSearchCV will call fit() from the root each time, PipelineGridSearchCV keeps track of the order of the searched parameters, and calls fit only from the nodes necessary in order to evaluate a new parameter choice, avoiding unnecessary repeated computation.

Below is a comparison of the number of calls to fit, comparing the optimized and naive approach to grid search. The pipe grid search structure for the above example corresponds to pipe param. count [2, 1, 3].

Pipe param. count	#calls PipelineGridSearchCV (A)	#calls GridSearchCV (B)	A/B
[2, 1, 3]	10	18	0.555
[10, 20, 30, 20]	126210	480000	0.263
[1, 1, 1, 10, 1, 20, 20]	4223	28000	0.151

We can see that PipelineGridSearchCV gives better performance the deeper and wider the pipeline grid search gets (smaller A/B ratio).

Usage

$ python examples/example.py
-----------------------------------
Elapsed time for doing grid search:
PipelineGridSearchCV: 12.456952095 secs
GridSearchCV: 38.9708809853 secs

Dependencies

Tested with:

python 2.7.3
numpy 1.9.2
scikit-learn 0.16.1
nosetests 1.3.4 (for testing)

Tests

Make sure that everything is working correctly by running the test suite:

$ make test

License

BSD 3-Clause (see LICENSE)

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
examples		examples
pipeline_grid_search		pipeline_grid_search
tests		tests
.gitignore		.gitignore
ISSUES.md		ISSUES.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
TODO		TODO

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PipelineGridSearchCV

Motivation

Usage

Dependencies

Tests

License

About

Releases

Packages

Languages

License

tkerola/pipeline_grid_search

Folders and files

Latest commit

History

Repository files navigation

PipelineGridSearchCV

Motivation

Usage

Dependencies

Tests

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages