Enable three-way pipelines for transformations only on training data #17

fmohr · 2023-05-04T13:37:08Z

The main problem with the standard pipeline logic is that fit_transform, which is called on every pre-processor in the pipeline, first applies fit and then transform. That transform call treats the training data exactly like any other data that passes through the pipeline in a later, standalone transform call.

Some pre-processors should apply their transformation only in the transform step that is coupled to the fit step, i.e., only in fit_transform and not in an ordinary transform. One solution is to use three different methods: fit, transform_fitted_data, transform.

A classic example is SMOTE, which should behave as follows in the different phases:

fit: Memorizes the data
transform_fitted_data: Applies upsampling based on the given data
transform: Does nothing
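
The three phases above could be sketched roughly as follows. This is only an illustration under several assumptions: NaiveOversampler is a hypothetical stand-in that duplicates minority-class rows at random (it is not SMOTE, which interpolates new samples), ThreeWayPipeline is a hypothetical pipeline class, and every step takes and returns an (X, y) pair because resampling changes the labels too, which differs from the usual transform convention.

```python
import random
from collections import Counter


class NaiveOversampler:
    """Hypothetical stand-in for SMOTE: duplicates minority-class rows at random."""

    def __init__(self, seed=0):
        self.seed = seed

    def fit(self, X, y):
        # Phase 1: memorize the training data.
        self.X_ = list(X)
        self.y_ = list(y)
        return self

    def transform_fitted_data(self, X, y):
        # Phase 2: upsample every class to the size of the largest class.
        rng = random.Random(self.seed)
        counts = Counter(y)
        target = max(counts.values())
        X_out, y_out = list(X), list(y)
        for label, count in counts.items():
            rows = [x for x, lab in zip(X, y) if lab == label]
            for _ in range(target - count):
                X_out.append(rng.choice(rows))
                y_out.append(label)
        return X_out, y_out

    def transform(self, X, y=None):
        # Phase 3: data arriving after fitting passes through unchanged.
        return X, y


class ThreeWayPipeline:
    """Hypothetical pipeline that distinguishes training-time transforms."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X, y):
        for step in self.steps:
            step.fit(X, y)
            # Steps without the third method fall back to plain transform.
            train_transform = getattr(step, "transform_fitted_data", step.transform)
            X, y = train_transform(X, y)
        return X, y

    def transform(self, X, y=None):
        for step in self.steps:
            X, y = step.transform(X, y)
        return X, y
```

With this sketch, fit_transform on an imbalanced training set returns a balanced one, while a later transform call on new data returns that data untouched. The getattr fallback keeps ordinary two-method pre-processors usable in the same pipeline.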

Alternatively, one could extend the signature of the transform function with an optional parameter fitted_data: bool. The pipeline can then set this parameter to True when fit_transform is used. If the parameter is absent, then no distinction should be made between the fitted data and other data.
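
A minimal sketch of this flag-based variant, again under assumptions: OutlierFilter, Doubler, and FlagPipeline are hypothetical names chosen for illustration, and the pipeline detects the optional parameter by inspecting the signature of each step's transform method, which is only one possible way to do it.

```python
import inspect


class OutlierFilter:
    """Hypothetical step that honours the optional fitted_data flag."""

    def __init__(self, limit):
        self.limit = limit

    def fit(self, X):
        return self

    def transform(self, X, fitted_data=False):
        # Drop outliers only from the data the step was fitted on.
        if fitted_data:
            return [v for v in X if abs(v) <= self.limit]
        return X


class Doubler:
    """Hypothetical ordinary step whose transform has no fitted_data flag."""

    def fit(self, X):
        return self

    def transform(self, X):
        return [2 * v for v in X]


class FlagPipeline:
    """Hypothetical pipeline that passes fitted_data=True during fit_transform."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X):
        for step in self.steps:
            step.fit(X)
            params = inspect.signature(step.transform).parameters
            if "fitted_data" in params:
                X = step.transform(X, fitted_data=True)
            else:
                # No flag: fitted data is treated like any other data.
                X = step.transform(X)
        return X

    def transform(self, X):
        for step in self.steps:
            X = step.transform(X)
        return X
```

Compared with a third method, the flag keeps the interface at two methods, at the cost of a pipeline that has to check each step for the extra parameter.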

fmohr added the enhancement (New feature or request) label May 4, 2023