Enable three-way pipelines for transformations only on training data #17

Open
fmohr opened this issue May 4, 2023 · 0 comments
Labels
enhancement New feature or request

fmohr commented May 4, 2023

The main problem with the standard pipeline logic is that fit_transform, which is called on every pre-processor in the pipeline, simply applies fit and then transform. As a result, transform treats the training data exactly like any other data that passes through the pipeline later in a standard transform call.

Some pre-processors should only transform the data they were fitted on, i.e., they should act in fit_transform but not in an ordinary transform. One solution is to use three different methods: fit, transform_fitted_data, and transform.

A classic example is SMOTE, which should behave as follows in the three phases:

  1. fit: Memorizes the data
  2. transform_fitted_data: Applies upsampling based on the given data
  3. transform: Does nothing
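
The three phases listed above could look roughly like the sketch below. It is only an illustration: `DuplicatingOversampler` is a hypothetical class that balances classes by repeating minority rows, which is much simpler than real SMOTE (SMOTE interpolates new synthetic points). That `transform_fitted_data` takes no arguments and returns both `X` and `y` is also an assumption; the proposal does not fix a signature.

```python
from collections import Counter


class DuplicatingOversampler:
    """Hypothetical stand-in for SMOTE: balances classes by repeating rows."""

    def fit(self, X, y):
        # Phase 1: only memorize the training data.
        self.X_ = list(X)
        self.y_ = list(y)
        return self

    def transform_fitted_data(self):
        # Phase 2: upsample the memorized training data until every class
        # has as many rows as the largest class.
        counts = Counter(self.y_)
        target = max(counts.values())
        X_out, y_out = list(self.X_), list(self.y_)
        for label, count in counts.items():
            rows = [x for x, lab in zip(self.X_, self.y_) if lab == label]
            for i in range(target - count):
                X_out.append(rows[i % len(rows)])
                y_out.append(label)
        return X_out, y_out

    def transform(self, X):
        # Phase 3: data seen after fitting passes through unchanged.
        return list(X)


X_train = [[0.0], [1.0], [2.0], [10.0]]
y_train = [0, 0, 0, 1]

sampler = DuplicatingOversampler().fit(X_train, y_train)
X_res, y_res = sampler.transform_fitted_data()  # balanced training data
X_new = sampler.transform([[5.0], [6.0]])       # untouched
```

Here the training set grows from four to six rows so that both classes have three samples, while the two rows passed to `transform` come back as they were.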

Alternatively, one could extend the signature of the transform function with an optional parameter fitted_data: bool. The pipeline can then set this parameter to true when fit_transform is used. If the parameter is absent, no distinction should be made between the fitted data and other data.
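
A minimal sketch of this alternative, under some assumptions: `MiniPipeline`, `Scaler`, and `TrainOnlyDuplicator` are hypothetical names made up for illustration, labels are left out for brevity, and detecting the parameter via `inspect.signature` is just one possible way for a pipeline to decide whether a step supports the flag.

```python
import inspect


class Scaler:
    """Ordinary step: its transform has no fitted_data parameter."""

    def fit(self, X):
        self.max_ = max(X)
        return self

    def transform(self, X):
        return [x / self.max_ for x in X]


class TrainOnlyDuplicator:
    """Hypothetical step that only alters the data it was fitted on."""

    def fit(self, X):
        return self

    def transform(self, X, fitted_data=False):
        # Duplicate rows only for the training data; otherwise pass through.
        return list(X) + list(X) if fitted_data else list(X)


class MiniPipeline:
    """Hypothetical pipeline that sets fitted_data=True in fit_transform."""

    def __init__(self, steps):
        self.steps = steps

    @staticmethod
    def _accepts_flag(step):
        # True if the step's transform declares a fitted_data parameter.
        return "fitted_data" in inspect.signature(step.transform).parameters

    def fit_transform(self, X):
        for step in self.steps:
            step.fit(X)
            if self._accepts_flag(step):
                X = step.transform(X, fitted_data=True)
            else:
                # Parameter absent: training data is treated like any other.
                X = step.transform(X)
        return X

    def transform(self, X):
        # Ordinary transform: the flag keeps its default (False) everywhere.
        for step in self.steps:
            X = step.transform(X)
        return X


pipe = MiniPipeline([Scaler(), TrainOnlyDuplicator()])
train_out = pipe.fit_transform([1.0, 2.0, 4.0])
test_out = pipe.transform([2.0])
```

With this design, existing pre-processors such as `Scaler` need no changes, since only steps that declare `fitted_data` ever receive it.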

@fmohr fmohr added the enhancement New feature or request label May 4, 2023