
Pipeline in sklearn #293

Open
PolarsDude opened this issue Nov 26, 2024 · 10 comments

@PolarsDude

Hi!

Is it possible to use the polars_ds pipeline framework inside an sklearn Pipeline? Do you have any examples of this?

@abstractqqq
Owner

abstractqqq commented Nov 26, 2024

It is possible, though not recommended.

The benefit of wrapping it in an sklearn Pipeline is that you get some UI for free, but you will have to do extra work to make the UI informative.

The disadvantages are that:

  1. You add one more layer of abstraction.
  2. You lose the flexibility and concise syntax that polars_ds provides.
  3. You lose the ability to serialize the pipeline as json / dict.

Maybe one day I will add a UI to the polars_ds pipeline as well, but it is not really a priority right now.

If there is a transform that you really want, please open a feature request. Thank you! You can find the dataset used in the example in the ../examples/ folder on GitHub.

import polars as pl
import polars.selectors as cs
import polars_ds.pipeline as pds_pipe
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class CustomPDSTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        self.pipe = None

    def fit(self, df, y=None):
        # specify all the rules for the transform here
        bp = (
            pds_pipe.Blueprint(df, name = "example", target = "approved") 
            .lowercase() 
            .filter( 
                "city_category is not null" # or equivalently, you can do: pl.col("city_category").is_not_null()
            )
            .select(cs.numeric() | cs.by_name(["gender", "employer_category1", "city_category", "test_col"]))
            .linear_impute(features = ["var1", "existing_emi"], target = "loan_period") 
            .impute(["existing_emi"], method = "median")
        )
        self.pipe = bp.materialize()
        return self

    def transform(self, df, y=None):
        return self.pipe.transform(df)

# ---------------------------------------------------------------

df = pl.read_parquet("../examples/dependency.parquet")
df.head()

pipe = Pipeline(
    steps=[
        ("CustomPDSTransformer", CustomPDSTransformer())    
    ]
)
df_transformed = pipe.fit_transform(df)
df_transformed

@PolarsDude
Author

Thanks! The only reason I would use an sklearn pipeline is that it avoids data leakage between training and test data. Can the polars_ds pipeline handle data leakage?

@abstractqqq
Owner

Leakage is handled by splitting the df into train and test and fitting the pipeline only on train. That is how you avoid leaking. Is there anything else the sklearn pipeline does to prevent leaking?
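
For concreteness, here is a minimal sketch of that pattern, reusing the Blueprint calls from the earlier example (the split fraction and seed are illustrative):

import polars as pl
import polars_ds.pipeline as pds_pipe

df = pl.read_parquet("../examples/dependency.parquet")

# A random 80/20 split in plain Polars (polars_ds also ships splitting
# helpers, mentioned later in this thread).
df = df.with_row_index("row_nr")
df_train = df.sample(fraction=0.8, seed=42)
df_test = df.filter(~pl.col("row_nr").is_in(df_train["row_nr"]))
df_train, df_test = df_train.drop("row_nr"), df_test.drop("row_nr")

# Materialize the blueprint on train only, then reuse the fitted pipeline
# on test, so test statistics never influence the fitted parameters.
bp = pds_pipe.Blueprint(df_train, name="example", target="approved").impute(
    ["existing_emi"], method="median"
)
pipe = bp.materialize()
df_train_t = pipe.transform(df_train)
df_test_t = pipe.transform(df_test)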

@PolarsDude
Author

PolarsDude commented Nov 28, 2024

Thanks. One last question. Assume the data has been split into training and test sets: X_train, y_train, X_test, and y_test. We also have our polars_ds pipeline named bp. How would one use the polars_ds pipeline in combination with ML training/prediction in sklearn? Would it look something like this?

# Fit the pipeline
X_train_bp = bp.fit(X_train)

# For training
model.fit(X_train_bp, y_train)

# For prediction
model.predict(bp.transform(X_test))

@abstractqqq
Owner

abstractqqq commented Nov 30, 2024

You don't need to separate X from y in polars_ds pipelines; everything comes from a single df. Split df into df_train and df_test, and from those you can manually create df_train_x, df_train_y, df_test_x, and df_test_y, as in the sketch below.
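
For example, continuing the split-and-fit sketch above (the model and target name are illustrative):

import xgboost as xgb  # any sklearn-style estimator would do here

target = "approved"  # target column from the earlier example

# Derive X / y by hand from the transformed frames; the pds pipeline
# itself always operates on whole DataFrames.
X_train = df_train_t.drop(target)
y_train = df_train_t[target]
X_test = df_test_t.drop(target)
y_test = df_test_t[target]

# Assumes the transformed frames are fully numeric at this point.
model = xgb.XGBClassifier()
model.fit(X_train.to_numpy(), y_train.to_numpy())
preds = model.predict(X_test.to_numpy())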

By the way, there is also a split_and_sample module, which provides a random splitting function.

I might add the option to return two frames, df_x and df_y, instead of just df.

@PolarsDude
Author

Is there any support for including an ML model in the pds pipeline, for example xgboost? In an sklearn pipeline you could write:

pipeline = Pipeline(steps=[("pds_pipeline", CustomPDSTransformer()), ("model", xgb.XGBClassifier())])

and for training: pipeline.fit(X_train, y_train)

I am not sure how to write this as a pure pds pipeline (without including your first example).

@abstractqqq
Owner

My intention is not to include models in the pipeline.

The reason is that it's important to track the raw features and the transformed features before model training.

Also, if you are hyperparameter tuning, it's better to work with only the transformed data and the model. This separation helps you keep track of things. (In sklearn, you get this by using the cache option in a Pipeline, but I think that's an extra configuration option that shouldn't be needed in the first place.)

If you use a sklearn pipeline, and if you don't set the cache option, the entire pipeline will run in each step during the hyperparameter tuning.

And for scoring: you know what your model is, so you will score your own model. I don't think it's the pipeline's job to provide scoring functionality. That would only make the object too complex.

Now, in terms of saving the pipeline and the model: it is indeed more annoying to save them separately. However, this is actually good practice.

For example, xgboost provides its own save function, which works across many versions of xgb and even across programming languages. Once you put xgboost into an sklearn pipeline, you lose the ability to use that. You are forced to use pickle (or joblib's pickle or something similar), which could pose a security issue because pickles are inherently unsafe. In my experience, many companies do use pickles despite the potential security issue.

It's a tradeoff. Data scientists typically don't care too much about the issues I mentioned above. But the PDS pipeline is built with strong opinions: it should be simple and independent of the model.
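
To make the portability point concrete, a small sketch (the file name is illustrative):

import xgboost as xgb

# Continuing from a fitted xgboost model (e.g. `model` from the earlier
# sketch): save_model writes a portable JSON artifact, no pickle involved.
model.save_model("model.json")

loaded = xgb.XGBClassifier()
loaded.load_model("model.json")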

That said, if you really, really need a container that saves the PDS pipeline and the model together, you can simply create a dataclass with a pipeline field and a model field. That way you still get the separation of duties, and you can pickle them as a single object.
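
A minimal sketch of that container (the names are illustrative, not part of polars_ds):

from dataclasses import dataclass
from typing import Any

@dataclass
class ModelBundle:
    pipeline: Any  # a materialized pds pipeline
    model: Any     # e.g. a fitted xgboost model

    def predict(self, df):
        # Transform first, then score: the duties stay separate.
        return self.model.predict(self.pipeline.transform(df).to_numpy())

bundle = ModelBundle(pipeline=pipe, model=model)  # picklable as one object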

@PolarsDude
Author

Alright. Thanks =)

@abstractqqq
Owner

Just curious: why do you want to do it with pure PDS? There must be a good reason. Is it because you want native Polars execution? I think sometimes I hold strong opinions and forget what users really want.

@PolarsDude
Author

Yes, it would be cool to have native Polars execution. Also, I am used to implementing sklearn pipelines containing both data transformation and scoring, but I see your point about separating the two.
