
Pipeline in sklearn #293

Open
PolarsDude opened this issue Nov 26, 2024 · 10 comments

@PolarsDude

Hi!

Is it possible to use the polars_ds pipeline framework inside an sklearn Pipeline? Do you have any examples of this?

@abstractqqq
Owner

abstractqqq commented Nov 26, 2024

It is possible, though not recommended.

The benefit of wrapping it in an sklearn Pipeline is that you get some UI for free, but you will have to do extra work to make the UI informative.

The disadvantages are that:

  1. You add one more layer of abstraction.
  2. You lose the flexibility and concise syntax that polars_ds provides.
  3. You lose the ability to serialize the pipeline as json / dict.

Maybe one day I will add a UI to the polars_ds pipeline as well, but it is not really a priority right now.

If there is a transform that you really want, please open a feature request. Thank you! You can find the dataset used in the example in the ../examples/ folder on GitHub.

import polars as pl
import polars.selectors as cs
import polars_ds.pipeline as pds_pipe
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class CustomPDSTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        self.pipe = None

    def fit(self, df, y=None):
        # specify all the rules for the transform here
        bp = (
            pds_pipe.Blueprint(df, name = "example", target = "approved") 
            .lowercase() 
            .filter( 
                "city_category is not null" # or equivalently, you can do: pl.col("city_category").is_not_null()
            )
            .select(cs.numeric() | cs.by_name(["gender", "employer_category1", "city_category", "test_col"]))
            .linear_impute(features = ["var1", "existing_emi"], target = "loan_period") 
            .impute(["existing_emi"], method = "median")
        )
        self.pipe = bp.materialize()
        return self

    def transform(self, df, y=None):
        return self.pipe.transform(df)

# ---------------------------------------------------------------

df = pl.read_parquet("../examples/dependency.parquet")
df.head()

pipe = Pipeline(
    steps=[
        ("CustomPDSTransformer", CustomPDSTransformer())    
    ]
)
df_transformed = pipe.fit_transform(df)
df_transformed

@PolarsDude
Author

Thanks! The only reason I would use an sklearn pipeline is that it avoids data leakage between training and test data. Can the polars_ds pipeline handle data leakage?

@abstractqqq
Owner

Leakage is handled by splitting the df into train and test and fitting the pipeline only on train. That is how you avoid leaking. Is there anything else the sklearn pipeline does to prevent leaking?
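
For concreteness, here is a minimal sketch of that pattern, reusing the Blueprint calls from the earlier example (the split fraction and seed are illustrative):

import polars as pl
import polars_ds.pipeline as pds_pipe

df = pl.read_parquet("../examples/dependency.parquet")

# A random 80/20 split in plain Polars (polars_ds also ships splitting
# helpers, mentioned later in this thread).
df = df.with_row_index("row_nr")
df_train = df.sample(fraction=0.8, seed=42)
df_test = df.filter(~pl.col("row_nr").is_in(df_train["row_nr"]))
df_train, df_test = df_train.drop("row_nr"), df_test.drop("row_nr")

# Materialize the blueprint on train only, then reuse the fitted pipeline
# on test, so test statistics never influence the fitted parameters.
bp = pds_pipe.Blueprint(df_train, name="example", target="approved").impute(
    ["existing_emi"], method="median"
)
pipe = bp.materialize()
df_train_t = pipe.transform(df_train)
df_test_t = pipe.transform(df_test)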

@PolarsDude
Author

PolarsDude commented Nov 28, 2024

Thanks. One last question. Assume the data has been split into training and test sets: X_train, y_train, X_test, and y_test. We also have our polars_ds pipeline named bp. How would one use the polars_ds pipeline in combination with ML training/prediction in sklearn? Would it look something like this?

# Fit the pipeline
X_train_bp = bp.fit(X_train)

# For training
model.fit(X_train_bp, y_train)

# For prediction
model.predict(bp.transform(X_test))

@abstractqqq
Owner

abstractqqq commented Nov 30, 2024

You don't need to separate X from y in polars_ds pipelines; everything comes from a single df. Split df into df_train and df_test, and from those you can manually create df_train_x, df_train_y, df_test_x, and df_test_y, as in the sketch below.
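
For example, continuing the split-and-fit sketch above (the model and target name are illustrative):

import xgboost as xgb  # any sklearn-style estimator would do here

target = "approved"  # target column from the earlier example

# Derive X / y by hand from the transformed frames; the pds pipeline
# itself always operates on whole DataFrames.
X_train = df_train_t.drop(target)
y_train = df_train_t[target]
X_test = df_test_t.drop(target)
y_test = df_test_t[target]

# Assumes the transformed frames are fully numeric at this point.
model = xgb.XGBClassifier()
model.fit(X_train.to_numpy(), y_train.to_numpy())
preds = model.predict(X_test.to_numpy())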

By the way, there is also a split_and_sample module, which provides a random splitting function.

I might add the option to return two frames, df_x and df_y, instead of just df.

@PolarsDude
Author

Is there any support for including an ML model in the pds pipeline, for example xgboost? In an sklearn pipeline you could write:

pipeline = Pipeline(steps=[("pds_pipeline", CustomPDSTransformer()), ("model", xgb.XGBClassifier())])

and for training: pipeline.fit(X_train, y_train)

I am not sure how to write this as a pure pds pipeline (without including your first example).

@abstractqqq
Owner

My intention is not to include models in the pipeline.

The reason is that it's important to track the raw features and the transformed features before model training.

Also, if you are hyperparameter tuning, it's better to work with only the transformed data and the model. This separation helps you keep track of things. (In sklearn, you get this by using the cache option in a Pipeline, but I think that's an extra configuration option that shouldn't be needed in the first place.)

If you use a sklearn pipeline, and if you don't set the cache option, the entire pipeline will run in each step during the hyperparameter tuning.

And for scoring: you know what your model is, so you will score your own model. I don't think it's the pipeline's job to provide scoring functionality. That would only make the object too complex.

Now, in terms of saving the pipeline and the model: it is indeed more annoying to save them separately. However, this is actually good practice.

For example, xgboost provides its own save function, which works across many versions of xgb and even across programming languages. Once you put xgboost into an sklearn pipeline, you lose the ability to use that. You are forced to use pickle (or joblib's pickle or something similar), which could pose a security issue because pickles are inherently unsafe. In my experience, many companies do use pickles despite the potential security issue.

It's a tradeoff. Data scientists typically don't care too much about the issues I mentioned above. But the PDS pipeline is built with strong opinions: it should be simple and independent of the model.
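
To make the portability point concrete, a small sketch (the file name is illustrative):

import xgboost as xgb

# Continuing from a fitted xgboost model (e.g. `model` from the earlier
# sketch): save_model writes a portable JSON artifact, no pickle involved.
model.save_model("model.json")

loaded = xgb.XGBClassifier()
loaded.load_model("model.json")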

That said, if you really, really need a container that saves the PDS pipeline and the model together, you can simply create a dataclass with a pipeline field and a model field. That way you still get the separation of duties, and you can pickle them as a single object.
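
A minimal sketch of that container (the names are illustrative, not part of polars_ds):

from dataclasses import dataclass
from typing import Any

@dataclass
class ModelBundle:
    pipeline: Any  # a materialized pds pipeline
    model: Any     # e.g. a fitted xgboost model

    def predict(self, df):
        # Transform first, then score: the duties stay separate.
        return self.model.predict(self.pipeline.transform(df).to_numpy())

bundle = ModelBundle(pipeline=pipe, model=model)  # picklable as one object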

@PolarsDude
Author

Alright. Thanks =)

@abstractqqq
Owner

Just curious: why do you want to do it with pure PDS? There must be a good reason. Is it because you want native Polars execution? I think sometimes I hold strong opinions and forget what users really want.

@PolarsDude
Author

Yes, it would be cool to have native Polars execution. Also, I am used to implementing sklearn pipelines containing both data transformation and scoring, but I see your point about separating the two.
