Pipeline in sklearn #293
Hi!
Is it possible to use the polars_ds pipeline framework inside a sklearn pipeline? Do you have any examples of this?
It is possible, though not recommended. The benefit of wrapping it in a sklearn Pipeline is that you get some UI for free, but you will have to do extra work to make the UI informative. The disadvantages are that:
Maybe one day I will add UI to the polars_ds pipeline as well, but it is not really the priority now. If there is a transform that you really want, please open a feature request. Thank you!

You can find the dataset used in the example in the ../examples/ folder on GitHub.

```python
import polars as pl
import polars.selectors as cs
import polars_ds.pipeline as pds_pipe

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

class CustomPDSTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.pipe = None

    def fit(self, df, y=None):
        # Specify all the rules for the transform here
        bp = (
            pds_pipe.Blueprint(df, name="example", target="approved")
            .lowercase()
            .filter(
                "city_category is not null"  # or equivalently: pl.col("city_category").is_not_null()
            )
            .select(cs.numeric() | cs.by_name(["gender", "employer_category1", "city_category", "test_col"]))
            .linear_impute(features=["var1", "existing_emi"], target="loan_period")
            .impute(["existing_emi"], method="median")
        )
        self.pipe = bp.materialize()
        return self

    def transform(self, df, y=None):
        return self.pipe.transform(df)

# ---------------------------------------------------------------
df = pl.read_parquet("../examples/dependency.parquet")
df.head()

pipe = Pipeline(
    steps=[
        ("CustomPDSTransformer", CustomPDSTransformer())
    ]
)
df_transformed = pipe.fit_transform(df)
df_transformed
```
Thanks! The only reason I would use a sklearn pipeline is that it can avoid data leakage between training and test data. Can the polars_ds pipeline handle data leakage?
Leakage is handled by splitting the df into train and test, and only fitting the pipeline on train. That should be how you avoid leaking. Is there anything else the sklearn pipeline does to prevent leaking? A minimal sketch of the split-then-fit pattern follows.
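This sketch reuses `Blueprint`, `materialize`, and `transform` from the example above; the 80/20 split and the row_id bookkeeping are illustrative choices, not part of the polars_ds API:

```python
import polars as pl
import polars_ds.pipeline as pds_pipe

# Split first: an illustrative 80/20 random split (the split_and_sample
# module mentioned below also provides a random splitting function).
df = df.with_row_index("row_id")
df_train = df.sample(fraction=0.8, seed=42)
df_test = df.filter(~pl.col("row_id").is_in(df_train["row_id"])).drop("row_id")
df_train = df_train.drop("row_id")

# Build and materialize the Blueprint on df_train only, so every fitted
# statistic (e.g. the median used for imputation) comes from train.
bp = (
    pds_pipe.Blueprint(df_train, name="example", target="approved")
    .impute(["existing_emi"], method="median")
)
pipe = bp.materialize()

# Apply the train-fitted transforms to both frames; nothing from df_test
# influenced the fit, which is what prevents leakage.
df_train_t = pipe.transform(df_train)
df_test_t = pipe.transform(df_test)
```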
Thanks. One last question. Assume that the data has been split into test and training: X_train, y_train, X_test, and y_test. We also have our Polars pipeline named bp. How would one use the Polars pipeline in combination with ML training/prediction in sklearn? Would it look something like this?

-- Train pipeline
-- for training
-- For prediction
You don't need to separate X from y in polars_ds pipelines. They should all come from df. Split df into df_train and df_test. Then you can manually create df_train_x and df_train_y, and df_test_x and df_test_y, as sketched below. Btw, there is also a split_and_sample module, which provides a random splitting function. I might add the option to return two frames, df_x and df_y, instead of just df.
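Put together, the flow described above might look like this; the model choice and the to_numpy() hand-off are illustrative assumptions, while Blueprint, materialize, and transform come from the example at the top:

```python
from sklearn.linear_model import LogisticRegression

# Fit the pds pipeline on the training frame only (as described above).
pipe = pds_pipe.Blueprint(df_train, name="example", target="approved").materialize()

train_t = pipe.transform(df_train)
test_t = pipe.transform(df_test)

# Manually carve X and y out of each transformed frame.
X_train, y_train = train_t.drop("approved"), train_t["approved"]
X_test, y_test = test_t.drop("approved"), test_t["approved"]

# Train and predict with any sklearn-style model.
model = LogisticRegression()
model.fit(X_train.to_numpy(), y_train.to_numpy())
preds = model.predict(X_test.to_numpy())
```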
Is there any support for including an ML model in the pds pipeline, for example xgboost? In a sklearn pipeline you could write something like the snippet below. I am not sure how to write this as a pure pds pipeline (without including your first example).
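The sklearn version alluded to here is presumably something like this; XGBClassifier and the step names are illustrative assumptions, reusing the CustomPDSTransformer wrapper from the first example:

```python
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

sk_pipe = Pipeline(
    steps=[
        ("transform", CustomPDSTransformer()),  # wrapper from the first example
        ("model", XGBClassifier()),             # model as the final pipeline step
    ]
)
```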
My intention is not to include models in the pipeline. The reason is that it's important to track the raw features and the transformed features before model training. Also, if you are hyperparameter tuning, it's better to use only the transformed data and the model. This separation helps you keep track of things. (In sklearn, you do this by using the cache option in a pipeline, but I think that's an extra configuration option which shouldn't be there in the first place.) If you use a sklearn pipeline and don't set the cache option, the entire pipeline will run in each step during hyperparameter tuning.

And for scoring, you know what your model is, so you will score your own model. I don't think it's the pipeline's job to provide scoring functionality. That would only make the object too complex.

Now, in terms of saving the pipeline and the model: it is indeed more annoying to save them separately. However, this is actually good practice. For example, xgboost provides its own save function, which can be used across many versions of xgb, and even across programming languages. Once you put xgboost into a sklearn pipeline, you lose the ability to use that. You are forced to use pickle (or joblib's pickle or something similar), which could pose a security issue because pickles are inherently unsafe. In my experience, many companies do use pickles despite the potential security issue. It's a tradeoff. Data scientists typically don't care too much about the issues I mentioned above. But the PDS pipeline is built with strong opinions: it should be simple and independent of the model.

That said, if you really, really need a data container that saves the PDS pipeline and the model together, you can simply create a dataclass that has a pipeline field and a model field, as sketched below. That way you still get the separation of duties, and you can pickle them as a single object.
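A minimal sketch of that dataclass idea; the class and field names are illustrative, not part of polars_ds:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class PipelineAndModel:
    pipeline: Any  # a materialized pds pipeline
    model: Any     # e.g. a fitted xgboost model

    def predict(self, df):
        # Transform with the pds pipeline, then score with the model;
        # to_numpy() is one way to hand polars data to the model.
        return self.model.predict(self.pipeline.transform(df).to_numpy())
```

The bundle can be pickled as a single object, while the model inside can still be saved separately with its native save function when needed.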
Alright. Thanks =)
Just curious: why do you want to do it with pure PDS? There must be a good reason. Is it because you want native Polars execution? I think sometimes I get strong opinions and forget what users really want.
Yes, native Polars execution would be cool. Also, I am used to implementing sklearn pipelines that contain both data transformation and scoring. But I see your point regarding the separation of these two.