Add features to the DatetimeEncoder (#907)
Thanks for opening this! Both holidays and temporal patterns sound useful.

- holidays: As you say, there is the problem that we would need another dependency to get the holidays. However, maybe at first we could have users provide their own holidays, similarly to what is done in polars.
- seasonal patterns: in your experience, are those features mostly useful for linear models, or do they also improve the performance of gradient boosting? And what dimensionality do you think is typically useful?

Another addition I would like to see for the DatetimeEncoder is the option to output some of its current features as Categorical dtypes, or as one-hot encoded categories. In particular we may want to encode the day of the week, and probably the month, as categories rather than as floating-point numbers (as is done at the moment).

Also note that important changes are being made to the DatetimeEncoder in #902 (among others, adding support for polars and making it accept a single column rather than a dataframe). Would the first version you have in mind take the form of changes to the DatetimeEncoder, or a stand-alone prototype?
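The categorical/one-hot idea mentioned above can be sketched outside the DatetimeEncoder with plain pandas (a hypothetical manual recipe, not current skrub behaviour — the column names are illustrative):

```python
import pandas as pd

# Seven consecutive days starting on a Monday, so every weekday appears once.
df = pd.DataFrame({"datetime": pd.date_range("2024-01-01", periods=7, freq="D")})

# Day of the week as a pandas Categorical rather than a float.
weekday = df["datetime"].dt.day_name().astype("category")

# Or as one-hot indicator columns, one per observed weekday.
one_hot = pd.get_dummies(weekday, prefix="weekday")
```

A gradient-boosting model can consume the categorical column directly, while a linear model would typically want the one-hot variant.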
I can't recall a public benchmark that I can share, but I have heard many anecdotes of folks using this technique after I presented it at a PyData many years ago. I can't imagine why it wouldn't benefit an ensemble technique, but there is some wiggle room here due to the
That depends on the preference of folks. The quickest way would be for me to build something solo and maybe run a few benchmarks to confirm that it works for non-linear models as well. If we prefer a benchmark before doing the proper implementation, this could be a reasonable avenue to explore. Open to suggestions tho!
I didn't have in mind the ensemble aspect, but rather the fact that non-linear models might be able to cope better with the raw features by themselves. For example, given the hour, a linear model would need the splines or some other feature engineering to separate out the lunch break period, but a non-linear model such as gradient boosting could do it from the original feature. (They can be a good addition to the DatetimeEncoder in any case, I was just wondering if you had insights about the settings where these features are most often used.)
Ah, good that you point that out, it's a subtle difference. I guess even with a non-linear model the featurization technique can also be seen as a way to steer the model. Kind of in a 'you can ignore these features, but they may be really helpful in getting a good fit' kind of way. I guess another benefit to mention is that the spline-y features are smoother, so maybe fewer step functions in the output and smoother predictions instead. A lot of this would still depend on the hyperparameters tho. If there are no extra concerns I'll try to find some time to run some benchmarks. I think my Kaggle datasets have a few examples where this might be relevant.
> Ah good that you point that out, it's a subtle difference. I guess even with a non-linear model the featurization technique can also be seen as a way to steer the model. Kind of as a 'you can ignore these features, but it may be really helpful in getting good fit'-kind of way. I guess another benefit to mention is that the spline-y features are more smooth. So maybe less step functions in the output and more smooth predictions instead.
That's a good point.
> A lot of this would still depend on the hyperparameters tho. If there are no extra concerns I'll try to figure out some time to run some benchmarks. I think my Kaggle datasets have a few examples in them where this might be relevant.
That's great, it will be super useful to get a sense of the settings where this boosts prediction and the accuracy vs time & memory tradeoffs. One more thing to look out for is that we don't support sparse data (because polars doesn't and most likely never will), and I guess depending on the chosen hyperparameters the dimensionality of the spline features can get really high.
The kaggle datasets sound great; another one that I was thinking could be useful for the example gallery is the bike rental one used in scikit-learn examples:
https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html
I was thinking we could use it to rewrite our current datetime encoder
example:
https://skrub-data.org/stable/auto_examples/03_datetime_encoder.html#sphx-glr-auto-examples-03-datetime-encoder-py
which at the moment uses a dataset where the different datetime features don't bring a lot of information.
I have some results. I have a timeseries task with these contents in

When I run this with a bunch of base settings, I see these CV results (the table is wide, so you may need to zoom in). There are different algorithms (XGBoost, LightGBM, sklearn hist boost, and ridge) and different featurization settings (TableVectorizer, TableVectorizer that drops an id column, and TableVectorizer with the seasonal date feature). Across all the algorithms it seems that adding the seasonal feature improves things. The improvement may not be incredibly substantial, but it does seem consistent.
There is another one of these datasets that seems to show similar results. I want to add one caveat here because these datasets are synthetic: Kaggle reports that they are based on actual datasets, but this benchmark is based on simulated data in the end. That said, the improvement again seems to be consistent.
We discussed it this morning during the skrub meeting (you're welcome to join whenever you want by the way, it's every Monday 10:30 to 11:00 in Europe/Paris, if you're interested I'll send you the link). I think there is a consensus that skrub should provide the "seasonal patterns" you describe. However @GaelVaroquaux raised the point that splines can be a bit tricky to parametrize and we were wondering: in your experience are sine/cosine transforms easier to work with and how do they perform?
In this example the sine features seem to perform worse than the splines or than a simple one-hot encoding of the hour.
The splines aren't perfect for sure, but they've thus far always seemed simple enough, and also pragmatic in the sense that they're a simple thing to reason about. I do recall that regularisation on the model that follows can be very good tho. I have never really tried the sine features because the spline trick always kind of worked pretty well for the season. I guess a good next question is ... how might we want to implement this?
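For reference, the sine/cosine alternative discussed here is only a couple of lines: each cycle maps to exactly two features, with no knots to tune (an illustrative sketch, not an existing skrub API):

```python
import numpy as np

def sin_cos_features(values, period):
    """Map a cyclic value (e.g. hour of day) onto the unit circle."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

hour_feats = sin_cos_features([0, 6, 12, 23], period=24)
# Hours 0 and 23 land close together on the circle,
# unlike with a raw integer hour feature.
```

The flip side, as noted above, is that a single sine/cosine pair struggles with narrow patterns such as a lunch break, which would need many harmonics to represent.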
> The splines aren't perfect for sure, but they've thus far always seemed simple enough and also pragmatic in the sense that it's a simple thing to reason about. I do recall that regularisation on the model that follows can be very good tho.
I just worry that they are not a simple 2-liner implementation. Rather, it's hiding something with a lot of subtleties and corresponding hyperparameters in the DateTimeEncoder. I don't like that.
That's a fair concern, but I am not sure what the user might expect besides "sensible defaults". The most general seasonal pattern feels like it might be to do 'something something monthly', so maybe setting

Part of me worries that there may not be something simpler to configure than those.
I agree, holiday/weekend are a different issue -- they're just a 1D indicator that says whether each time point falls during a holiday. So let's discuss them in #710 instead, and focus on the splines/cyclical features here.
Discussing a bit with @ogrisel and @glemaitre, we were thinking that for most things that are likely to be relevant, the shape of splines that are flat with a peak will capture them more easily than sines. For example, "lunch break" can be nicely captured by one spline with a width of roughly 1 h, whereas its representation in the frequency domain will have many coefficients. So with splines we may get away with a smaller dimension, and have more interpretable models and defaults that are easier to set.
I also wonder if the current interface of the DatetimeEncoder is suitable for adding those features, or if the parameters should be in terms of "which cycles to represent" rather than the current "resolution".
That's a good point. For my own tools so far I've often resorted to an API similar to:

```python
make_union(
    SeasonalFeaturizer(date_col="datetime", kind="hour_per_day", knots=24),
    SeasonalFeaturizer(date_col="date", kind="day_of_year", knots=12),
)
```

Something about doing multiple of 'em feels nice when you're doing things manually ... but there might be something that we can infer if the dataframe going in gives us a datetime vs. a date?
This feature was also requested in #1127 (comment).
Problem Description

Skrub currently encodes some features in the datetime encoder, but there are a few that feel missing. It feels like these might be great candidates to consider for our DatetimeEncoder.

Feature Description
The seasonal patterns can both be generated using the SplineTransformer in scikit-learn under the hood, using the periodic setting. There's a demo of this technique here. In one case we'd model the features over the ordinal day of the year, while in the other case we'd use the time of day.

The holidays might be a bit trickier because we'd need to rely on a 3rd-party library to capture all of them. Then again, polars does support some business day features, so we might just be able to leverage something there.
Alternative Solutions
I'm planning on making a first version of such a component so that it becomes easier to shoot at it. I can imagine that maybe we don't want to support all of these features but maybe just a subset. It can also be the case that business/holiday features should go into another estimator.
Additional Context
No response