potential to specify time series splitter #23

almostintuitive · 2023-05-24T12:13:53Z

Hi! Thank you for the super useful library!
We'd love to use it on time series tasks, but for that, we'd need the internal splitter to respect the temporal dimension of the data.
We're happy to contribute a PR that makes this possible.
Is there anything specific about the public API that we should keep in mind for our PR to be accepted?

Thank you!

ThomasBury · 2023-05-25T12:20:51Z

Dear @almostintuitive, thank you for your kind words. I'm glad to hear that you find it valuable.

Regarding GrootCV, the intended usage of the cross-validation scheme is indeed for tabular data. While it is possible to add additional columns (lags) to the data, as you pointed out, the RepeatedKFold approach will not work in that case.
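
To illustrate why a shuffled K-fold scheme is a problem once the rows carry a time order, here is a minimal standard-library sketch. It is not ARFS code: `shuffled_kfold` and `leaks_future` are hypothetical helpers written only for this illustration, and row indices stand in for timestamps.

```python
import random


def shuffled_kfold(n_samples, n_splits, seed=0):
    """Yield (train, test) index lists from a shuffled K-fold that ignores time order."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for k in range(n_splits):
        test = sorted(indices[k::n_splits])
        test_set = set(test)
        train = [i for i in range(n_samples) if i not in test_set]
        yield train, test


def leaks_future(train, test):
    """Return True if any training row comes later than the earliest test row."""
    return max(train) > min(test)


folds = list(shuffled_kfold(n_samples=12, n_splits=3))
n_leaky = sum(leaks_future(train, test) for train, test in folds)
print(f"{n_leaky} of {len(folds)} shuffled folds train on rows later than a test row")
```

Because the test sets partition the data, at most one fold can have its whole test set after its training set, so most folds fit on "future" rows. With lag features that means the model sees information it would not have at prediction time.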

However, if you can devise a way to modularize the fold generator, specifically the code snippet you referred to here, the remaining code should remain unaffected.
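
As a rough sketch of what "modularizing the fold generator" could look like, the loop below takes any object exposing a scikit-learn-style `split(X)` method. Everything here is an assumption for illustration: `ExpandingWindowSplit`, `run_cv` and the `evaluate` callback are hypothetical names, not part of ARFS.

```python
class ExpandingWindowSplit:
    """Hypothetical time-aware splitter: train on the past, test on the next block."""

    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X):
        n_samples = len(X)
        test_size = n_samples // (self.n_splits + 1)
        if test_size == 0:
            raise ValueError("not enough samples for the requested number of splits")
        first_test = n_samples - self.n_splits * test_size
        for start in range(first_test, n_samples, test_size):
            yield list(range(start)), list(range(start, start + test_size))


def run_cv(X, splitter, evaluate):
    """Hypothetical CV loop: the splitter alone decides how rows are partitioned."""
    return [evaluate(train_idx, test_idx) for train_idx, test_idx in splitter.split(X)]


X = list(range(10))
fold_sizes = run_cv(X, ExpandingWindowSplit(n_splits=4), lambda tr, te: (len(tr), len(te)))
print(fold_sizes)
```

With this shape, swapping in a different scheme only means passing a different splitter object, and the rest of the loop stays untouched.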

In terms of coding standards, I strive to adhere to PEP and scikit-learn conventions, and to leverage existing generators and objects. Regarding the preprocessing, I have duplicated (and slightly improved) some transformers from Feature-Engine, because I needed sample weights. Although ARFS may not yet match that level of quality, I try to align with those standards.

At the moment, I don't have comprehensive unit testing in place. However, I do provide notebooks (NB) for ad-hoc testing, which serve as integration tests for new features. I re-run them to check that the new features don't break existing ones, and to update the documentation and tutorials at the same time.

If that sounds reasonable, please feel free to submit a PR (or several). It would be best if the PR is kept simple and focused, with smaller changes that are easier to track. This approach will be especially helpful since I mainly work on this package during nighttime/evening. Thank you for your understanding and for contributing!

Jnorm911 · 2023-07-13T22:10:57Z

Bump

CMobley7 · 2023-08-25T21:34:00Z

Optuna's implementation, https://optuna.readthedocs.io/en/stable/reference/generated/optuna.integration.lightgbm.LightGBMTunerCV.html, lets you pass in your own folds, for example:

from sklearn.model_selection import TimeSeriesSplit

# cv and X_train come from the caller; fall back to 5 splits when cv is None
tscv = TimeSeriesSplit(n_splits=cv if cv is not None else 5)
folds = list(tscv.split(X_train))

mlxtend's SFS, https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/#sequentialfeatureselector, instead lets you pass a TimeSeriesSplit directly through the cv argument. Optuna's approach would probably be a quick addition, while mlxtend's would likely take longer. I'm not sure, though.
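
For reference, here is a small self-contained sketch of the two styles using only scikit-learn and NumPy. The synthetic data and the `Ridge` estimator are arbitrary choices for the example, and `cross_val_score` merely stands in for any API that takes a `cv` argument. It says nothing about how ARFS itself exposes this.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered data (rows are assumed to be sorted by time)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

tscv = TimeSeriesSplit(n_splits=5)

# Style 1: materialize explicit (train_idx, test_idx) folds
folds = list(tscv.split(X))

# Style 2: hand the splitter object to a `cv` argument
scores_from_splitter = cross_val_score(Ridge(), X, y, cv=tscv)

# scikit-learn's `cv` also accepts the pre-computed list of folds
scores_from_folds = cross_val_score(Ridge(), X, y, cv=folds)

for train_idx, test_idx in folds:
    print(f"train: 0..{train_idx.max()}  test: {test_idx.min()}..{test_idx.max()}")
```

In every fold the training indices end before the test indices begin, which is the property a shuffled K-fold does not guarantee.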

notuntoward · 2024-05-13T01:11:48Z

Tabular methods are very commonly used for time series forecasting -- boosted trees are quite often the best time series forecasting methods available. So it would be very useful if ARFS had a way to specify causal cross-validation.

As @CMobley7 mentioned, Optuna allows causal cross-validation, and so does scikit-optimize, the tuner I use with boosted trees. For example, skopt.BayesSearchCV accepts a standard sklearn TimeSeriesSplit object.

It would be great if ARFS did too.

ThomasBury · 2024-05-22T16:51:28Z

Hello, I finally had time to add support for a user-defined splitter in commit 3771d49.

notuntoward · 2024-05-23T17:44:08Z

Thanks, should we be seeing this change on the website docs yet, or in pip?

ThomasBury · 2024-05-23T18:38:30Z

> Thanks, should we be seeing this change on the website docs yet, or in pip?

Hi @notuntoward, both:

documentation: https://arfs.readthedocs.io/en/latest/notebooks/arfs_timeseries.html
release (ARFS 2.3.0): https://pypi.org/project/arfs/

ThomasBury added the "question" label (Further information is requested) May 25, 2023

ThomasBury self-assigned this May 25, 2023

ThomasBury closed this as completed May 22, 2024
