Develop player projection system #2

ak-gupta opened this issue Feb 21, 2021 · 4 comments

Once I've established the SPA ratings per game, I'd like to build out a player projection system. At a base level, we have a hierarchical time-series forecasting problem. However, I think applying hierarchical clustering to the time series can help define the groups within the player-level data (perhaps by position, play style, etc.) before looking at optimal reconciliation. Open questions:

  • Unified time labels: we'll have players with different ages/experience. How should we align them?
  • SPA transformations: should we transform the raw SPA ratings?
ak-gupta commented Nov 26, 2021

After thinking about this some more -- the problem with hierarchical time-series modelling is that you need a model for every time series at the lowest level (i.e. one model per player). This is... not ideal. I could

  • create a larger model and use the James-Stein encoder to encode the player identifier. This way, we have one large projection model. I'd investigate
    • simple regression with age and player ID predicting impact (xgboost as well as elastic net and spline models), and
    • vector autoregression (VAR) models with age and player ID predicting impact.
  • use unsupervised clustering with dynamic time warping (DTW) and barycenter averaging to create clusters of players based on their current career arc (see the sketch after this list). Then, I can investigate cluster-specific projection models with no contrast encoding.
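For the clustering route, here's a minimal sketch assuming the tslearn library (my choice for illustration; no library is named above): its `TimeSeriesKMeans` with `metric="dtw"` computes cluster centers via DTW barycenter averaging. The file name and column names are placeholders:

```python
import pandas as pd
from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset

# Hypothetical input: one row per (player, season) with an SPA rating.
df = pd.read_csv("spa_ratings.csv")  # columns: player_id, season, spa

# One (possibly ragged) series per player, ordered by season.
player_ids = [pid for pid, _ in df.groupby("player_id")]
series = [
    grp.sort_values("season")["spa"].to_numpy()
    for _, grp in df.groupby("player_id")
]
X = to_time_series_dataset(series)  # pads ragged series with NaN

# k-means under the DTW metric; centroids come from barycenter averaging.
km = TimeSeriesKMeans(n_clusters=4, metric="dtw", random_state=0)
labels = km.fit_predict(X)
clusters = pd.Series(labels, index=player_ids)
```

Each resulting cluster would then get its own projection model, with no player encoding needed.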

One large model will present some challenges with train/test splitting. Contrast encoders like James-Stein use knowledge about the target to create an ordering between the categorical levels (the players). Ideally, you would fit your encoder on a training set and then only transform your test set. For multiple time-series projection, this means we can't exclude players from the training set at all, only specific observations. We would have to implement cross-validation similar to this article by Hyndman; each fold would contain a successively larger training set (we could use this splitter from scikit-learn). A sketch of the pattern follows.
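Here is a minimal sketch of that fold structure, assuming the splitter above is scikit-learn's `TimeSeriesSplit` (the link target isn't visible here, so that's a guess) and using `category_encoders.JamesSteinEncoder`. Splitting on season keeps every player in each training fold while holding out later observations:

```python
import pandas as pd
from category_encoders import JamesSteinEncoder
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor

df = pd.read_csv("spa_ratings.csv")  # hypothetical: player_id, season, age, spa
seasons = sorted(df["season"].unique())

# Each fold trains on a successively larger prefix of seasons.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(seasons):
    train = df[df["season"].isin({seasons[i] for i in train_idx})]
    test = df[df["season"].isin({seasons[i] for i in test_idx})]

    # Fit the contrast encoder on the training fold only, to avoid leakage.
    encoder = JamesSteinEncoder(cols=["player_id"])
    X_train = encoder.fit_transform(train[["player_id", "age"]], train["spa"])
    X_test = encoder.transform(test[["player_id", "age"]])

    model = XGBRegressor().fit(X_train, train["spa"])
    print(model.score(X_test, test["spa"]))  # R^2 on the held-out seasons
```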

Building several smaller models based on player clusters would solve this problem, since we wouldn't encode any player identifier. In this case, we can combine Hyndman's rolling-window cross-validation with a "holdout" strategy, where each fold contains multiple iterations. For example, if we have 5 players in the pool,

  • Fold 1 trains on players 1-4; tests on player 5
    • Iteration 1 trains using 3 seasons of data, projects to season 6
    • Iteration 2 trains using 4 seasons of data, projects to season 7
    • Iteration 3 trains using 5 seasons of data, projects to season 8
    • Iteration 4 trains using 6 seasons of data, projects to season 9
    • ...
  • Fold 2 trains on players 1-3, 5; tests on player 4
  • Fold 3 trains on players 1, 2, 4, 5; tests on player 3
  • ...

This iterative approach tests not only how well the model generalizes across players but also how well it forecasts as more data accumulates -- we can use scikit-learn's RFE model selection methodology for inspiration on how to handle the iterative nature. A sketch of the nested scheme follows.
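A minimal sketch of the nested scheme, reusing the hypothetical data frame from above. For simplicity it projects the season immediately after the training window, rather than the specific season offsets listed above:

```python
import pandas as pd

df = pd.read_csv("spa_ratings.csv")  # hypothetical: player_id, season, spa
players = sorted(df["player_id"].unique())
seasons = sorted(df["season"].unique())
MIN_TRAIN = 3  # iteration 1 trains on the first 3 seasons

for holdout in players:  # outer loop: leave one player out per fold
    pool = df[df["player_id"] != holdout]
    target = df[df["player_id"] == holdout]

    # Inner loop: expanding training window, one extra season per iteration.
    for n in range(MIN_TRAIN, len(seasons)):
        train = pool[pool["season"].isin(seasons[:n])]
        test = target[target["season"] == seasons[n]]
        if test.empty:
            continue  # the held-out player has no data for that season
        # ... fit a cluster-specific model on `train`, project `test`,
        # and record the error for this (fold, iteration) pair
```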

ak-gupta commented
> * create a larger model and use the [James-Stein encoder](https://kiwidamien.github.io/james-stein-encoder.html) to encode the player identifier.
>
> ...
> One large model will present some challenges with train/test splitting. Contrast encoders like James-Stein use knowledge about the target to create an ordering between the categorical levels (the players). Ideally, you would fit your encoder on a training set and then only transform your test set. For multiple time-series projection, this means we can't exclude players from the training set at all, only specific observations.

More things to consider with this approach:

  • We'd have to create our own "rolling" James-Stein contrast encoder, since we're analyzing the same players over time. For example, the encoding for a given player at age 30 should be based only on performance before that age (see the sketch after this list).
  • Refitting the encoder -- we would need to refit the encoder when a new player enters the pool. This might not be an issue (maybe we refit the entire model every time we generate projections, so inaccuracies from the previous round of projections aren't carried into the next set).
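Here's a minimal sketch of the rolling encoder idea. It deliberately simplifies the James-Stein weight to a count-based shrinkage factor, and every encoded value uses only observations that come strictly earlier in time; column names are placeholders:

```python
import pandas as pd

def rolling_shrunk_encoding(df, cat_col="player_id", target_col="spa",
                            time_col="season", k=10.0):
    """Leakage-free rolling target encoding: shrink each player's mean of
    *prior* observations toward the global mean of prior observations."""
    df = df.sort_values(time_col)
    # Running global mean of the target, excluding the current row.
    global_mean = df[target_col].expanding().mean().shift(1)
    # Per-player running mean and count of strictly earlier observations.
    grp = df.groupby(cat_col)[target_col]
    prior_mean = grp.transform(lambda s: s.expanding().mean().shift(1))
    prior_count = grp.cumcount()
    # More history -> trust the player's own mean more. This count-based
    # weight stands in for the variance-based James-Stein weight.
    weight = prior_count / (prior_count + k)
    encoded = weight * prior_mean + (1 - weight) * global_mean
    # Players with no history fall back to the global running mean; the
    # very first observation overall stays NaN.
    return encoded.fillna(global_mean)
```

New players are handled automatically (they start at the global mean), which partly addresses the refitting concern in the second bullet.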

ak-gupta commented Nov 27, 2021

Looking at the documentation for target encoding and the James-Stein encoder -- it looks like they both assume the target variable has a normal distribution. It might be worth creating a transformer for beta target encoding (their implementation is here).

NOTE: I should read up on beta target encoding from this paper. There are improvements to this procedure proposed here.

UPDATE

I've read up on the procedure. In the initial paper, you:

  1. Empirically examine the distribution of the target variable,
  2. Choose a conjugate prior based on the observed distribution,
  3. Use the conjugate-prior formulation to parameterize the posterior distribution, and
  4. For each categorical variable,
    • For each level in the variable,
      • Calculate the posterior distribution.
      • Represent the level by Q moments of its posterior distribution (sketched after this list).
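A minimal sketch of the per-level encoding in step 4, assuming a binary target so that the Beta prior is conjugate to the Bernoulli likelihood (the continuous SPA case would need a different conjugate pair), with Q = 2 moments:

```python
import pandas as pd

def beta_encode(train, col, target, alpha0=1.0, beta0=1.0):
    """Encode each level of `col` with the first two moments of its
    Beta posterior, assuming a binary 0/1 target."""
    stats = train.groupby(col)[target].agg(["sum", "count"])
    alpha = alpha0 + stats["sum"]                 # prior + successes
    beta = beta0 + stats["count"] - stats["sum"]  # prior + failures
    return pd.DataFrame({
        "posterior_mean": alpha / (alpha + beta),
        "posterior_var": alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)),
    })  # map these back onto train/test rows by level
```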

In the proposal, steps 1-3 are the same, and then:

  4. For each categorical variable,
    • For each level in the categorical variable,
      • Calculate the posterior distribution.
  5. Generate K copies of the training set, where each categorical variable is encoded by sampling from the posterior distribution for its level.

In this scenario, our final prediction for the target is the average of the predictions from the K submodels (a sketch follows).
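A minimal sketch of the sampling variant, where `alpha` and `beta` are per-level Series of posterior parameters (computed as in the previous sketch):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def sampled_copies(train, col, alpha, beta, k=5):
    """Create K copies of the training set, each replacing `col` with an
    independent draw from every level's Beta posterior."""
    copies = []
    for _ in range(k):
        draws = pd.Series(rng.beta(alpha.to_numpy(), beta.to_numpy()),
                          index=alpha.index)
        copy = train.copy()
        copy[col + "_enc"] = copy[col].map(draws)
        copies.append(copy.drop(columns=[col]))
    return copies

# Fit one submodel per copy; average the K submodels' predictions.
```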
