GKs performance estimation? #1
Comments
Hello @DiTo97, I have not analysed this any further, but the problem with goalkeepers is that there is only one per team (20 teams in total). As a result, the dataset is made up of 20 groups of entries, where the N entries in each group share the same "player stats" and "team stats" parts and differ only in the "opposing team stats" part. My guess is that more attempts are needed to find a better network structure, and more effort should go into selecting fewer, more relevant features for goalkeepers.
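One way to check whether this group structure is hurting the model is to keep all rows of a given goalkeeper on the same side of the train/validation split, so the validation score reflects goalkeepers the network has never seen. The following is only a minimal sketch in plain Python, not code from the notebooks: the function name `grouped_holdout`, the group labels and the 20% test fraction are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def grouped_holdout(
    groups: Sequence[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split row indices so that no group appears in both train and test.

    `groups[i]` is the group label of row i (here: a hypothetical goalkeeper id).
    """
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)

    # Hold out whole groups, at least one
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])

    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Made-up example: 20 goalkeepers with 10 rows each
groups = [f"gk{i % 20}" for i in range(200)]
train_idx, test_idx = grouped_holdout(groups)
```

If the validation error under such a split is much worse than under a random row-wise split, that would suggest the model is memorising the repeated "player stats" and "team stats" parts rather than learning from them.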
It took me some time to understand everything, but after digging into each notebook I think I have a clearer view of your work now. IIUC, your suspicion is that overfitting is the root cause of the problem: the number of goalkeepers is much smaller than the number of outfield players, which produces a dataset with many rows that are very similar or even identical for a good portion of the features (the player and team features you mentioned). This could absolutely be true, but I have a few more concerns after analysing the notebooks, culminating in the fact that the GK model seems to be the better performing of the two. I will list them as points so that you can address each one individually, albeit without naming the corresponding notebooks, as I honestly do not remember them all.
3/4) One thing that may not be clear, and it is a limitation of the features, is that for a given matchday the stats for players and teams are averaged over the whole current season, rather than over only the games played up to that matchday. This is due to the kind of data I could scrape from FBRef. So there isn't really a temporal component in the features, and I agree that it would be better if there were one.
PS: I wrote you an email with additional questions.
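To make the distinction concrete, here is a small sketch of what "stats up to that matchday" would look like if per-match values were available. It assumes a hypothetical list of per-matchday values for one player (the numbers are made up), and the function name is illustrative only; it does not reflect what FBRef actually exposes.

```python
from typing import List, Optional


def averages_before_each_matchday(values: List[float]) -> List[Optional[float]]:
    """For each matchday k, average the stat over the matchdays before k only.

    The first matchday has no history, so its feature is None.
    """
    out: List[Optional[float]] = []
    total = 0.0
    for played, value in enumerate(values):
        # `played` games are already in `total`; the current one is not
        out.append(total / played if played > 0 else None)
        total += value
    return out


saves_per_matchday = [4, 2, 6, 0]  # made-up values for one goalkeeper
print(averages_before_each_matchday(saves_per_matchday))
# → [None, 4.0, 3.0, 4.0]
```

A whole-season average would instead be 3.0 for every matchday, which includes games that had not yet been played at prediction time.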
I will answer quoting your points:
Then we should look for some other metric, other than
The rationale seems sound, as performance in the current season becomes the more informative variable as the season goes on, while at the beginning of the season we have little to no information besides past performance. This is similar to the cold start problem in recommendation systems. To better model the current form of each player, you could define a decay function that models the temporal dependency and the effect each past performance has on form. For instance, the following decay function gives maximum weight (no decay) to events that have just happened, decays along a cosine curve over the following week, and gives zero weight to anything older than a week:

    import numpy as np
    from scipy.interpolate import interp1d

    T_secs_hour = 60 * 60
    T_secs_day = T_secs_hour * 24
    T_secs_week = T_secs_day * 7


    @np.vectorize
    def timestamp_decay(secs: int) -> float:
        """The timestamp decay function in [0, 1]

        It weighs two subsequent events depending on how close they have happened
        """
        if secs < 0:
            raise ValueError("The interval must be non-negative")

        # Events older than one week carry no weight
        if secs > T_secs_week:
            return 0.

        # Half a cosine period spans one week: cos goes from 1 (now) to -1 (a week ago)
        freq = np.pi / T_secs_week
        # Linearly rescale the cosine from [-1, 1] to [0, 1]
        proj = interp1d([-1, 1], [0, 1])

        return float(proj(np.cos(secs * freq)))

The time decay function could itself be parametric (e.g., a neural network), but such a formulation is a good enough baseline.
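As a rough illustration of how such a decay could feed a "current form" feature, the sketch below computes a decay-weighted average of past match scores. It is an assumption-laden example, not part of the project: it re-implements the same cosine shape with the standard library only, adds a hypothetical `horizon` parameter, and uses made-up scores and ages.

```python
import math
from typing import Optional, Sequence

SECS_DAY = 24 * 60 * 60
SECS_WEEK = 7 * SECS_DAY


def cosine_decay(secs: float, horizon: float = SECS_WEEK) -> float:
    """Weight in [0, 1]: 1 for an event that just happened, 0 at `horizon` and beyond."""
    if secs < 0:
        raise ValueError("The interval must be non-negative")
    if secs > horizon:
        return 0.0
    # Same shape as rescaling cos from [-1, 1] to [0, 1]
    return (math.cos(math.pi * secs / horizon) + 1) / 2


def weighted_form(
    scores: Sequence[float], ages_secs: Sequence[float], horizon: float = SECS_WEEK
) -> Optional[float]:
    """Decay-weighted mean of past scores; None if every score has zero weight."""
    weights = [cosine_decay(age, horizon) for age in ages_secs]
    total = sum(weights)
    if total == 0:
        return None
    return sum(w * s for w, s in zip(weights, scores)) / total


# Made-up scores from games played 3, 10 and 24 days ago, with a 4-week horizon
form = weighted_form(
    [6.0, 7.0, 5.0],
    [3 * SECS_DAY, 10 * SECS_DAY, 24 * SECS_DAY],
    horizon=28 * SECS_DAY,
)
```

Note that with weekly fixtures a one-week horizon gives zero weight to the game played exactly seven days earlier, so a longer horizon (as in the example) is probably needed for the weighting to have any effect.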
I agree, and there would be many different ways to approach it, starting with getting play data from other competitions when none is available in Serie A for the current season. Personally, I would use the fantaindex estimated by the fantaGOAT service when data is missing, as it would save you from navigating through thousands of features to model the cold start problem, which they have already done.
I am not familiar with the FBRef service, but is it a matter of how they present the data or of how your scraping code is built? Anyhow, I know a few other sources that would give you the data at this granularity, which could greatly benefit such a time series problem.
Hi @uPeppe,
In the accompanying blog post you stated how the performance estimation model was good on movement players, but subpar on goalkeepers (GKs). Have you further experimented on the root causes of this phenomenon?