- Source: Historical Football Results and Betting Odds Data (https://www.football-data.co.uk/data.php)
- Raw data from the Bundesliga, seasons 2017-2021 (for the feature explanations see /data/Notes for football data)
- Basic background:
- One Bundesliga season has:
- 306 matches
- 34 'Spieltage' (matchweeks)
- 9 matches per week (all 18 teams play)
- Initially selected 5 basic features:
- 'HomeTeam': home team name
- 'AwayTeam': away team name
- 'FTHG': Full Time Home Team Goals
- 'FTAG': Full Time Away Team Goals
- 'FTR': Full Time Result (H=Home Win, D=Draw, A=Away Win)
- Based on these features, the following intermediate features were created:
- HTGD: home team goal difference (cumulative goal difference per team, per week)
- ATGD: away team goal difference
- HTP: home team cumulative points (win: 3, draw: 1, loss: 0)
- ATP: away team cumulative points
- HM1, HM2, HM3: results of the home team's last 1/2/3 matches (W: win, D: draw, L: loss)
- AM1, AM2, AM3: results of the away team's last 1/2/3 matches
- MW: matchweek (Spieltag) number
- HTGD, ATGD, HTP and ATP are then divided by the matchweek to get per-week mean values, which overwrite the cumulative ones
- Feature selection (a pipeline sketch follows this list):
- drop the intermediate features and the first 3 weeks of data (not enough history yet)
- drop 'HTP', 'ATP' (highly correlated with HTGD, ATGD)
- dummy-encode the form features
- final features: 'HTGD', 'ATGD', 'HM1_D', 'HM1_L', 'HM1_W', 'AM1_D', 'AM1_L', 'AM1_W', 'HM2_D', 'HM2_L', 'HM2_W', 'AM2_D', 'AM2_L', 'AM2_W', 'HM3_D', 'HM3_L', 'HM3_W', 'AM3_D', 'AM3_L', 'AM3_W'
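A minimal sketch of this feature pipeline, assuming one season of the raw football-data.co.uk CSV sorted by date; `build_features` is a hypothetical helper, and the placeholder result 'M' (no match played yet) disappears once the first 3 weeks are dropped:

```python
from collections import defaultdict

import pandas as pd


def build_features(matches: pd.DataFrame) -> pd.DataFrame:
    """Derive HTGD/ATGD, HTP/ATP, HM1-3/AM1-3 and MW from the raw results."""
    matches = matches.reset_index(drop=True).copy()
    matches["MW"] = matches.index // 9 + 1            # 9 matches per Spieltag

    gd, pts, form = defaultdict(int), defaultdict(int), defaultdict(list)
    rows = []
    for _, m in matches.iterrows():
        h, a = m["HomeTeam"], m["AwayTeam"]
        row = {"HTGD": gd[h], "ATGD": gd[a], "HTP": pts[h], "ATP": pts[a]}
        for i in (1, 2, 3):                           # last 1/2/3 results
            row[f"HM{i}"] = form[h][i - 1] if len(form[h]) >= i else "M"
            row[f"AM{i}"] = form[a][i - 1] if len(form[a]) >= i else "M"
        rows.append(row)

        # update the running totals AFTER recording the pre-match state
        diff = m["FTHG"] - m["FTAG"]
        gd[h] += diff
        gd[a] -= diff
        res = m["FTR"]                                # 'H', 'D' or 'A'
        pts[h] += {"H": 3, "D": 1, "A": 0}[res]
        pts[a] += {"H": 0, "D": 1, "A": 3}[res]
        form[h].insert(0, {"H": "W", "D": "D", "A": "L"}[res])
        form[a].insert(0, {"H": "L", "D": "D", "A": "W"}[res])

    feats = pd.concat([matches, pd.DataFrame(rows)], axis=1)
    for col in ("HTGD", "ATGD", "HTP", "ATP"):        # per-week mean values
        feats[col] = feats[col] / feats["MW"]
    feats = feats[feats["MW"] > 3]                    # drop the first 3 weeks
    feats = feats.drop(columns=["HTP", "ATP"])        # correlated with HTGD/ATGD
    return pd.get_dummies(
        feats, columns=[f"{s}{i}" for s in ("HM", "AM") for i in (1, 2, 3)]
    )
```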
-
Evaluate the following models and compare their performance (a training/evaluation sketch follows the list):
- Random Forests Model
- XGBoost Model
- Support Vector Machines
- Gradient Boosting Classifier
- K-nearest neighbors
- Gaussian Naive Bayes
- Logistic Regression
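A sketch of this comparison, reusing `feats` from the pipeline sketch above; the train/test split and the F1 averaging used in the original notebook are not recorded in these notes, so a stratified 80/20 split and macro-F1 are assumptions:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from xgboost import XGBClassifier

# `feats` is the output of build_features() above
feature_cols = [c for c in feats.columns
                if c.startswith(("HTGD", "ATGD", "HM", "AM"))]
X = feats[feature_cols]
y = LabelEncoder().fit_transform(feats["FTR"])        # H/D/A -> integer labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Random Forests": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(),
    "SVM": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    for split, Xs, ys in (("train", X_train, y_train), ("test", X_test, y_test)):
        pred = model.predict(Xs)
        print(f"{name:22s} {split:5s} "
              f"F1={f1_score(ys, pred, average='macro'):.4f} "
              f"Acc={accuracy_score(ys, pred):.4f}")
```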
-
Score comparison:
- grid search was used for hyperparameter selection (a sketch follows the table)

| Model | Train F1 | Train Acc | Test F1 | Test Acc |
| --- | --- | --- | --- | --- |
| Random Forests | 0.9988 | 0.9990 | 0.5172 | 0.5990 |
| XGBoost | 0.9905 | 0.9918 | 0.5255 | 0.5776 |
| SVM | 0.5066 | 0.6527 | 0.4952 | 0.6253 |
| Gradient Boosting | 0.7345 | 0.7807 | 0.5395 | 0.5967 |
| KNN | 0.7226 | 0.7664 | 0.4057 | 0.5036 |
| Gaussian Naive Bayes | 0.5543 | 0.6260 | 0.5323 | 0.5847 |
| Logistic Regression | 0.5369 | 0.6465 | 0.5291 | 0.6134 |

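The parameter grids that were actually searched are not recorded in these notes; a minimal GridSearchCV sketch for one of the models (SVC here, with illustrative grid values, reusing X_train/y_train from the sketch above):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical grid; the values actually searched are not recorded in the notes.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.1, 0.01], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), param_grid, scoring="f1_macro", cv=5, n_jobs=-1)
search.fit(X_train, y_train)

print(search.best_params_)
best_svm = search.best_estimator_      # refit with the best parameters on the full training set
```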
- The features in the dataset fall roughly into 3 groups (results data, match statistics, betting odds data)
- The features currently in the pipeline are mainly based on the results data.
- Evaluation of the models:
- Baseline: always predicting a home win gives an accuracy of about 0.439
- Random Forests / XGBoost score almost perfectly on the training data (clearly overfitting)
- possible reason: not enough informative features?
- SVM and Logistic Regression generalise better on the test data
- Next steps?
- create more features from the match statistics
- consider the odds from the betting data
- consider the participating players and their fitness/injury status (information from social media/tweets?)
- consider relevant news information
- Discussion:
- baseline (add a random variable to the features)?
- overfitting
- Baselines (computed in the sketch below):
- coin flip (win/lose only, 50%)
- uniform guess (win/draw/lose, 33%)
- naive guess that the home team wins (accuracy about 43.9% according to the 5 seasons of results)
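The baselines can be read straight off the raw results; a minimal sketch, assuming the five season CSVs sit under a data/ directory (the glob pattern and file names are illustrative):

```python
import glob

import pandas as pd

# Assumption: the raw Bundesliga season files live under data/ (pattern is illustrative).
raw = pd.concat(pd.read_csv(path) for path in glob.glob("data/D1_*.csv"))

print("always predict home win:", round((raw["FTR"] == "H").mean(), 3))  # ~0.439 per the notes
print("uniform guess over H/D/A:", round(1 / 3, 3))
print("coin flip restricted to win/lose:", 0.5)
```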
- Update feature engineering
- use the betting odds data to create more features (finished): the scores increased
- idea (sketched below):
- Prob_odds: convert the decimal odds to an implied probability; smaller decimal odds mean a larger win probability
- CV (coefficient of variation): a divergence index across bookmakers; smaller values mean the bookmakers' opinions are more consistent
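A sketch of these two odds-based features, assuming the usual football-data.co.uk decimal-odds columns (triplets such as B365H/B365D/B365A per bookmaker); whether the implied probabilities were normalised to remove the bookmaker margin is not recorded here, so the plain inverse odds are averaged:

```python
import pandas as pd

BOOKIES = ["B365", "BW", "IW", "WH"]          # subset of bookmaker column prefixes in the CSVs

def add_odds_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for outcome in ("H", "D", "A"):           # home win / draw / away win
        odds = df[[f"{b}{outcome}" for b in BOOKIES]]
        # Prob_odds: implied probability = 1 / decimal odds, averaged over the bookmakers
        df[f"Prob_odds_{outcome}"] = (1.0 / odds).mean(axis=1)
        # CV: std / mean of the quoted odds; small values mean the bookmakers agree
        df[f"CV_{outcome}"] = odds.std(axis=1) / odds.mean(axis=1)
    return df
```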
- further ideas:
- use the match statistics to create more features (work in progress)
- use tweets data to create more features (work in progress)
- Scores with the odds-based features added:

| Model | Train F1 | Train Acc | Test F1 | Test Acc |
| --- | --- | --- | --- | --- |
| Random Forests | 0.9988 | 0.9990 | 0.5817 | 0.6332 |
| XGBoost | 0.9905 | 0.9918 | 0.5066 | 0.5702 |
| SVM | 0.5903 | 0.6683 | 0.5953 | 0.6533 |
| Gradient Boosting | 0.7345 | 0.7792 | 0.5443 | 0.6017 |
| KNN | 0.7195 | 0.7664 | 0.5256 | 0.5759 |
| Gaussian Naive Bayes | 0.6245 | 0.6367 | 0.6371 | 0.6246 |
| Logistic Regression | 0.5807 | 0.6549 | 0.5894 | 0.6447 |