- Source: Historical Football Results and Betting Odds Data (https://www.football-data.co.uk/data.php)
- Raw data from the Bundesliga, seasons 2017-2021 (for the feature explanations see /data/Notes for football data)
- Basic background:
- One Bundesliga season has:
- 306 matches
- 34 'Spieltage' (matchweeks)
- 9 matches per week (all 18 teams play)
- Initially selected 5 basic features:
- 'HomeTeam': home team name
- 'AwayTeam': away team name
- 'FTHG': Full Time Home Team Goals
- 'FTAG': Full Time Away Team Goals
- 'FTR': Full Time Result (H=Home Win, D=Draw, A=Away Win)
- Based on these features, the following intermediate features were created:
- HTGD: home team goal difference (cumulative goal difference per team, per week)
- ATGD: away team goal difference
- HTP: home team cumulative points (win: 3, draw: 1, loss: 0)
- ATP: away team cumulative points
- HM1, HM2, HM3: results of the home team's last 1/2/3 matches (W: win, D: draw, L: loss)
- AM1, AM2, AM3: results of the away team's last 1/2/3 matches
- MW: matchweek (Spieltag) number
- HTGD, ATGD, HTP and ATP are then divided by the matchweek to get per-week mean values, which overwrite the cumulative ones
- Feature selection (a pipeline sketch follows this list):
- drop the intermediate features and the first 3 weeks of data (not enough history yet)
- drop 'HTP', 'ATP' (highly correlated with HTGD, ATGD)
- dummy-encode the form features
- final features: 'HTGD', 'ATGD', 'HM1_D', 'HM1_L', 'HM1_W', 'AM1_D', 'AM1_L', 'AM1_W', 'HM2_D', 'HM2_L', 'HM2_W', 'AM2_D', 'AM2_L', 'AM2_W', 'HM3_D', 'HM3_L', 'HM3_W', 'AM3_D', 'AM3_L', 'AM3_W'
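A minimal sketch of this feature pipeline, assuming one season of the raw football-data.co.uk CSV sorted by date; `build_features` is a hypothetical helper, and the placeholder result 'M' (no match played yet) disappears once the first 3 weeks are dropped:

```python
from collections import defaultdict

import pandas as pd


def build_features(matches: pd.DataFrame) -> pd.DataFrame:
    """Derive HTGD/ATGD, HTP/ATP, HM1-3/AM1-3 and MW from the raw results."""
    matches = matches.reset_index(drop=True).copy()
    matches["MW"] = matches.index // 9 + 1            # 9 matches per Spieltag

    gd, pts, form = defaultdict(int), defaultdict(int), defaultdict(list)
    rows = []
    for _, m in matches.iterrows():
        h, a = m["HomeTeam"], m["AwayTeam"]
        row = {"HTGD": gd[h], "ATGD": gd[a], "HTP": pts[h], "ATP": pts[a]}
        for i in (1, 2, 3):                           # last 1/2/3 results
            row[f"HM{i}"] = form[h][i - 1] if len(form[h]) >= i else "M"
            row[f"AM{i}"] = form[a][i - 1] if len(form[a]) >= i else "M"
        rows.append(row)

        # update the running totals AFTER recording the pre-match state
        diff = m["FTHG"] - m["FTAG"]
        gd[h] += diff
        gd[a] -= diff
        res = m["FTR"]                                # 'H', 'D' or 'A'
        pts[h] += {"H": 3, "D": 1, "A": 0}[res]
        pts[a] += {"H": 0, "D": 1, "A": 3}[res]
        form[h].insert(0, {"H": "W", "D": "D", "A": "L"}[res])
        form[a].insert(0, {"H": "L", "D": "D", "A": "W"}[res])

    feats = pd.concat([matches, pd.DataFrame(rows)], axis=1)
    for col in ("HTGD", "ATGD", "HTP", "ATP"):        # per-week mean values
        feats[col] = feats[col] / feats["MW"]
    feats = feats[feats["MW"] > 3]                    # drop the first 3 weeks
    feats = feats.drop(columns=["HTP", "ATP"])        # correlated with HTGD/ATGD
    return pd.get_dummies(
        feats, columns=[f"{s}{i}" for s in ("HM", "AM") for i in (1, 2, 3)]
    )
```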
-
Evaluate the following models and compare their performance (a training/evaluation sketch follows the list):
- Random Forests Model
- XGBoost Model
- Support Vector Machines
- Gradient Boosting Classifier
- K-nearest neighbors
- Gaussian Naive Bayes
- Logistic Regression
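A sketch of this comparison, reusing `feats` from the pipeline sketch above; the train/test split and the F1 averaging used in the original notebook are not recorded in these notes, so a stratified 80/20 split and macro-F1 are assumptions:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from xgboost import XGBClassifier

# `feats` is the output of build_features() above
feature_cols = [c for c in feats.columns
                if c.startswith(("HTGD", "ATGD", "HM", "AM"))]
X = feats[feature_cols]
y = LabelEncoder().fit_transform(feats["FTR"])        # H/D/A -> integer labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Random Forests": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(),
    "SVM": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    for split, Xs, ys in (("train", X_train, y_train), ("test", X_test, y_test)):
        pred = model.predict(Xs)
        print(f"{name:22s} {split:5s} "
              f"F1={f1_score(ys, pred, average='macro'):.4f} "
              f"Acc={accuracy_score(ys, pred):.4f}")
```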
-
Score comparison:
- grid search was used for hyperparameter selection (a sketch follows the table)

| Model | Train F1 | Train Acc | Test F1 | Test Acc |
| --- | --- | --- | --- | --- |
| Random Forests | 0.9988 | 0.9990 | 0.5172 | 0.5990 |
| XGBoost | 0.9905 | 0.9918 | 0.5255 | 0.5776 |
| SVM | 0.5066 | 0.6527 | 0.4952 | 0.6253 |
| Gradient Boosting | 0.7345 | 0.7807 | 0.5395 | 0.5967 |
| KNN | 0.7226 | 0.7664 | 0.4057 | 0.5036 |
| Gaussian Naive Bayes | 0.5543 | 0.6260 | 0.5323 | 0.5847 |
| Logistic Regression | 0.5369 | 0.6465 | 0.5291 | 0.6134 |

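The parameter grids that were actually searched are not recorded in these notes; a minimal GridSearchCV sketch for one of the models (SVC here, with illustrative grid values, reusing X_train/y_train from the sketch above):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical grid; the values actually searched are not recorded in the notes.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.1, 0.01], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), param_grid, scoring="f1_macro", cv=5, n_jobs=-1)
search.fit(X_train, y_train)

print(search.best_params_)
best_svm = search.best_estimator_      # refit with the best parameters on the full training set
```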
- The features in the dataset fall roughly into 3 groups (results data, match statistics, betting odds data)
- The features currently in the pipeline are mainly based on the results data.
- Evaluation of the models:
- Baseline: always predicting a home win gives an accuracy of about 0.439
- Random Forests / XGBoost score almost perfectly on the training data (clearly overfitting)
- possible reason: not enough informative features?
- SVM and Logistic Regression generalise better on the test data
- Next steps?
- create more features from the match statistics
- consider the odds from the betting data
- consider the participating players and their fitness/injury status (information from social media/tweets?)
- consider relevant news information
- Discussion:
- baseline (add a random variable to the features)?
- overfitting
- Baselines (computed in the sketch below):
- coin flip (win/lose only, 50%)
- uniform guess (win/draw/lose, 33%)
- naive guess that the home team wins (accuracy about 43.9% according to the 5 seasons of results)
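The baselines can be read straight off the raw results; a minimal sketch, assuming the five season CSVs sit under a data/ directory (the glob pattern and file names are illustrative):

```python
import glob

import pandas as pd

# Assumption: the raw Bundesliga season files live under data/ (pattern is illustrative).
raw = pd.concat(pd.read_csv(path) for path in glob.glob("data/D1_*.csv"))

print("always predict home win:", round((raw["FTR"] == "H").mean(), 3))  # ~0.439 per the notes
print("uniform guess over H/D/A:", round(1 / 3, 3))
print("coin flip restricted to win/lose:", 0.5)
```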
- Update feature engineering
- use the betting odds data to create more features (finished): the scores increased
- idea (sketched below):
- Prob_odds: convert the decimal odds to an implied probability; smaller decimal odds mean a larger win probability
- CV (coefficient of variation): a divergence index across bookmakers; smaller values mean the bookmakers' opinions are more consistent
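A sketch of these two odds-based features, assuming the usual football-data.co.uk decimal-odds columns (triplets such as B365H/B365D/B365A per bookmaker); whether the implied probabilities were normalised to remove the bookmaker margin is not recorded here, so the plain inverse odds are averaged:

```python
import pandas as pd

BOOKIES = ["B365", "BW", "IW", "WH"]          # subset of bookmaker column prefixes in the CSVs

def add_odds_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for outcome in ("H", "D", "A"):           # home win / draw / away win
        odds = df[[f"{b}{outcome}" for b in BOOKIES]]
        # Prob_odds: implied probability = 1 / decimal odds, averaged over the bookmakers
        df[f"Prob_odds_{outcome}"] = (1.0 / odds).mean(axis=1)
        # CV: std / mean of the quoted odds; small values mean the bookmakers agree
        df[f"CV_{outcome}"] = odds.std(axis=1) / odds.mean(axis=1)
    return df
```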
- further ideas:
- use the match statistics to create more features (work in progress)
- use tweets data to create more features (work in progress)
- Scores with the odds-based features added:

| Model | Train F1 | Train Acc | Test F1 | Test Acc |
| --- | --- | --- | --- | --- |
| Random Forests | 0.9988 | 0.9990 | 0.5817 | 0.6332 |
| XGBoost | 0.9905 | 0.9918 | 0.5066 | 0.5702 |
| SVM | 0.5903 | 0.6683 | 0.5953 | 0.6533 |
| Gradient Boosting | 0.7345 | 0.7792 | 0.5443 | 0.6017 |
| KNN | 0.7195 | 0.7664 | 0.5256 | 0.5759 |
| Gaussian Naive Bayes | 0.6245 | 0.6367 | 0.6371 | 0.6246 |
| Logistic Regression | 0.5807 | 0.6549 | 0.5894 | 0.6447 |