
Bundesliga-prediction-ML

Pipeline

Data understanding

  • Source: Historical Football Results and Betting Odds Data (https://www.football-data.co.uk/data.php)
  • Raw data from the Bundesliga, seasons 2017-2021 (for feature explanations see /data/Notes for football data)
  • Basic background, per Bundesliga season:
    • 306 matches
    • 34 'Spieltage' (matchweeks)
    • 9 matches per matchweek (all 18 teams play each week), so 34 × 9 = 306

Feature engineering

  • Initially select 5 basic features:
    • 'HomeTeam': home team name
    • 'AwayTeam': away team name
    • 'FTHG': Full Time Home Team Goals
    • 'FTAG': Full Time Away Team Goals
    • 'FTR': Full Time Result (H=Home Win, D=Draw, A=Away Win)
  • Based on these, create intermediate features:
    • HTGD: home team goal difference (cumulative goal difference per team per matchweek)
    • ATGD: away team goal difference
    • HTP: home team cumulative points (win: 3 points; draw: 1 point; loss: 0 points)
    • ATP: away team cumulative points
    • HM1, HM2, HM3: results of the home team's last 1/2/3 matches (W: win, D: draw, L: loss)
    • AM1, AM2, AM3: results of the away team's last 1/2/3 matches
    • MW: matchweek (Spieltag) number
    • Divide HTGD, ATGD, HTP, and ATP by the matchweek number to get per-week means, then overwrite the cumulative values
  • Feature selection:
    • drop the intermediate helper columns and the first 3 matchweeks of each season (not enough form information yet)
    • drop 'HTP' and 'ATP' (highly correlated with HTGD and ATGD)
    • dummy-encode the form features
  • Final features: 'HTGD', 'ATGD', 'HM1_D', 'HM1_L', 'HM1_W', 'AM1_D', 'AM1_L', 'AM1_W', 'HM2_D', 'HM2_L', 'HM2_W', 'AM2_D', 'AM2_L', 'AM2_W', 'HM3_D', 'HM3_L', 'HM3_W', 'AM3_D', 'AM3_L', 'AM3_W' (a sketch of the feature construction follows below)
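The sketch below illustrates this construction in pandas: cumulative points and goal differences are tracked per team, recorded before each match (so a result never leaks into its own features), then averaged per matchweek. It assumes a single season of 306 date-sorted matches with the five base columns; all names here are illustrative, not the repository's actual code.

```python
import pandas as pd

def add_cumulative_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive MW, HTP, ATP, HTGD and ATGD from the five base columns.

    Assumes one season (306 matches) sorted by date, with columns
    HomeTeam, AwayTeam, FTHG, FTAG, FTR.
    """
    df = df.reset_index(drop=True).copy()
    teams = pd.concat([df["HomeTeam"], df["AwayTeam"]]).unique()
    points = {t: 0 for t in teams}   # cumulative points per team
    gd = {t: 0 for t in teams}       # cumulative goal difference per team
    rows = []
    for i, m in df.iterrows():
        week = i // 9 + 1            # 9 matches per Spieltag
        # record the standings *before* the match: these are the features
        rows.append((week, points[m.HomeTeam], points[m.AwayTeam],
                     gd[m.HomeTeam], gd[m.AwayTeam]))
        # then update the standings with this match's result
        gd[m.HomeTeam] += m.FTHG - m.FTAG
        gd[m.AwayTeam] += m.FTAG - m.FTHG
        if m.FTR == "H":
            points[m.HomeTeam] += 3
        elif m.FTR == "A":
            points[m.AwayTeam] += 3
        else:
            points[m.HomeTeam] += 1
            points[m.AwayTeam] += 1
    cols = ["MW", "HTP", "ATP", "HTGD", "ATGD"]
    df = pd.concat([df, pd.DataFrame(rows, columns=cols)], axis=1)
    # per-week means override the raw cumulative values, as in the pipeline
    for col in ["HTP", "ATP", "HTGD", "ATGD"]:
        df[col] = df[col] / df["MW"]
    return df

# Once the form columns HM1..HM3 / AM1..AM3 exist, the dummy encoding is:
# df = pd.get_dummies(df, columns=["HM1", "HM2", "HM3", "AM1", "AM2", "AM3"])
```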

Model training and selection

  • Evaluate the following models and compare their performance:

    • Random Forests
    • XGBoost
    • Support Vector Machines
    • Gradient Boosting Classifier
    • K-nearest neighbors
    • Gaussian Naive Bayes
    • Logistic Regression
  • Score comparison: see the tables under 'Wrap-up and reflections'; a sketch of the evaluation loop follows below
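A minimal evaluation loop in this spirit, assuming `X` (the 20 final features) and `y` (the FTR labels, integer-encoded so XGBoost accepts them) are already prepared; the split ratio and the weighted F1 averaging are assumptions, as the README does not state them:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    "Random Forests": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(),
    "SVM": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "KNN": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# X: final feature matrix, y: FTR encoded as integers (H/D/A -> 0/1/2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

for name, model in models.items():
    model.fit(X_train, y_train)
    for split, Xs, ys in (("train", X_train, y_train), ("test", X_test, y_test)):
        pred = model.predict(Xs)
        print(f"{name:22s} {split:5s} "
              f"F1={f1_score(ys, pred, average='weighted'):.4f} "
              f"Acc={accuracy_score(ys, pred):.4f}")
```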

Model parameter tuning

  • Use grid search for hyperparameter selection (sketch below)
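For example, a grid search over the SVM hyperparameters might look like this; the grid values are illustrative, since the README does not list the actual grids:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid; tune per model. Reuses X_train/y_train from above.
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],
    "kernel": ["rbf", "linear"],
}
search = GridSearchCV(SVC(), param_grid, scoring="f1_weighted", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```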

Wrap-up and reflections

 

Score comparison with the initial feature set:

| Model                | Train F1 | Train Acc | Test F1 | Test Acc |
|----------------------|----------|-----------|---------|----------|
| Random Forests       | 0.9988   | 0.9990    | 0.5172  | 0.5990   |
| XGBoost              | 0.9905   | 0.9918    | 0.5255  | 0.5776   |
| SVM                  | 0.5066   | 0.6527    | 0.4952  | 0.6253   |
| Gradient Boosting    | 0.7345   | 0.7807    | 0.5395  | 0.5967   |
| KNN                  | 0.7226   | 0.7664    | 0.4057  | 0.5036   |
| Gaussian Naive Bayes | 0.5543   | 0.6260    | 0.5323  | 0.5847   |
| Logistic Regression  | 0.5369   | 0.6465    | 0.5291  | 0.6134   |
  • Features in the dataset roughly fall into three groups (results data, match statistics, betting odds data)
    • the features currently in the pipeline are mainly results-based
  • Observations from the model evaluation:
    • Baseline: always predicting a home win gives an accuracy of about 0.4392 (see the snippet after this list)
    • Random Forests and XGBoost score near-perfectly on the training data but far worse on the test data (clear overfitting)
      • possible reason: not enough features?
    • SVM and Logistic Regression generalize better on the test data
  • Next steps:
    • create more features from the match statistics
    • incorporate the betting odds data
    • consider participating players and their fitness (information from social media/tweets?)
    • consider relevant news information
  • Discussion:
    • baseline (add a random variable to the features as a sanity check)?
    • overfitting
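The home-win baseline quoted above can be reproduced in one line from the raw results, assuming `df` holds the five seasons with the FTR column:

```python
# Share of matches the home team won over the five seasons: this is the
# accuracy of always predicting a home win (about 0.4392 here).
home_win_rate = (df["FTR"] == "H").mean()
print(f"Home-win baseline accuracy: {home_win_rate:.4f}")
```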

  • Baseline
    • coin flip (win/lose: 50%)
    • uniform guess (win/draw/lose: 33%)
    • naive guess of a home-team win (accuracy about 43.9% according to the 5 years of results)
  • Updated feature engineering
    • use the betting odds data to create more features (done): scores are increasing
      • ideas (sketched below):
        • Prob_odds: convert decimal odds to win probabilities; smaller odds mean a larger win probability
        • CV (coefficient of variation): a 'divergence index' across bookmakers; smaller values mean the bookmakers' opinions are more consistent
    • use the match statistics to create more features (in progress)
    • use tweet data to create more features (in progress)
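A sketch of the two odds-based features, shown here for the home-win outcome. The conversion p = 1/odds is the standard reading of 'the formula' (raw 1/odds still contains the bookmaker margin; normalizing one bookmaker's H/D/A probabilities would remove it), and the bookmaker columns follow the football-data.co.uk naming; both are assumptions about the exact implementation:

```python
# Home-win decimal odds from several bookmakers (football-data.co.uk
# columns: Bet365, Bet&Win, Interwetten, William Hill); adjust the list
# to the odds columns actually present in the season files.
home_odds = ["B365H", "BWH", "IWH", "WHH"]

# Prob_odds: implied win probability p = 1/odds, averaged across
# bookmakers. Smaller decimal odds mean a larger win probability.
df["Prob_odds_H"] = (1.0 / df[home_odds]).mean(axis=1)

# CV: coefficient of variation (std/mean) of the odds across bookmakers,
# the 'divergence index'. Smaller values mean the bookmakers agree more.
df["CV_H"] = df[home_odds].std(axis=1) / df[home_odds].mean(axis=1)
```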

 

Score comparison after adding the betting-odds features:

| Model                | Train F1 | Train Acc | Test F1 | Test Acc |
|----------------------|----------|-----------|---------|----------|
| Random Forests       | 0.9988   | 0.9990    | 0.5817  | 0.6332   |
| XGBoost              | 0.9905   | 0.9918    | 0.5066  | 0.5702   |
| SVM                  | 0.5903   | 0.6683    | 0.5953  | 0.6533   |
| Gradient Boosting    | 0.7345   | 0.7792    | 0.5443  | 0.6017   |
| KNN                  | 0.7195   | 0.7664    | 0.5256  | 0.5759   |
| Gaussian Naive Bayes | 0.6245   | 0.6367    | 0.6371  | 0.6246   |
| Logistic Regression  | 0.5807   | 0.6549    | 0.5894  | 0.6447   |
 
