The FIFA World Cup 2018 Russia is here! In this post, I will share how I use Python machine learning to develop betting strategies for the 1st Group Stage games. I will provide:
- my predictions of the 1st Group Stage games
- the data sources and preparation steps used for the analysis and predictions
- the team stats derived and selected as model inputs to predict match outcomes
- overview of the different machine learning models adopted
- how we assess the accuracy of our predictive model
- how we use the model outputs to inform our betting decision
Predictions for 1st World Cup Group Stage games and recommended value bets
If you are not interested in the data science behind the recommended betting decisions and want to get straight to our suggested betting matches, here are the predictions and recommended bets:
If you bet based on the predicted outcomes, the green shaded matches are our recommended value bets. The value bets are identified based on the betting odds and the probability of the predicted outcome being correct.
Preparing data sources for World Cup match analysis and predictions
We use 3 data sources for our predictive model:
- match results from 1995 for all international games and tournaments (including friendlies) – obtained from a Kaggle user (thank you Mart!) and stored as a CSV file
- list of international football teams and their respective confederations (extracted from Wikipedia)
- Elo Ratings of international football teams at the beginning of 1995
In most data analysis projects, there will be some data cleansing to do. It is highly likely that you will need to massage your data when you have multiple data sources.
In our project, we have 3 different sources, and each captures team names differently. For example, the South Korean team appears as South Korea in one data source and as Korea Republic in another.
We use Python and Pandas (a Python library) to process this data and create a standardised team name and code for the entire project.
This allows us to link and combine data across the various data sources and create a data set that has match details, team confederation details and their ELO ratings at the beginning of 1995.
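The name standardisation and linking step can be sketched with Pandas roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the `NAME_MAP` dictionary and the toy rows are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical mapping from source-specific spellings to one standard name.
NAME_MAP = {
    "Korea Republic": "South Korea",
    "USA": "United States",
}


def standardise(names: pd.Series) -> pd.Series:
    """Trim whitespace and replace known variants with the standard name."""
    return names.str.strip().replace(NAME_MAP)


# Toy stand-ins for two of the data sources (illustrative rows only).
matches = pd.DataFrame({
    "date": ["2002-06-04", "2002-06-10"],
    "home_team": ["Korea Republic", "South Korea"],
    "away_team": ["Poland", "USA"],
})
confederations = pd.DataFrame({
    "team": ["South Korea", "Poland", "United States"],
    "confederation": ["AFC", "UEFA", "CONCACAF"],
})

for col in ("home_team", "away_team"):
    matches[col] = standardise(matches[col])

# Attach the home side's confederation. validate= raises an error if the
# lookup table unexpectedly contains duplicate team rows.
home_lookup = confederations.rename(
    columns={"team": "home_team", "confederation": "home_confederation"}
)
merged = matches.merge(home_lookup, on="home_team", how="left", validate="many_to_one")
print(merged)
```

Once every source uses the same names, the same merge pattern can attach the away side's confederation and each team's starting ELO rating.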
What is ELO?
The ELO rating system is a method for rating teams that was first used to rate players in international chess competitions. After each match, each team gains or loses ELO points. The amount gained or lost depends on the margin of victory (goal difference), the strength of the opponent (based on ELO), the type of tournament and home advantage.
The advantage of ELO is that it takes into account all the above factors.
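To make the update concrete, here is a small sketch of one ELO update that combines the four factors. The post does not state which constants it uses, so the goal-difference multiplier, the 100-point home bonus and the tournament weight `k` below are assumptions that follow the commonly published World Football Elo Ratings convention. The function names are made up for this example.

```python
def goal_multiplier(goal_diff: int) -> float:
    """Scale the update by margin of victory (assumed convention)."""
    n = abs(goal_diff)
    if n <= 1:
        return 1.0
    if n == 2:
        return 1.5
    return (11 + n) / 8


def expected_result(rating: float, opp_rating: float, home_advantage: float = 0.0) -> float:
    """Expected score between 0 and 1, given the rating gap and any home bonus."""
    dr = rating + home_advantage - opp_rating
    return 1.0 / (10 ** (-dr / 400) + 1)


def elo_change(rating: float, opp_rating: float, goals_for: int, goals_against: int,
               k: float, home_advantage: float = 0.0) -> float:
    """Points gained (positive) or lost (negative) by the team after one match."""
    if goals_for > goals_against:
        result = 1.0
    elif goals_for == goals_against:
        result = 0.5
    else:
        result = 0.0
    g = goal_multiplier(goals_for - goals_against)
    return k * g * (result - expected_result(rating, opp_rating, home_advantage))


# Illustrative ratings only: a home side beats a higher-rated visitor 2-0.
# k=60 is the assumed weight for a World Cup finals match.
home_gain = elo_change(1750, 1850, 2, 0, k=60, home_advantage=100)
away_gain = elo_change(1850, 1750, 0, 2, k=60, home_advantage=-100)
print(home_gain, away_gain)
```

Under these assumptions the update is zero-sum: whatever one team gains, the opponent loses.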
Team Stats used to predict matches
This is the most important part of the process. In this process, we identify the important stats for each team that will help us in predicting the outcome. This process requires domain knowledge.
Outside of machine learning, we do this all the time. We look at previous performance stats of a particular team to form our view of whether a team will win, lose or draw.
Machine Learning helps formalise this process and has the following advantages:
- standardised historical stats for all teams
- allows calculating the appropriate stats to quantify consistency of team performance
- take into consideration multiple stats and identify their relationship with match outcomes
- data abundance and computing power enable calculation of various stats of each team and allow us to identify if they are significantly related to match outcomes (our human brain can only process so much data)
- use of standardised metrics takes the emotion out of the decision making
In our project, we use the following historical team stats:
- Wins in previous X non-friendly matches (we decided on 7 matches because international teams do not play many competitive matches; 7 competitive matches can span ~2 years)
- Losses in previous X non-friendly matches (we use 7 matches)
- ELO ratings of each team at each point in time (e.g. if Germany faces Brazil on 1st Jul 1998, we calculate the number of ELOs Germany and Brazil have as at that date)
- Wins in the previous 7 matches against top 20 ELO ranked teams (we determine team rankings by ELO at a particular date, and tabulate the wins against the top 20)
- Goals Scored, Goals Conceded and Elos Gained or lost in the previous 7 matches (we also tabulate the median and variance to quantify consistency of teams)
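The rolling stats above can be derived in Pandas along these lines. This is a sketch under assumptions: it expects a table with one row per team per match, already filtered to non-friendly games, and the column names (`team`, `date`, `goals_for`, `goals_against`) and the helper `add_form_features` are hypothetical.

```python
import pandas as pd

WINDOW = 7  # the post settles on the previous 7 non-friendly matches


def add_form_features(team_matches: pd.DataFrame, window: int = WINDOW) -> pd.DataFrame:
    """Add rolling form stats that use only matches played before each row."""
    df = team_matches.sort_values(["team", "date"]).copy()
    df["win"] = (df["goals_for"] > df["goals_against"]).astype(int)
    df["loss"] = (df["goals_for"] < df["goals_against"]).astype(int)

    def prior(col: str, how: str) -> pd.Series:
        # shift(1) drops the current match from its own window, so the
        # feature never sees the result it is meant to help predict.
        return df.groupby("team")[col].transform(
            lambda s: s.shift(1).rolling(window, min_periods=1).agg(how)
        )

    df["wins_prev"] = prior("win", "sum")
    df["losses_prev"] = prior("loss", "sum")
    df["goals_for_median_prev"] = prior("goals_for", "median")
    df["goals_for_var_prev"] = prior("goals_for", "var")
    return df


# Toy data (illustrative only): team A plays three matches, team B two.
toy = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2017-01-01", "2017-03-01", "2017-06-01", "2017-01-01", "2017-06-01"]
    ),
    "goals_for": [1, 0, 2, 0, 3],
    "goals_against": [0, 2, 2, 1, 0],
})
features = add_form_features(toy)
print(features[["team", "date", "wins_prev", "losses_prev", "goals_for_median_prev"]])
```

A team's first match has no history, so its features are missing (NaN) and need to be dropped or filled before modelling. The same pattern extends to goals conceded and ELO points gained or lost.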
Applying Python Supervised Learning Models to make predictions and selecting the most effective
Time to make our predictions. We apply supervised learning models provided by the Python library scikit-learn.
The models are supervised because each one is trained on our historical dataset of matches, team stats and known results, and learns from those known outcomes to make its predictions.
We use classification models because we are classifying matches into Win / Draw / Loss.
We run various classification models based on different algorithms and pick the one with the best accuracy.
In our project, we apply the following models:
- Random Forests
- AdaBoost Classifier
- Nearest Neighbour
- Gaussian Naive Bayes
- Logistic Regression
Our results indicate AdaBoost has the best accuracy.
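The comparison can be sketched as a simple loop over the five model types. Everything here is illustrative: the data is synthetic (generated by `make_classification` as a stand-in for the real match features), the hyperparameters are mostly scikit-learn defaults rather than the settings used in the project, and "Nearest Neighbour" is assumed to mean `KNeighborsClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the feature table: 3 classes play the role of
# win / draw / loss. Real inputs would be the team stats described earlier.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Nearest Neighbour": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit every model on the same training split and score it on the same test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
print("Best on this synthetic data:", best)
```

The ranking on synthetic data says nothing about which model wins on real matches. With real match data, splitting by date (train on earlier matches, test on later ones) may be a fairer test than a random split.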
How accurate are our machine learning predictions?
When assessing the accuracy of classification models, we create a matrix called the CONFUSION MATRIX.
In this matrix, we are overlaying the actual outcomes vs predicted outcomes and calculating % of predicted outcomes that are accurate.
In our confusion matrix example:
- 59% of all predictions were correct
- 64% of win predictions were accurate
- 54% of loss predictions were accurate
- 34% of draw predictions were accurate
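As a sketch of how such figures are derived, the snippet below builds a confusion matrix from ten made-up predictions (not our model's output) and computes the overall accuracy and the share of each predicted outcome that turned out correct.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

LABELS = ["win", "draw", "loss"]

# Made-up actual and predicted outcomes, purely for illustration.
y_true = ["win", "win", "win", "draw", "draw", "loss", "loss", "loss", "win", "loss"]
y_pred = ["win", "win", "draw", "draw", "win", "loss", "loss", "win", "win", "loss"]

# Rows are actual outcomes, columns are predicted outcomes, in LABELS order.
cm = confusion_matrix(y_true, y_pred, labels=LABELS)
accuracy = accuracy_score(y_true, y_pred)

# Share of each predicted class that was correct: diagonal / column total.
# (A class that is never predicted would give a zero column total.)
per_prediction_accuracy = cm.diagonal() / cm.sum(axis=0)

print(cm)
print(f"overall: {accuracy:.0%}")
for label, share in zip(LABELS, per_prediction_accuracy):
    print(f"{label} predictions correct: {share:.0%}")
```

In this toy example, 70% of all predictions are correct, and 60% of win, 50% of draw and 100% of loss predictions are correct.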
Using machine learning predictions to identify value bets
We have our predictions and have a view on how accurate they are.
How can we use these insights to inform our betting?
To do this, we need to derive the expected gain from making a bet on a particular match based on the predicted outcome.
Expected Gain = (Probability of Predicted Event Occurring * Profit when it occurs) - (Probability of Predicted Event Not Occurring * Loss, i.e. the amount you bet)
Looking at the table above, we identify value bets as those having an Expected Gain > $0.10 (based on a $1 bet)
- We used the overall probability of win/draw/loss to identify value bets. This is a generalisation and each match may have a different probability based on its characteristics
- For predictions that match the bookmaker's favourite, there is no value in betting as the expected gain is below zero. It appears that bookmakers price in a premium for favourites
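As a worked example of the expected gain calculation and the $0.10 threshold, here is a minimal sketch. It assumes decimal odds (a winning $1 bet at odds of 2.0 returns $2, a $1 profit) and treats the stake as the amount lost when the prediction fails. The odds in the example are hypothetical, and the 64% figure is the accuracy of win predictions quoted above.

```python
def expected_gain(p_correct: float, decimal_odds: float, stake: float = 1.0) -> float:
    """Expected profit of a bet: win the profit with p_correct, else lose the stake."""
    profit = stake * (decimal_odds - 1.0)
    return p_correct * profit - (1.0 - p_correct) * stake


def is_value_bet(p_correct: float, decimal_odds: float,
                 threshold: float = 0.10, stake: float = 1.0) -> bool:
    """Flag bets whose expected gain exceeds the threshold (per stake)."""
    return expected_gain(p_correct, decimal_odds, stake) > threshold


# Hypothetical odds for two predicted wins, using the 64% win accuracy.
for odds in (2.0, 1.5):
    gain = expected_gain(0.64, odds)
    print(f"odds {odds}: expected gain {gain:+.2f}, value bet: {is_value_bet(0.64, odds)}")
```

At odds of 2.0 the expected gain is $0.28 per $1 bet, which clears the threshold. At odds of 1.5, typical of a favourite, it is -$0.04, so the bet is not a value bet.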
We will continue to post and track our model’s performance throughout this World Cup.
Applying Classification Supervised Learning Models in Business
Classification models are useful and can be applied to numerous business scenarios.
For example, a business may want to be proactive and identify and provide offers to weekly customers who may Upgrade/Downgrade/Churn from their services. This is similar to our match prediction of Win/Draw/Loss outcomes.
Similarly, developing the model will require domain knowledge of:
- customer interactions and relationship with the events (for example, the volume of calls to service staff may be related to customers churning)
- identification of data points that enable one to describe the interactions effectively (how do we capture the volume of calls, and what are the appropriate metrics to describe it?)
Once the model is developed and the accuracy is quantified, we can overlay:
- the potential gains (i.e. gain in customer lifetime revenue) vs
- the potential losses (cost of the offer) vs
- the probability of the predicted event occurring
to evaluate the potential expected gain from our investment.
In future posts, I will share more details of the Python code involved in developing and applying Python machine learning modules.