These days it seems everyone has their own xG model, and I thought, why not me? So I contacted Dave Willoughby, Head of Football at Stratagem Technologies, makers of StrataBet sports betting software, and he kindly set me up with some detailed shot tracking data for the Scottish Premiership (xy coordinates and such).
Over the course of the 2017-18 SPFL season, I will be developing and fine-tuning my own xG model using StrataBet data. I will use my blog to describe various aspects of the model in detail (or at least in greater detail than Twitter allows) in a series of relatively short posts (hence, Part I).
In this first post, I will introduce the machine learning technique, Random Forests, which is the approach I will be using to generate xG values for individual shots. Subsequent posts will focus on data/model inputs, preliminary validation, and other topics.
Before I begin, let me just say that there already exists a very fine xG model for the Scottish Premiership, developed by Matt Rhein and Christian Wulff, also using StrataBet data. Matt and Christian's model uses fixed values based on historical conversion rates under specific conditions, such as location on the pitch.
My approach differs in that I will use Random Forests, a machine learning technique, to estimate an xG value for each individual shot rather than assigning fixed values.
So what exactly is Random Forests? According to Chen et al. (2004):
"Random forest (Breiman, 2001) is an ensemble of unpruned classification or regression trees, induced from bootstrap samples of the training data, using random feature selection in the tree induction process. Prediction is made by aggregating (majority vote for classification or averaging for regression) the predictions of the ensemble."
Okay, what the heck does that mean? Let's try to unpack it.
For illustrative purposes, here's a relatively simple example of a classification tree to categorize shot attempts as either “goal” or “no goal” based on StrataBet xy coordinates.
IF (location_y >= 82) THEN Class = no goal
IF (location_y < 82) AND (location_x < -14) THEN Class = no goal
IF (location_y < 82) AND (location_x >= -14) AND (location_x >= 16) THEN Class = no goal
IF (location_y < 82) AND (location_x >= -14) AND (location_x < 16) THEN Class = goal
In the StrataBet coordinate system, the left and right goalposts are demarcated by x = 15 and x = -15 respectively, while the edge of the 18-yard box is indicated by y = 81.
So our classification tree is showing that shot attempts from central locations inside the penalty area are more likely to be scored than long distance and/or wide shots. This is rather obvious to anyone who's watched football, but what is not so obvious is that a computer algorithm can identify this pattern automatically by learning from the data.
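To make the idea concrete, here is a minimal Python sketch of that toy tree, written so that central shots from inside the penalty area come out as "goal" and everything else as "no goal". The function name `classify_shot` is my own invention for illustration, and the thresholds are just the example values, not anything a real model would be limited to.

```python
def classify_shot(location_x: float, location_y: float) -> str:
    """Toy classification tree for a single shot (illustrative thresholds only)."""
    if location_y >= 82:
        return "no goal"  # beyond the edge of the 18-yard box (y = 81)
    if location_x < -14:
        return "no goal"  # wide on the x = -15 side of the goal
    if location_x >= 16:
        return "no goal"  # wide on the x = 15 side of the goal
    return "goal"  # central and inside the box


# A few hypothetical shots: central and close, central but long range, wide.
print(classify_shot(0, 70))
print(classify_shot(0, 90))
print(classify_shot(-30, 70))
```

A real tree in the forest would have many more branches and use more inputs than just the xy coordinates, but the structure is the same: a chain of yes/no questions ending in a class.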
You don’t need to worry about exactly how the algorithm works. Just note that Random Forests aggregates the output of hundreds of classification trees (larger and more complex than the example given), and that each tree is built using a random sample of predictors and cases from the original data set. This randomization makes Random Forests much more reliable than a single classification tree.
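For the curious, the two kinds of randomization can be sketched in a few lines of plain Python. This is only a rough illustration under my own assumptions: the shot records and the predictor names `body_part` and `assist_type` are made up, and the helper functions are hypothetical. Note also that in Breiman's formulation the random subset of predictors is typically redrawn at each split within a tree, whereas this sketch draws it once per tree to keep things short.

```python
import random


def bootstrap_sample(cases, rng):
    """Draw len(cases) cases with replacement (a bootstrap sample)."""
    return [rng.choice(cases) for _ in cases]


def random_predictors(predictors, k, rng):
    """Pick k distinct predictors at random."""
    return rng.sample(predictors, k)


# Hypothetical shot records; real data would have many more rows and columns.
shots = [
    {"location_x": 0, "location_y": 70, "goal": 1},
    {"location_x": -30, "location_y": 75, "goal": 0},
    {"location_x": 5, "location_y": 95, "goal": 0},
    {"location_x": 10, "location_y": 60, "goal": 1},
    {"location_x": 25, "location_y": 80, "goal": 0},
]
# Hypothetical predictor names, for illustration only.
predictors = ["location_x", "location_y", "body_part", "assist_type"]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
for tree_id in range(3):
    cases = bootstrap_sample(shots, rng)
    chosen = random_predictors(predictors, 2, rng)
    # Each tree would now be grown from its own cases and predictors.
    print(tree_id, len(cases), chosen)
```

Because every tree sees a slightly different version of the data, the trees make different mistakes, and those mistakes tend to wash out when the trees are combined.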
For our purposes, the xG value of a shot is the proportion of trees that “vote” for the outcome goal. So if 17.6% of all trees in an ensemble classify a shot as a goal, then xG = 0.176 for that attempt. Given that the goal probability of an average shot is around 10%, this example would be regarded as a relatively high-quality attempt (although by no means a sure thing).
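The vote counting itself is simple arithmetic. Here is a small sketch, assuming a hypothetical ensemble of 500 trees of which 88 vote "goal"; the function name `xg_from_votes` and the tree count are my own choices for illustration.

```python
def xg_from_votes(votes):
    """Return the fraction of trees that voted 'goal' for one shot."""
    if not votes:
        raise ValueError("need at least one vote")
    goal_votes = sum(1 for vote in votes if vote == "goal")
    return goal_votes / len(votes)


# Hypothetical ensemble: 88 of 500 trees classify the shot as a goal.
votes = ["goal"] * 88 + ["no goal"] * 412
print(xg_from_votes(votes))  # 88 / 500
```

With 88 of 500 trees voting "goal", the fraction works out to the 0.176 used in the example above.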
Next time, I will describe the data and inputs for my Random Forests xG model.
Chen, C., Liaw, A., & Breiman, L. (2004). Using random forest to learn imbalanced data. University of California, Berkeley, 110.