In this post, I take a closer look at the granddaddy of advanced soccer stats, James Grayson's Total Shots Ratio (TSR). In particular, I will assess the potential utility of TSR as a predictor of match outcome in Scottish football.
To do this, I downloaded match results data for the Scottish top-flight (2000-2014) from my usual source, XMLSOCCER.COM, and wrangled it into R using the XML package.
To begin, the plot below shows that there is a positive relationship between TSR and goal difference at the match level, as expected. This relationship is statistically significant with TSR explaining ~20% of the variation in goal difference at the match level.
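For readers unfamiliar with the metric, TSR is a team's share of all shots taken in a match: shots for divided by shots for plus shots against. The sketch below is in Python rather than R, purely for illustration. It computes match-level TSR and the squared Pearson correlation with goal difference, which is what "explaining ~20% of the variation" refers to for a straight-line fit. The toy matches, the function names, and the handling of a zero-shot match are all assumptions and are not taken from the data set used here.

```python
def total_shots_ratio(shots_for, shots_against):
    """TSR = shots for / (shots for + shots against); 0.5 is an even share."""
    total = shots_for + shots_against
    if total == 0:
        return 0.5  # assumption: treat a match with no shots as an even split
    return shots_for / total


def r_squared(xs, ys):
    """Squared Pearson correlation between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Made-up matches, for illustration only:
# (home shots, away shots, home goals, away goals)
toy_matches = [(14, 8, 2, 0), (9, 12, 1, 1), (6, 15, 0, 2),
               (11, 11, 1, 2), (17, 5, 3, 1)]

home_tsr = [total_shots_ratio(hs, aws) for hs, aws, _, _ in toy_matches]
home_goal_diff = [hg - ag for _, _, hg, ag in toy_matches]

print(round(r_squared(home_tsr, home_goal_diff), 2))
```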
Next, I was curious to see how well TSR predicted the outcome of a match (i.e., which team wins). So I built a simple logistic regression model in R using the glm function.
First, I transformed the continuous variable home goal difference into a binary response variable for use in the logistic regression (1 = home win, 0 = home loss or draw). Then I split the sample into separate training (n=1911) and validation (n=1273) data sets.
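The two preparation steps can be sketched as follows, again in Python for illustration. The sample sizes above imply a split of roughly 60/40, but whether the split was random or chronological is not stated, so the seeded random shuffle here is an assumption. The function names and the toy rows are also made up.

```python
import random


def home_win_label(home_goal_diff):
    """1 = home win, 0 = home loss or draw."""
    return 1 if home_goal_diff > 0 else 0


def split_train_validation(rows, train_fraction=0.6, seed=1):
    """Shuffle a copy of rows and cut it into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Made-up rows, for illustration only: (home TSR, home goal difference)
toy_rows = [(0.64, 2), (0.43, 0), (0.29, -2), (0.50, -1), (0.77, 2),
            (0.55, 1), (0.38, 0), (0.61, -1), (0.47, 3), (0.70, 1)]

labelled = [(tsr, home_win_label(gd)) for tsr, gd in toy_rows]
train, validation = split_train_validation(labelled)

print(len(train), len(validation))
```

Note that draws are folded into the 0 class, so the model only ever answers "home win or not".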
The logistic regression was built using the training data, and then applied to the validation data to estimate prediction accuracy.
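To make the fitting step concrete, here is a minimal stand-alone sketch of a one-predictor logistic regression. It is not the glm call used for this post. It takes plain Newton-Raphson steps on the log-likelihood, which for a logit link amounts to the same iteratively reweighted least squares idea that glm is based on. The toy data, the function names, and the 0.5 cut-off for calling a win are all assumptions.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(xs, ys, iterations=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break                  # information matrix is numerically singular
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < 1e-10:
            break                  # converged
    return b0, b1


def predict_win(b0, b1, tsr, threshold=0.5):
    """Classify as a home win when the fitted probability exceeds the threshold."""
    return 1 if sigmoid(b0 + b1 * tsr) > threshold else 0


# Made-up training data, for illustration only (classes deliberately overlap)
train_tsr = [0.30, 0.35, 0.42, 0.45, 0.50, 0.52, 0.58, 0.60, 0.66, 0.72]
train_win = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(train_tsr, train_win)
print(predict_win(b0, b1, 0.70), predict_win(b0, b1, 0.30))
```

The fitted coefficients would then be applied unchanged to the validation matches, and the predictions compared with the observed results.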
Here is the resulting confusion matrix for the validation data.
| | Observed loss/draw | Observed win |
|---|---|---|
| Predicted loss/draw | 543 | 283 |
| Predicted win | 165 | 282 |
As you can see, the logistic regression model does not do a very good job of predicting match winners in the validation sample; it has poor sensitivity (282/565=0.50). However, the model does a much better job of correctly identifying losses/draws, which means it has high specificity (543/708=0.77).
Overall, the model classifies (543+282)/1273 = 65% of the validation matches correctly, which is significantly better than chance.
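The rates quoted here can be checked directly from the four cells of the confusion matrix. The short Python sketch below does that arithmetic. It also computes one possible reference point that is an addition, not part of the original analysis: the accuracy of always predicting the more common outcome (loss/draw). The variable names are illustrative.

```python
# Counts copied from the confusion matrix for the validation data
tn = 543  # predicted loss/draw, observed loss/draw
fn = 283  # predicted loss/draw, observed win
fp = 165  # predicted win, observed loss/draw
tp = 282  # predicted win, observed win

total = tn + fn + fp + tp
sensitivity = tp / (tp + fn)    # share of observed wins predicted as wins
specificity = tn / (tn + fp)    # share of observed losses/draws predicted as such
accuracy = (tp + tn) / total    # share of all matches classified correctly

# Reference point (an addition): always predict the more common observed class
majority_baseline = max(tp + fn, tn + fp) / total

print(f"n={total} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
print(f"accuracy={accuracy:.2f} majority_baseline={majority_baseline:.2f}")
```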
So as a predictor (or at least retrodictor) of the outcome of a Scottish top-flight football match, a simple logistic regression with TSR as the sole independent variable performs much better than flipping a coin. In fact, the model is right about 2/3 of the time.
Not bad.
An additional insight from this analysis is the asymmetry in the model's errors: of the matches the home team actually won, TSR pointed to a win only about half the time, whereas of the matches the home team failed to win, TSR pointed to that outcome more than 75% of the time.
This highlights the strong role of randomness in football.