Friday, September 12, 2014

Total Shots Ratio as a predictor of match outcome

In this post, I take a closer look at the granddaddy of advanced soccer stats, James Grayson's Total Shots Ratio (TSR). In particular, I will assess the potential utility of TSR as a predictor of match outcome in Scottish football. 

To do this, I downloaded match results data for the Scottish top-flight (2000-2014) from my usual source, XMLSOCCER.COM, and wrangled it into R using the XML package.

To begin, the plot below shows that there is a positive relationship between TSR and goal difference at the match level, as expected. This relationship is statistically significant with TSR explaining ~20% of the variation in goal difference at the match level.
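For readers who want to see what "explaining ~20% of the variation" means in practice, here is a minimal sketch of the R-squared calculation for a straight-line fit. It is written in Python rather than the R used for the analysis, it assumes the figure comes from a simple linear regression of goal difference on TSR, and the toy values are made up for illustration (they are not the match data).

```python
from statistics import mean

def r_squared(x, y):
    """Share of the variance in y explained by a least-squares line on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # For a one-predictor linear fit, R-squared is the squared correlation.
    return (sxy ** 2) / (sxx * syy)

# Hypothetical toy values only: home TSR and home goal difference per match.
toy_tsr = [0.35, 0.42, 0.50, 0.58, 0.66]
toy_gd = [-1, 1, 0, 0, 2]

print(round(r_squared(toy_tsr, toy_gd), 2))
```

A value near 0.2 on the real data would correspond to the ~20% quoted above.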



Next, I was curious to see how well TSR predicted the outcome of a match (i.e., which team wins). So I built a simple logistic regression model in R using the glm function.

First, I transformed the continuous variable home goal difference into a binary response variable for use in the logistic regression (1 = home win, 0 = home loss or draw). Then I split the sample into separate training (n=1911) and validation (n=1273) data sets.
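The two preparation steps can be sketched as follows. This is a Python illustration, not the R code behind the post. The random shuffle is an assumption, since the post does not say how the matches were assigned to each set, and the function names are hypothetical.

```python
import random

def to_home_win(home_goal_diff):
    """Binary response: 1 = home win, 0 = home loss or draw."""
    return 1 if home_goal_diff > 0 else 0

def split_train_validation(rows, n_train, seed=42):
    """Shuffle a copy of rows, then cut it into training and validation lists.

    A seeded random split is an assumption made for this sketch.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Stand-in row ids for the 1911 + 1273 = 3184 matches mentioned in the post.
train, validation = split_train_validation(range(3184), n_train=1911)
print(len(train), len(validation))  # prints: 1911 1273
```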

The logistic regression was built using the training data, and then applied to the validation data to estimate prediction accuracy.
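To make the fit-then-predict step concrete, here is a rough pure-Python stand-in for a one-variable logistic regression. It is not the glm call used for the post: it fits by plain gradient descent rather than glm's iteratively reweighted least squares, the 0.5 probability cut-off is an assumption, and the toy TSR values and outcomes are invented for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(x, y, lr=0.5, epochs=5000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by batch gradient descent on log-loss."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            err = sigmoid(b0 + b1 * xi) - yi
            g0 += err
            g1 += err * xi
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def predict(b0, b1, x, threshold=0.5):
    """Predict 1 (home win) when the fitted probability reaches the threshold."""
    return [1 if sigmoid(b0 + b1 * xi) >= threshold else 0 for xi in x]

# Hypothetical training data: home TSR and whether the home team won.
toy_tsr = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75]
toy_win = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(toy_tsr, toy_win)
# Apply the fitted model to unseen (validation-style) TSR values.
print(predict(b0, b1, [0.30, 0.75]))
```

The coefficients are learned from the training rows only, and the validation rows are touched only at prediction time, which is what keeps the accuracy estimate honest.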

Here is the resulting confusion matrix for the validation data.

                      Observed loss/draw   Observed win
Predicted loss/draw   543                  283
Predicted win         165                  282

As you can see, the logistic regression model does not do a very good job of predicting match winners in the validation sample; it has poor sensitivity (282/565=0.50). However, the model does a much better job of correctly identifying losses/draws, which means it has high specificity (543/708=0.77).

Overall, the model classifies (543 + 282)/1273 = 65% of the validation matches correctly, which is significantly better than chance.
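The arithmetic behind these figures can be checked with a few lines of Python (used here instead of R). The counts are the ones from the confusion matrix above. The "always predict loss/draw" baseline is an extra comparison added in this sketch and is not part of the original analysis.

```python
# Counts from the validation confusion matrix in this post.
tn, fn = 543, 283  # predicted loss/draw: observed loss/draw, observed win
fp, tp = 165, 282  # predicted win:       observed loss/draw, observed win

total = tn + fn + fp + tp
sensitivity = tp / (tp + fn)   # share of observed wins that were predicted
specificity = tn / (tn + fp)   # share of observed losses/draws that were predicted
accuracy = (tp + tn) / total
baseline = (tn + fp) / total   # accuracy of always predicting loss/draw

print(f"{sensitivity:.2f} {specificity:.2f} {accuracy:.2f} {baseline:.2f}")
# prints: 0.50 0.77 0.65 0.56
```

The baseline is a useful second yardstick because home wins make up fewer than half of the validation matches, so a coin flip is not the only naive strategy worth beating.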

So as a predictor (or at least retrodictor) of the outcome of a Scottish top-flight football match, a simple logistic regression with TSR as the sole independent variable performs much better than flipping a coin. In fact, the model is right about 2/3 of the time.

Not bad.

An additional insight from this analysis is that home teams that won had a relatively high TSR (high enough for the model to predict a win) only about half the time, whereas home teams that failed to win had a relatively low TSR more than 75% of the time.

This highlights the strong role of randomness in football.

Monday, September 1, 2014

The best of the rest in Scotland 2000-2014, Part II: Home teams

Last week I looked at the away performances of Scottish top-flight clubs since 2000, which revealed some surprising results.

This week I will do the same analysis, but for home teams. As before, I downloaded the data from XMLSOCCER.COM's free demo API and wrangled it using the XML package in R. All subsequent analyses were also done in R.

I focused on two measures of performance: average home goal difference (GD) and average Total Shots Ratio (TSR) per match. The former reflects both points earned and the quality of the win, while the latter measures the degree to which a team controls the ball (in a subsequent post, I will show that these two variables are correlated at the match level).

As with my previous analysis, I hypothesized that one of the bigger non-Old-Firm clubs would be the "best of the rest," for example Hearts, Hibs, Aberdeen, Dundee United, or Motherwell.

As you can see in the bar plot below, my hypothesis was confirmed as the Edinburgh club Hearts has the 3rd best average home GD, followed by Aberdeen and Hibs.



A similar pattern can be seen in the bar plot of average TSR below; Hearts is the best of the rest again.


Unlike my analysis of away team performances, the current analysis did not reveal any major surprises. Hearts, Hibs and Aberdeen are all relatively big clubs by Scottish standards, and one would expect a bigger club to have a better record over time than a smaller club with fewer resources.

However, in my previous analysis of away teams, Inverness Caledonian Thistle (ICT) and Falkirk were the best of the rest with regard to average GD and average TSR respectively, despite both clubs being relatively small by Scottish standards.

It is important to note that Hearts was near the top of the heap in the away team analyses too. Their best-of-the-rest rankings both home and away are as follows.


Best of the Rest Rankings for Hearts (2000-2014)

Measure                            Rank
Average Home Goal Difference       1st
Average Away Goal Difference       2nd
Average Total Shots Ratio (Home)   1st
Average Total Shots Ratio (Away)   2nd


Thus, I think it's safe to say that between 2000 and 2014, Hearts was, on the whole, the best team in Scotland, outside of the Old Firm.

Too bad they got relegated...

Wednesday, August 27, 2014

The best of the rest in Scotland 2000-2014, Part I: Away teams

Typically when we think of "big" clubs in Scotland other than the Old Firm, the usual suspects are Hearts, Hibs, Aberdeen, Dundee United, and Motherwell. So it would be reasonable to hypothesize that one of these clubs would have the 3rd best away record in the top-flight over time.

To investigate this hypothesis, I grabbed some free SPL match results data from the demo version of the XMLSOCCER.COM API and wrangled it into R using the XML package. The dataset contains > 3100 matches going back to the 2000-2001 season, and dozens of match-related variables.

By the way, how lucky am I to be a Scottish football fan and data enthusiast? Just ask XMLSOCCER.COM's website.
"If you for some odd reason is [sic] only interested in the Scottish League anyway, lucky you - you just found a great API for FREE!"
Jackpot.

I used the variables AwayGoals and HomeGoals to calculate the average goal difference for each away team and created a simple bar plot to visualize the results.
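As a rough illustration of that calculation, here is a short Python sketch (the analysis itself was done in R). AwayGoals and HomeGoals are the variable names from the dataset. The AwayTeam key, the function name, and the toy rows with placeholder team names are assumptions made for this example.

```python
from collections import defaultdict

def average_away_goal_difference(matches):
    """Mean of (AwayGoals - HomeGoals) for each away team."""
    totals = defaultdict(lambda: [0, 0])  # team -> [sum of GD, match count]
    for m in matches:
        gd = m["AwayGoals"] - m["HomeGoals"]
        totals[m["AwayTeam"]][0] += gd
        totals[m["AwayTeam"]][1] += 1
    return {team: gd_sum / n for team, (gd_sum, n) in totals.items()}

# Hypothetical rows, not real results.
toy_matches = [
    {"AwayTeam": "Team A", "HomeGoals": 1, "AwayGoals": 2},
    {"AwayTeam": "Team A", "HomeGoals": 3, "AwayGoals": 0},
    {"AwayTeam": "Team B", "HomeGoals": 1, "AwayGoals": 1},
]

print(average_away_goal_difference(toy_matches))
# prints: {'Team A': -1.0, 'Team B': 0.0}
```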

Goal difference is a better measure of team performance than points earned because it takes into account the quality of the win or loss. For example, a 1-0 loss suggests a close match, while a 6-1 loss suggests a pummeling. This is an important qualitative difference not captured by simply using points earned as a metric, since in both cases the losing side would get zero points.

As expected, Rangers and Celtic are the only Scottish clubs to have positive average goal differences away from home since 2000, because they tended to win most of their away matches, typically by a relatively large margin.

But the club in 3rd place may surprise you.



It's Inverness Caledonian Thistle (ICT), a "small" club from the Scottish Highlands. The Edinburgh club Hearts is a close 4th.

This is a remarkable observation, not just because ICT has fewer resources than bigger clubs such as Hearts (historically speaking), but because the median driving distance for ICT to away locations is 156 miles!

The histogram below shows the frequency of away locations for ICT within different distance "bins" (based on Google Maps). As you can see, most of the locations the highlanders would have to travel to are > 100 miles away. Only Ross County travels further, but the Dingwall club's away record is worse than that of their Highland rivals.



This result calls into question the commonly held notion in football that traveling far for a match puts the away team at a disadvantage. By this logic ICT should have one of the worst away records in the SPL. On the contrary, the highlanders are in fact the best of the rest when it comes to away goal difference.

What if we look at Total Shots Ratio (TSR) instead? TSR is probably the best measure of what a team does on the pitch. According to TSR pioneer James Grayson, "the higher a team's TSR the more they control the ball." This simple index is calculated as follows.

TSR = Total shots for/(Total shots for + Total shots against)
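The formula translates directly into code. Here is a minimal Python sketch (the analysis itself used R). The function name is hypothetical, and returning None when neither side takes a shot is a choice made for this example, since the formula is undefined in that case.

```python
def total_shots_ratio(shots_for, shots_against):
    """TSR = shots for / (shots for + shots against)."""
    total = shots_for + shots_against
    if total == 0:
        return None  # undefined when there were no shots at all
    return shots_for / total

# A team that takes 12 shots and concedes 8 has a TSR of 0.6.
print(total_shots_ratio(12, 8))  # prints: 0.6
```

A TSR of 0.5 means the two sides took the same number of shots, and values above 0.5 mean the team outshot its opponent.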

Take a look at the bar plot below. Again, third place is a surprise.

Hello Falkirk!

The Bairns had a decent run in the top flight between 2004 and 2010. But despite having a relatively high average TSR away from home during this period, Falkirk was relegated to what is now the Scottish Championship.

Of course, this calls into question the predictive value of TSR...but that's a post for another time.