April 14, 2009

A few different predictions

I've collected a lot of data over the years. I now have results from 244 playoff games (3 seasons). Using past results, I looked for factors that resulted in wins (of individual games, not the series). Using the data, I created two different models (EQ1 & EQ2). I won't be changing the predictions that are included with my series summary images (as the sample size for that model was much larger); this is just for those who are interested.


EQ1 uses both "expected goals" and "actual goals" to predict playoff results
EQ2 uses only a team's "expected goals" to predict results.
The equations can be found on the bottom of this page.

Two teams that might be happy with the table above are Montreal and Calgary: by the second equation, both teams have a much better chance than many will give them credit for. (Boston has done really well this season, largely due to amazing goaltending and outperforming their expected scoring rates.) Calgary's expected goal differential is much better than Chicago's.
Notation: (all in #/game)
E = Expected
G = Goals
H = Home
A = Away
F = for (last letter)
A = against (last letter)

To predict the expected scoring rates, I combine the home team's scoring rates with the away team's goals-against rates (as per Alan Ryder's Poisson Toolbox, p. 28).
Predicted "goals for" per game in a game where the home team scores at GFH and the away team allows GAA (2.8 = average goals/game):
GF = GFH*GAA/2.8
and GA:
GA = GAH*GFA/2.8
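
As a rough illustration of the rate combination above, here is a minimal Python sketch. The function name `combine_rates` and the sample rates are made up for the example; only the formula and the 2.8 league average come from the post.

```python
LEAGUE_AVG = 2.8  # average goals per team per game, as quoted above


def combine_rates(rate_for, rate_against, league_avg=LEAGUE_AVG):
    """Predicted goals/game when a team scoring at rate_for
    meets an opponent allowing rate_against."""
    return rate_for * rate_against / league_avg


# Hypothetical inputs (not real team data), all in goals per game.
GFH, GAH = 3.1, 2.6  # home team: goals for, goals against
GFA, GAA = 2.7, 3.0  # away team: goals for, goals against

GF = combine_rates(GFH, GAA)  # home team's predicted goals for
GA = combine_rates(GAH, GFA)  # home team's predicted goals against
print(f"GF = {GF:.2f}, GA = {GA:.2f}")
```

Note that two exactly average teams (2.8 for, 2.8 against) come out at 2.8 predicted goals each, which is a quick sanity check on the scaling.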

Similarly for Expected goals:
EGF = EGFH*EGAA/2.8
and EGA:
EGA = EGAH*EGFA/2.8

Using the above numbers and combining them with a Pythagorean expectation:
EG_pct = EGF*EGF/(EGF*EGF+EGA*EGA)
G_pct = GF*GF/(GF*GF+GA*GA)
which can be used in equation 1 & 2 below:

Equation 2 effectively ignores the effects of goaltending, due to sample size (goaltending effects are not huge in comparison to luck effects...). The G_pct factor in EQ1 was not statistically significant (at about 80%).
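
The Pythagorean percentages that feed EQ1 and EQ2 can be sketched in Python as follows. This is only an illustration: the function name `pythagorean_pct`, the `exponent` parameter, and the sample rates are assumptions for the example, and the sketch stops short of EQ1 and EQ2 themselves.

```python
def pythagorean_pct(goals_for, goals_against, exponent=2):
    """Pythagorean expectation: GF^k / (GF^k + GA^k).

    The post squares both terms, so k defaults to 2.
    """
    gf = goals_for ** exponent
    ga = goals_against ** exponent
    return gf / (gf + ga)


# Hypothetical per-game rates for the home team in one matchup.
GF, GA = 3.2, 2.6    # predicted actual goals for / against
EGF, EGA = 2.9, 2.8  # predicted expected goals for / against

G_pct = pythagorean_pct(GF, GA)     # based on actual goals
EG_pct = pythagorean_pct(EGF, EGA)  # based on expected goals
print(f"G_pct = {G_pct:.3f}, EG_pct = {EG_pct:.3f}")
```

A team with equal goals for and against comes out at exactly 0.5, and the value rises toward 1 as goals for outpaces goals against.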

If all the math is giving you a headache then you can enjoy the compiled figures in the table above and ignore the explanation.


JLikens said...

Excellent stuff.

One question:

Which of the two factors has more predictive value in terms of future results? In other words, which factor better fits the data?

JavaGeek said...

EQ1 (two variables) has more variables and therefore produced better results than EQ2 (one variable).

As noted in the post, the G_pct factor was not statistically significant.

I wish I understood logistic regressions better, but
EQ1 was 62.2% Concordant & 36.2% Discordant (remainder are ties)
Somers' D = 0.26
EQ2 was 60.8% Concordant & 37.8% Discordant (remainder are ties)
Somers' D = 0.23

Somers' D: Somers' D is used to determine the strength and direction of the relation between pairs of variables. Its values range from -1.0 (all pairs disagree) to 1.0 (all pairs agree). It is defined as (nc-nd)/t, where nc is the number of pairs that are concordant, nd is the number of pairs that are discordant, and t is the total number of pairs with different responses. In our example, it equals the difference between the percent concordant and the percent discordant divided by 100: (85.6-14.2)/100 = 0.714. The same arithmetic reproduces the values above: (62.2-36.2)/100 = 0.26 for EQ1 and (60.8-37.8)/100 = 0.23 for EQ2.
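
For anyone who wants to see the pair counting spelled out, here is a small Python sketch of the definition. The function names and the six sample games are hypothetical. It treats only exactly equal probabilities as ties, whereas a statistics package may also count near-equal predictions as tied, so its percentages can differ slightly from packaged output.

```python
def pair_percentages(probs, outcomes):
    """Percent concordant and discordant over all (win, loss) pairs.

    probs: predicted win probabilities; outcomes: 1 = win, 0 = loss.
    A pair is concordant when the win had the higher predicted probability.
    """
    wins = [p for p, y in zip(probs, outcomes) if y == 1]
    losses = [p for p, y in zip(probs, outcomes) if y == 0]
    total = len(wins) * len(losses)  # pairs with different responses
    concordant = sum(1 for w in wins for l in losses if w > l)
    discordant = sum(1 for w in wins for l in losses if w < l)
    return 100.0 * concordant / total, 100.0 * discordant / total


def somers_d(pct_concordant, pct_discordant):
    """Somers' D from percentages: (concordant - discordant) / 100."""
    return (pct_concordant - pct_discordant) / 100.0


# Hypothetical predictions and results for six games.
probs = [0.70, 0.60, 0.55, 0.40, 0.60, 0.30]
outcomes = [1, 1, 0, 1, 0, 0]

pct_c, pct_d = pair_percentages(probs, outcomes)
print(f"{pct_c:.1f}% concordant, {pct_d:.1f}% discordant, D = {somers_d(pct_c, pct_d):.2f}")

# The figures reported for the two models:
print(f"EQ1: {somers_d(62.2, 36.2):.2f}  EQ2: {somers_d(60.8, 37.8):.2f}")
```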