October 19, 2006


One thing that’s always fun to do is predict hockey games. HockeyAnalysis.com has developed a neat little prediction pool. Predicting from the first few games is hard: you get a lot of error, and teams can get unlucky with their early opposition, which makes the data quite unreliable. That being said, there are a number of ways of looking at the present data to figure out who’s the best team. One could use a ranking system like the Sagarin Ratings. I want to keep this simple, so I’ll look only at the standard Pythagorean prediction: GF²/(GF² + GA²).
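As a quick illustration, the Pythagorean prediction can be sketched in a few lines of Python. The function name and the sample team are made up for the example; only the formula comes from the text.

```python
def pythagorean_win_pct(goals_for, goals_against, exponent=2.0):
    """Expected winning percentage from goals for and goals against."""
    gf = goals_for ** exponent
    ga = goals_against ** exponent
    return gf / (gf + ga)


# Hypothetical team: 3.0 goals for and 2.5 goals against per game.
print(round(pythagorean_win_pct(3.0, 2.5), 3))  # 0.59
```

A team that scores exactly as much as it allows comes out at 0.5, which is the "average team" case used for the error estimates later on.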

I’m not 100% satisfied with goals for, because it’s not all controlled by the one team: just because you didn’t get goals doesn’t mean you shouldn’t have gotten goals. A goal in Buffalo’s 9-goal game against Philadelphia shouldn’t be worth the same as a goal against Anaheim. There are a number of options, but what I like to look at is expected goals for, as it factors out the opponent’s goalie (but not their defense). Expected goals are not always the best indicator of offense, because there are a few players (a marginal number) who can outscore their expectations on a consistent basis. However, it is much closer to offense than plain goals for, so I’ll use that as my goals for.

I should qualify my use of these expected goals, as these are only statistical estimates of goals and not actual goals. In fact, if you compare these numbers to actual goals, you’ll see that they match the expected error of ±15 (95% confidence interval: [-30, 30]). This doesn’t mean that expected goals for have no error themselves, but it should indicate that the error is smaller than the error with goals for, and that’s why it’s a useful measure. The expected error for expected goals for per game is: √(3600*SFPG/3600*(1-SFPG/3600))*0.092 = 0.5 goals per game.

Defense is much more complicated to measure accurately. As mentioned above, if a 9-1 game occurs early in the season, the losing team’s defensive statistics are heavily affected by those 9 goals against, and the team will likely appear worse than it really is. One cannot look at expected goals against, because that ignores the quality of the team’s own goaltending. So I’m stuck using standard goals against. As above, you can estimate the error as √(3600*GAPG/3600*(1-GAPG/3600)) = 1.58 goals per game (remember that there are around 2.5 goals per game, so that’s 63% of goals scored). Note that expected goals for are about three times more accurate than actual goals.
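The two per-game error estimates can be checked with a minimal sketch. It assumes that SFPG stands for shots for per game and takes a value near 30; the text gives no number for it, so that figure is a guess chosen for illustration. The 3600 trials and the 0.092 factor are taken from the formulas as written.

```python
import math

SECONDS_PER_GAME = 3600  # the formulas treat each second as one binomial trial


def per_game_std(events_per_game, trials=SECONDS_PER_GAME):
    """Binomial standard deviation of an event count over one game."""
    p = events_per_game / trials
    return math.sqrt(trials * p * (1.0 - p))


SHOTS_FOR_PER_GAME = 30.0     # assumption: SFPG has no stated value in the text
SCALE = 0.092                 # factor taken from the expected-goals formula
GOALS_AGAINST_PER_GAME = 2.5  # average quoted in the text

xgf_error = per_game_std(SHOTS_FOR_PER_GAME) * SCALE
ga_error = per_game_std(GOALS_AGAINST_PER_GAME)

print(f"expected goals for error: {xgf_error:.2f} goals per game")
print(f"goals against error:      {ga_error:.2f} goals per game")
print(f"goals against error as a share of 2.5 goals: {ga_error / 2.5:.0%}")
```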

You can use the usual rules for combining values that carry a standard deviation (one rule for addition and subtraction, another for multiplication and division) to get the error for an average team (50%):

GF²: √(2*(0.5/2.5)²)*2.5² = 1.76
GA²: √(2*(1.58/2.5)²)*2.5² = 5.58
GF² + GA²: √(1.76² + 5.58²) = 6.12 (or 12 ± 6)
GF²/(GF² + GA²): √([6.12/(2*2.5²)]² + [1.76/(2.5²)]²)*0.5 = 0.36
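The propagation rules used in these four lines can be sketched as two small helpers. The helper names are made up for the example. The script evaluates the formulas exactly as written, without intermediate rounding, so its output need not match the rounded figures quoted above digit for digit.

```python
import math


def product_rel_error(*rel_errors):
    """Relative error of a product or quotient: relative errors add in quadrature."""
    return math.sqrt(sum(r * r for r in rel_errors))


def sum_abs_error(*abs_errors):
    """Absolute error of a sum or difference: absolute errors add in quadrature."""
    return math.sqrt(sum(e * e for e in abs_errors))


G = 2.5        # average goals per game, from the text
GF_ERR = 0.5   # per-game error of expected goals for
GA_ERR = 1.58  # per-game error of goals against

# Each square is treated as a product of two terms, as in the formulas above.
gf2_err = product_rel_error(GF_ERR / G, GF_ERR / G) * G**2
ga2_err = product_rel_error(GA_ERR / G, GA_ERR / G) * G**2

# Denominator GF^2 + GA^2, then the ratio for an average (50%) team.
denom_err = sum_abs_error(gf2_err, ga2_err)
win_pct_err = product_rel_error(denom_err / (2 * G**2), gf2_err / G**2) * 0.5

for name, value in [
    ("GF^2", gf2_err),
    ("GA^2", ga2_err),
    ("GF^2 + GA^2", denom_err),
    ("GF^2 / (GF^2 + GA^2)", win_pct_err),
]:
    print(f"{name}: {value:.2f}")
```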

Now, in order to apply this error estimate to higher winning percentages (I want to keep the error linear), I need a method that has the above error but never predicts that a team will win more than 100% of the time or less than 0% of the time. This can be seen in the little graph.

Now, I’m not going to show this, but binomial error decreases as √(n)/n (or 1/√(n)), where n is the number of games played. The same approximations as above can be made, and you get a winning percentage error that decreases as teams play more games. So I now have an error estimate as well as the winning percentages. Using the Poisson toolbox, I can get a prediction for every game based on goals for and against, and then I can apply the above error rules to distribute the errors appropriately.
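To make the two ideas in this paragraph concrete, here is a minimal sketch. It is not the Poisson toolbox itself: it assumes each team's goals follow an independent Poisson distribution with a given scoring rate, and it splits ties evenly as a stand-in for overtime. Both choices, and all function names, are assumptions made for the example.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals at the given scoring rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def win_probability(rate_a, rate_b, max_goals=20):
    """P(team A outscores team B); ties are split evenly (assumption)."""
    pmf_a = [poisson_pmf(k, rate_a) for k in range(max_goals + 1)]
    pmf_b = [poisson_pmf(k, rate_b) for k in range(max_goals + 1)]
    win = sum(pmf_a[i] * pmf_b[j] for i in range(max_goals + 1) for j in range(i))
    tie = sum(pmf_a[k] * pmf_b[k] for k in range(max_goals + 1))
    return win + 0.5 * tie


def error_after_games(one_game_error, games_played):
    """Scale a one-game error by 1/sqrt(n), as described in the text."""
    return one_game_error / math.sqrt(games_played)


# Hypothetical matchup: team A expected to score 3.0, team B 2.5.
print(f"win probability: {win_probability(3.0, 2.5):.3f}")
print(f"error after 9 games: {error_after_games(0.36, 9):.2f}")  # 0.12
```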

Note that all these calculations assume that the teams do not change. That is in fact not true, because of injuries and trades, but it’s the best estimate available.

Now that you know how I’m determining who wins and loses each game, I can explain how I came up with my predictions. In order to get an average, I need to cycle through the calculations below 10,000 times (which takes 45 minutes). I could do more, but I calculated the error to be around 0.5%, or probably around 1 point, so 10,000 is a good balance between time and accuracy. I want to know 3 things: expected points, probability of making the playoffs, and probability of winning the division.
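One way to arrive at an error of that size is the standard error of a probability estimated from repeated runs. This is a sketch of that textbook formula, not necessarily the calculation used for the figure above; the function name is made up for the example.

```python
import math


def monte_carlo_std_error(p, n_runs):
    """Standard error of a probability estimated from n_runs independent runs."""
    return math.sqrt(p * (1.0 - p) / n_runs)


# Worst case is p = 0.5; 10,000 is the run count quoted in the text.
print(f"{monte_carlo_std_error(0.5, 10_000):.1%}")  # 0.5%
```

Because the error shrinks only with the square root of the run count, halving it would take four times as many runs.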

First, I calculate a random value for how much better or worse a team could be, using the normal distribution and using the number of games played as a scale factor to lower the error to appropriate values. In order to calculate the error for an individual game, I scale the linear errors from the teams to a non-linear win prediction error (min: 0, max: 1), following the concept of the little graph above. Games with a high probability of being won are more likely to be shifted down a little, and similarly, games with a low probability of being won are shifted up a little. However, most games (between 30% and 70%) are not affected at all by these adjustments, as they are very close to linear. It should be clear that I look at every game individually, so each team’s schedule difficulty is included in this estimate.
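The loop described above can be sketched roughly as follows. Several pieces are assumptions made for illustration: the logistic curve standing in for the little graph (it has slope 1 at 50% and flattens toward 0 and 1), the rule of 2 points per win with no overtime point, and every function and parameter name.

```python
import math
import random


def soft_clip(x):
    """Squash a linear win estimate into (0, 1); nearly linear around 0.5 (assumed curve)."""
    return 1.0 / (1.0 + math.exp(-4.0 * (x - 0.5)))


def simulate_points(game_win_probs, one_game_error, games_played, n_runs=10_000, seed=0):
    """Average points over n_runs simulated runs of the remaining schedule.

    game_win_probs: baseline win probability for each remaining game
    one_game_error: winning percentage error before scaling by games played
    """
    rng = random.Random(seed)
    sigma = one_game_error / math.sqrt(max(games_played, 1))
    total_points = 0
    for _ in range(n_runs):
        # How much better or worse the team might really be in this run.
        shift = rng.gauss(0.0, sigma)
        for p in game_win_probs:
            if rng.random() < soft_clip(p + shift):
                total_points += 2  # assumption: 2 points per win, no overtime point
    return total_points / n_runs


# Hypothetical remaining schedule of 10 games, each with its own baseline probability.
schedule = [0.45, 0.55, 0.60, 0.40, 0.50, 0.65, 0.35, 0.50, 0.58, 0.42]
print(f"expected points: {simulate_points(schedule, 0.36, games_played=6):.1f}")
```

Playoff and division probabilities would come from the same loop by recording, in each run, where the team finishes relative to the others, which requires simulating every team together rather than one at a time.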

So no one cares how I get there, but I now have the predictions for making the playoffs and winning the division that many people remember from the end of last season. Here are the results for the west and for the east. You can see there are still a number of teams that are over- or underestimated, and that some teams have performed so badly that this system never predicts them in the playoffs.
