Friday, November 9, 2012

Math works

For those into statistics enough to truly enjoy the movie Moneyball, the performance of Nate Silver during this latest presidential campaign was especially fun to watch. Silver got his game together analyzing baseball, so the idea that his polling analysis was merely Moneyball applied to politics is not in the least far-fetched. Here a guy named Bob O'Hara describes just how Silver got it so right. The whole article is pretty involved, but then statistical analysis is not for the faint of heart.

How did Nate Silver predict the US election?

A blogger called Nate Silver accurately called the outcome of the election. Biostatistician Bob O'Hara thinks he knows how

One of the surprises of the American presidential election was the attacks from the Republican side. Not that they were attacking Obama (hey, unless the airwaves were full of attack ads from both sides, how would we know there was an election on?), but rather that they were attacking a statistician, Nate Silver. But now Mr Silver is having the last laugh, having predicted every state correctly even as most media were saying that the race was tied (or that it may possibly be drifting ever so slightly in Obama's favour). But how did Mr Silver predict the presidential race so accurately? What was this dark magic that he used?

For the Nate-haters, here’s the 538 prediction and actual results side by side

Now, I don't have any inside knowledge about Nate Silver's method, but an outline of the approach is fairly easy to guess at, since this is similar to the methods used by votamatic. It is also the same approach that has become widely used in statistics over the last 20 years: I have used similar ideas to look at scientific problems like divergent natural selection and cycling voles. So, although some of my outline is probably wrong (and I've simplified some of the process in my explanation for clarity's sake), I hope my discussion gives you a feel for the types of statistical models used and how they work.

The problem – choosing the US president – is a national one, but it involves voting at the state level (residents in each state vote for the candidate they support, and the winning candidate gets all of the state's electoral votes). The polls are also arranged at both state and national level, so one way or another both need to be taken into account. This makes the problem inherently hierarchical, and rather conveniently there is an area of statistics called hierarchical modelling.
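To give a feel for what "hierarchical" buys you, here is a minimal Python sketch of partial pooling, where a state's poll average is pulled toward the national figure when the state has little polling of its own. This is not Silver's method; the function name, the `prior_strength` knob and all the numbers are invented for illustration.

```python
def partial_pool(state_mean, state_n, national_mean, prior_strength):
    """Shrink a state's poll average toward the national average.

    prior_strength behaves like a count of pseudo-respondents who
    report the national figure. It is a made-up tuning knob for this
    sketch, not a parameter of any real forecasting model.
    """
    weight = state_n / (state_n + prior_strength)
    return weight * state_mean + (1 - weight) * national_mean


# Illustrative numbers only: both states poll at 54% for Obama,
# but one has far more respondents behind that figure.
well_polled = partial_pool(0.54, state_n=5000, national_mean=0.50, prior_strength=500)
thinly_polled = partial_pool(0.54, state_n=100, national_mean=0.50, prior_strength=500)

print(f"well polled state:   {well_polled:.4f}")
print(f"thinly polled state: {thinly_polled:.4f}")
```

The heavily polled state stays close to its own 54%, while the thinly polled one ends up much nearer the national 50%. A full hierarchical model would estimate the amount of shrinkage from the data rather than fixing it by hand.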

It is also worth splitting the model into two parts: the process (i.e. the percentage of the population who intend to vote for Obama), and the sampling (how the polls are affected by the actual voting intention, and other factors). The mathematics (which I will not discuss in detail) allows us to do this, and the separation of the model nicely reflects the separation of the processes that create the data we see.

Basically, we are trying to model an unobserved variable, i.e. the actual intended voting behaviour in each state. This unobserved variable is then used to predict the actual vote, which we do observe.
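The split between process and sampling can be sketched in a few lines of Python. Here the "process" is a single hidden number (the true share intending to vote for Obama), and the "sampling" is a set of simulated polls drawn from it. Everything in this sketch is an assumption made for illustration: the true share of 0.52, the poll sizes, and the idea that respondents are independent with no house effects or non-response.

```python
import random


def simulate_poll(true_share, sample_size, rng):
    """Sampling part: each respondent independently backs Obama
    with probability true_share. Returns (supporters, sample_size)."""
    supporters = sum(rng.random() < true_share for _ in range(sample_size))
    return supporters, sample_size


def pooled_estimate(polls):
    """Estimate the unobserved share by pooling respondents across polls."""
    total_supporters = sum(s for s, _ in polls)
    total_respondents = sum(n for _, n in polls)
    return total_supporters / total_respondents


rng = random.Random(2012)  # fixed seed so the sketch is repeatable
TRUE_SHARE = 0.52          # process part: invented for this sketch

polls = [simulate_poll(TRUE_SHARE, 1000, rng) for _ in range(10)]
estimate = pooled_estimate(polls)

print("individual polls:", [round(s / n, 3) for s, n in polls])
print("pooled estimate: ", round(estimate, 3))
```

Individual polls scatter around the hidden value, but pooling them gets closer to it. In a real analysis the true share is never seen; only the polls are, and the model works backwards from them.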

It's also worth noting that although we are ultimately interested in how people will vote on election day, the data we get is based on how people think they will vote at the time they are asked, which may be months before the election. What people think changes over time, so this variable has to be incorporated into this model. This means we have to include a temporal component: in short, we must generate a time series.
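One simple way to picture such a time series is a random walk: voting intention today is yesterday's value plus a small random nudge. The Python sketch below does this on the logit scale so that the share always stays between 0 and 1. The random walk, the starting share, the 100-day horizon and the size of the daily step are all assumptions chosen for illustration, not details taken from any real forecasting model.

```python
import math
import random


def logit(p):
    """Map a share in (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Map a real number back to a share in (0, 1)."""
    return 1 / (1 + math.exp(-x))


def simulate_support(start_share, days, daily_sd, rng):
    """Random walk on the logit scale: each day's value is the
    previous day's value plus Gaussian noise with sd daily_sd."""
    x = logit(start_share)
    path = [start_share]
    for _ in range(days):
        x += rng.gauss(0, daily_sd)
        path.append(inv_logit(x))
    return path


rng = random.Random(538)  # fixed seed so the sketch is repeatable
path = simulate_support(0.50, days=100, daily_sd=0.02, rng=rng)

print("day 0:  ", round(path[0], 3))
print("day 100:", round(path[-1], 3))
```

The further ahead of election day a poll is taken, the more room the walk has to wander, which is one reason early polls say less about the final result than late ones.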

To make things simple for this discussion, I am ignoring third party candidates, so in this model, only Obama and Romney are in the race, and whoever gets more than 50% of the vote in a state wins.
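Under that simplification, turning state-level vote shares into a national result is just bookkeeping, as this short Python sketch shows. The three states, their vote shares and their electoral vote counts are entirely hypothetical, and leaving an exact 50% tie unallocated is a choice made for the sketch.

```python
def tally_electoral_votes(obama_shares, electoral_votes):
    """Winner-take-all tally in a two-candidate race.

    obama_shares maps state -> Obama's share of the vote;
    electoral_votes maps state -> electoral votes at stake.
    A state at exactly 50% is left unallocated in this sketch.
    """
    obama_total = 0
    romney_total = 0
    for state, share in obama_shares.items():
        if share > 0.5:
            obama_total += electoral_votes[state]
        elif share < 0.5:
            romney_total += electoral_votes[state]
    return obama_total, romney_total


# Hypothetical states and numbers, for illustration only.
shares = {"State A": 0.53, "State B": 0.48, "State C": 0.51}
votes = {"State A": 10, "State B": 20, "State C": 5}

totals = tally_electoral_votes(shares, votes)
print("Obama, Romney:", totals)
```

Note that a candidate can win more states and still lose the tally, because a narrow win in a large state carries all of that state's electoral votes.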
