Nate Silver is getting hammered by the right for daring to use math-based math rather than faith-based math. In his defense, I thought it might not hurt to take a quick review, for non-math types, of the mathy side of things.
This will be old hat for some of you. No quiz, I promise. More after the orange ampersands' act of public indecency.
Update: Uh-oh, Community Spotlight! Now I have to fix my typos!
Suppose you've got a cubic yard of M&Ms of various colors, and you want to know how many are blue. The only way to know for sure is to count them all, kinda like that study of the four thousand potholes in Blackburn, Lancashire that John Lennon mentions.
But you can make a statistical assumption - that if you look at a smaller sample, and count the number of blue M&Ms in that, you can derive an estimate of the total, as long as you're willing to allow for some wiggle room. In a sense, the statistics of polling is all about defining what "wiggle room" means.
That wiggle room comes in two flavors: margin of error, and level of confidence. Here's exactly what that means.
AP recently issued the results of a poll showing that 18% of Americans think Obama is a Muslim. Up at the top of the first page, in tiny print, we're told: the sample size is 1037, the margin of error is ±3.04 percentage points, and the confidence is 95%.
What that means, in English, is this: "We can't ask everybody, but we asked enough people (1037) that, according to the standard mathematics of sampling, we can conclude that there is a 95% chance that the actual number we'd get if we did ask everybody would be somewhere between 18-3.04 and 18+3.04 percent (that is, between 14.96% and 21.04%)."
These three numbers -- sample size, margin of error, and level of confidence -- are all related, but in a way that requires calculus rather than algebra to define, so I won't drag you through it. The basic trends, though, are what you'd expect: increase the sample size, and you lower the margin of error; to increase the level of confidence without widening the margin of error, you need to increase the sample size; given the margin of error you want, you can calculate the sample size you need; geekery geekery geekery abounds.
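For the curious, here is a minimal sketch of how the three numbers fit together. It assumes a simple random sample and the worst-case split of 50/50 (both are my assumptions, not something the AP fine print states), and it lets one library call, NormalDist().inv_cdf, do the calculus part. The function names are made up for this example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def margin_of_error(sample_size, confidence=0.95, proportion=0.5):
    """Margin of error (as a fraction) for a simple random sample.

    proportion=0.5 is the worst case, which gives the widest margin.
    """
    # z is how many standard deviations cover the chosen confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(proportion * (1 - proportion) / sample_size)


def sample_size_needed(margin, confidence=0.95, proportion=0.5):
    """Smallest sample size that achieves the requested margin of error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z ** 2 * proportion * (1 - proportion) / margin ** 2)


# The AP poll's sample size, at the conventional 95% confidence
print(round(100 * margin_of_error(1037), 2))
# How many people to ask for a margin of 3 points
print(sample_size_needed(0.03))
```

Under those assumptions, a sample of 1037 comes out at about 3.04 points, which matches the tiny print on the AP poll.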
These three numbers are the Patty, Maxene, and LaVerne of polling; each only makes sense in the context provided by the other two. But there is also a convention that says, if a confidence level is not mentioned, it's the default value of 95%.
What 95% also means is that even the bestest of the bestest poll is going to get it wrong one time out of twenty. Even the best poll throws a piston every now and then. It's the price we pay for not having to count all the M&Ms. And it's also the reason statisticians tell you never to take any single poll as The Truth, because you might be looking at one of those thrown pistons.
We've got a built-in assumption here that the sample we're counting really reflects the whole. And this is where polling methodology matters. It's easy to, say, shake up a cubic yard of M&Ms (if you have ready access to a forklift), and then grab a sample. With people, it's harder. How do you make sure your sample really does match your general population?
This is where variables -- human pollster versus robot, land lines versus cell phones, time of day, etc. etc. -- come into play. Generally, the sample population isn't going to match the general population exactly, and polling firms have to apply corrections, based on their own assumptions of what the general population is actually like. Different polling firms make different assumptions.
Those assumptions, along with the variables I mentioned above, give each polling firm a "house effect" -- that is, a bias. If you know that, for example, Rasmussen is always going to lean two percent more Republican than other polls, you can do one of two things. You can decide to chuck Rasmussen results altogether, or -- Nate's approach -- you can estimate the house bias and subtract it for any given poll result. The same holds true for left-leaning polls.
So when Nate figures out, on any given day, how things are in, say, Ohio, he's not averaging the direct poll results, as the Republican talking points suggest he is; he's calculating the results based on the corrected poll results, after the house effect has been estimated and removed from each poll.
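The "estimate the house effect and subtract it" idea can be sketched in a few lines. This is a toy, not Silver's actual model: it uses a plain unweighted average, and the firm names, leads, and house effects below are all invented for illustration.

```python
def adjusted_average(polls, house_effects):
    """Average the polls after subtracting each firm's estimated lean.

    polls: list of (firm, lead) pairs, lead in percentage points.
    house_effects: firm -> estimated lean in the same units
                   (positive means the firm tends to overstate the lead).
    Firms with no estimate are assumed to have no lean.
    """
    corrected = [lead - house_effects.get(firm, 0.0) for firm, lead in polls]
    return sum(corrected) / len(corrected)


# Hypothetical numbers, for illustration only
polls = [("Firm A", 1.0), ("Firm B", 4.0), ("Firm C", 3.0)]
house_effects = {"Firm A": -2.0, "Firm B": 1.0}

raw = sum(lead for _, lead in polls) / len(polls)
print(round(raw, 2), round(adjusted_average(polls, house_effects), 2))
```

In this made-up example the three firms disagree on the surface, but once each firm's lean is removed they all point at the same lead.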
Why Nate Silver is The Bomb
So then there's a pile of states, each of them having a calculated percent of Obama support and a margin of error. Once you've got that, and throw in some basic stuff about the Gaussian bell curve, you can calculate the odds that Obama wins a given state.
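Here is a rough sketch of that bell-curve step. The assumptions are mine: a two-way race, so the candidate wins the state with more than 50% of the two-party vote, and a margin of error quoted at 95% confidence. The function name and the sample numbers are made up for the example.

```python
from statistics import NormalDist


def win_probability(share, margin_of_error, confidence=0.95):
    """Chance the true two-party share is above 50%, under a bell curve.

    share and margin_of_error are both in percentage points.
    """
    # Convert the margin of error back into one standard deviation
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    sigma = margin_of_error / z
    # Area of the bell curve that lies above the 50% line
    return 1 - NormalDist(mu=share, sigma=sigma).cdf(50)


# Hypothetical state: 51.5% support, give or take 3 points
print(round(win_probability(51.5, 3.0), 3))
```

Two sanity checks: a dead-even state comes out at exactly 50/50, and a lead equal to the full margin of error comes out at 97.5%, because only one tail of the bell curve falls on the losing side.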
How does that stack of probabilities turn into a single probability figure for the whole country? The answer there is to use a technique that comes from -- not making this up -- the bomb makers at Los Alamos. Faced with a calculation that was too tricky to work out via pure math on the chalkboard, they tried another approach: doing repeated simulations, with a certain amount of randomness thrown in, and then averaging the results. The more simulations you try, the smaller the margin of error in your final result. Since each calculation includes, by design, a slice of random chance, the technique, which dates to the work of the Los Alamos mathematician Stanislaw Ulam in 1946, was named the Monte Carlo method, after the casinos there.
So Silver runs an experiment. He randomly generates election results for each state consistent with the polling data, adds up the resulting electoral votes (EVs) for each candidate, and checks to see who won. Then he does it again, with different results that are both random and consistent with the polling data. And again and again, until he's got ten thousand trial elections, all different. The percentage of those trials that Obama wins? That's Silver's top-level number. It won't be right on the nose, but its margin of error can be calculated. (There is a related level of confidence here too - the more trials, the less blurry the result.)
He can also check those ten thousand trials for questions like: did Romney win the popular vote but lose the electoral college? Would this result have been different if Ohio went the other way (i.e. was Ohio the tipping point)? Will Obama lose any states he won in 2008? And by checking how many of those ten thousand trials fit the criterion, he can estimate what their probabilities are.
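The ten-thousand-elections experiment is short enough to sketch. This is a toy version under my own assumptions, not Silver's code: the states are treated as independent, each state's result is drawn from a bell curve built from its poll number and margin of error, and the state names, numbers, and "safe" electoral-vote count are all invented.

```python
import random
from statistics import NormalDist

# Standard deviations covered by a 95% margin of error
Z95 = NormalDist().inv_cdf(0.975)


def simulate_elections(states, base_ev, trials=10_000, needed=270, seed=0):
    """Fraction of random trial elections the candidate wins.

    states: name -> (electoral_votes, share, margin_of_error), where share
            is the candidate's two-party percentage in the polls.
    base_ev: electoral votes assumed safe for the candidate.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    wins = 0
    for _ in range(trials):
        ev = base_ev
        for electoral_votes, share, moe in states.values():
            # Draw one plausible result for this state
            if rng.gauss(share, moe / Z95) > 50:
                ev += electoral_votes
        if ev >= needed:
            wins += 1
    return wins / trials


# Hypothetical battlegrounds, for illustration only
toy_states = {
    "State A": (18, 51.5, 3.0),
    "State B": (29, 49.5, 3.0),
    "State C": (13, 50.5, 4.0),
}
print(simulate_elections(toy_states, base_ev=250))
```

The same loop answers the other questions: keep a counter for whatever criterion you care about and divide by the number of trials. With ten thousand trials, the margin of error on the top-level number is at most about one percentage point at 95% confidence.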
There's another level
The way I've described it presumes that every state is independent -- that is, that the way Wisconsin goes has nothing to do with the way Ohio goes, or vice versa. But there is a correlation. In the real world, the odds of, say, Florida going blue are going to be related to the odds that Ohio goes blue; a red Ohio means a blue Florida is less likely than if there were a blue Ohio.
So Nate also includes factors like this for each state, although I don't think he's described it in detail. What matters is that it isn't being done subjectively but numerically; it's not something that by definition works for or against Democrats.
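Since the details aren't public as far as I know, the following is emphatically not Silver's method. It's just one common way to make two states move together: give every trial a shared "national swing" on top of each state's own random noise. All names and numbers here are hypothetical.

```python
import random


def conditional_win_rates(share_a, share_b, state_sigma, national_sigma,
                          trials=10_000, seed=0):
    """Return (P(B wins | A wins), P(B wins | A loses)) from simulation.

    Each trial adds one shared national swing to both states, plus
    independent state-level noise. All inputs are in percentage points.
    """
    rng = random.Random(seed)
    a_won = a_lost = b_given_a = b_given_not_a = 0
    for _ in range(trials):
        swing = rng.gauss(0, national_sigma)  # same for both states
        a = share_a + swing + rng.gauss(0, state_sigma) > 50
        b = share_b + swing + rng.gauss(0, state_sigma) > 50
        if a:
            a_won += 1
            b_given_a += b
        else:
            a_lost += 1
            b_given_not_a += b
    # max(..., 1) avoids dividing by zero if one outcome never happens
    return b_given_a / max(a_won, 1), b_given_not_a / max(a_lost, 1)


# Two hypothetical toss-up states with a sizable shared swing
print(conditional_win_rates(50.0, 50.0, state_sigma=1.0, national_sigma=3.0))
```

With a shared swing in the mix, state B goes blue far more often in the trials where state A did too, which is the Ohio-and-Florida effect described above. Set national_sigma to zero and the two states go back to being independent.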
Nate also has some guides on how the undecided voters are going to finally pull the proverbial lever, based on a variety of political and economic indicators.
The Republican Attack
Since the results are showing a pretty solid probability of an Obama win -- by definition, the methodology can at best generate probabilities -- various faith-based math types are trying to attack Silver's methodology as being inherently biased somehow. This seems to be the year that the GOP taught its low-information voters the word "oversampling," for example. They're also claiming some ridiculous numbers for how independent voters are breaking. And they are in turn attacking Nate for the assumptions he's making on how the undecided are going to land.
But, again, the assumptions Nate makes aren't partisan; they're based on analysis of previous elections, and he's gone to some trouble over the last few months to spell these assumptions out.
But think like a Republican for a moment. You've spent four years complaining about the Marxist Kenyan Welfare-State Food-Stamp Socialist in the White House. Along comes Mister Math Guy saying: it's gonna be four more years! Now, which is easier to do at that moment - accept hard reality about the President You Hate So Much, or just attack the math guy, 'cuz you were never really all that fond of math guys in the first place?