Cross-posted from Overdetermined.net
Last week, I discussed some of the sampling problems that can crop up in pollsters' work. My first article showed how the problem of bias is intrinsic to stratified sampling. My second article expanded the discussion to demographic weighting and cluster sampling.
Today, I'm going to resume the discussion of polling problems by looking at the media's new tactic of using polls-of-polls to resolve the sometimes conflicting information from polls that get reported on an hourly basis.
Sometimes having too much data is just as problematic as not having enough. To me, the classic example of this problem is Sept. 11th, 2001. American security agencies had all of the information necessary to identify the terrorists involved and stop the attacks. Why didn't they? Among other reasons, because authorities failed to connect the dots. (I find it perversely amusing that the government's response to 9/11, then, was to try to increase the number of 'dots' collected rather than learn to connect them better.)
Polling can work much the same way. If we have too much information, we don't know how to put it all together. Say Pew releases a poll showing Obama 53 - McCain 36. IBD/TIPP pegs the race at Obama 47 - McCain 44. Newsweek puts the split at Obama 53 - McCain 41. So what's the "real" state of the race? Which poll is "correct"? They can't all be correct, can they?
Well, actually, they can.
Polls don't measure voter sentiment; they approximate voter sentiment. The only thing that measures voter sentiment is the election itself. If we could measure voter sentiment without running an election, why would we go to all the trouble and expense of holding elections at all?
Polls don't measure; they approximate. What do I mean when I say this? Let me walk you through it.
We have a group of people: voters. We're going to measure this group's opinions on Election Day, but we'd like to get some idea of what they think beforehand. The problem is, we don't know exactly who is and isn't part of this group - who will actually turn out to vote on or before Election Day. We know who's registered to vote. We know what the US Census tells us about the makeup of the American population. We know who turned out to vote in 2004. But we don't know who will vote in 2008 - and we can't know until they've already voted.
So we have to make guesses about who makes up this group of people - voters. Here's where the registered voter (RV) and likely voter (LV) distinctions come in. How does a poll choose to think about the makeup of the voting group? Likely voter projections are inherently dodgy, predicated on individual pollster assumptions about who is more or less likely to show up at the polls (and perhaps wait in line). But even registered voter projections can be confounded by conflicting registration numbers, when (as in 2008) one party registers more new members than the other. Defining the party split in registrations can become as problematic as estimating demographic splits among likely voters, and each pollster may have his/her own estimates of those splits.
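To make that concrete, here's a minimal sketch in Python - with invented respondents and invented cutoffs, not any pollster's actual screen - showing how two different likely-voter screens applied to the same raw interviews yield different toplines:

```python
# Illustrative only: the respondents and turnout scores are made up.
# Each tuple is (candidate preference, self-reported turnout likelihood 0-10).
respondents = [
    ("Obama", 3), ("Obama", 5), ("Obama", 6), ("Obama", 9),
    ("Obama", 10), ("Obama", 4), ("Obama", 7), ("McCain", 8),
    ("McCain", 9), ("McCain", 10), ("McCain", 7), ("McCain", 6),
]

def topline(sample, cutoff):
    """Screen out respondents below the turnout cutoff, then report
    each candidate's percentage of the screened sample."""
    screened = [cand for cand, score in sample if score >= cutoff]
    return {cand: round(100 * screened.count(cand) / len(screened))
            for cand in sorted(set(screened))}

print("Loose screen (cutoff 5):", topline(respondents, 5))
print("Tight screen (cutoff 8):", topline(respondents, 8))
# Same raw data, different assumptions about who votes,
# different "state of the race".
```

Neither screen is "wrong"; each one just encodes a different belief about turnout.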
So each poll, in its own way, approximates voter sentiment. Each poll works from a series of assumptions about what the pool of voters looks like in a given year, and projects results based on the assumed make-up of that voting pool. When we see a Pew poll, an IBD/TIPP poll, and a Newsweek poll all showing very different numbers, they can all be correct. Why? Because these polls aren't all measuring the same thing. The projected voter pools in the Pew, IBD/TIPP, and Newsweek polls are all different.
Some of you may be wondering why I haven't mentioned the margin of error. The reason is simple, but worth stating explicitly: margin of error (MoE) is not germane to this conversation. The error described by MoE is random variability in measurement - it reflects the degree to which we're sure about a poll's measurements. MoE has nothing to do with the differences in pollsters' assumptions about who will vote.
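For reference, the MoE attached to a poll is just the half-width of a confidence interval around a sampled proportion. A quick sketch of the textbook calculation (assuming simple random sampling, which real polls only approximate):

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of a 95% confidence interval for a proportion p
    estimated from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# A typical national poll: ~1,000 respondents, candidate near 50%.
print(f"MoE: +/- {100 * margin_of_error(0.50, 1000):.1f} points")  # ~3.1
```

That +/- 3 points quantifies random sampling noise only; it is silent on whether the pollster screened the right electorate in the first place.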
The differences between poll estimates have more to do with assumptions than random error. How do I know? Because error is random, and if the differences between polls were due to random error, then any poll would have an equal chance of reporting a number above or below the average of all polls. This is not the case:

[Figure: tracking-poll estimates of Obama support over time, by pollster]

As can be seen above, although there is variation in individual polls, there are clear differences in what the polls report. ABC News / Washington Post and CNN, for example, consistently report higher support for Obama than GWU / Battleground and FOX News. NBC News / WSJ stays pretty much in the middle of the other polling estimates. As I said above, these differences cannot be accounted for by MoE alone. The results reflect systematic differences between pollsters' estimates.
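You can check this logic with a quick simulation. In the sketch below (the house-effect sizes and noise levels are invented for illustration), pollsters with a built-in lean end up above or below the all-poll average far more often than the coin-flip rate that pure random error would produce:

```python
import random

random.seed(0)
TRUE_MARGIN = 6.0   # hypothetical "true" Obama lead, in points
NOISE_SD = 2.0      # random sampling error per poll, in points

# Invented house effects (points added to Obama's margin).
house_effects = {"Pollster A": +2.0, "Pollster B": 0.0, "Pollster C": -2.0}

above_average = {name: 0 for name in house_effects}
n_rounds = 1000
for _ in range(n_rounds):
    # Each pollster releases one poll: truth + house lean + random noise.
    polls = {name: TRUE_MARGIN + lean + random.gauss(0, NOISE_SD)
             for name, lean in house_effects.items()}
    avg = sum(polls.values()) / len(polls)
    for name, margin in polls.items():
        above_average[name] += margin > avg

for name, count in above_average.items():
    print(f"{name}: above the poll average {100 * count / n_rounds:.0f}% of the time")
```

With no house effects, every pollster would land above the average about 50% of the time; a systematic lean shows up as a pollster sitting consistently above (or below) it, which is the pattern in the chart.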
Polls-of-polls are the media's new favorite tactic for combining the myriad polling data into a coherent, easily digestible whole. A poll-of-polls is basically an average of polling reports. Some meta-pollsters (like Nate Silver at 538) jazz up their methodology by computing a weighted average, or tossing in their own data as a correction against the polls. Others are content to just average the polls together without resorting to more complex tools.
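In code, the basic recipe is almost embarrassingly simple, which is part of its appeal. A minimal sketch, using the hypothetical poll numbers from earlier and invented reliability weights (not any aggregator's actual values):

```python
# Hypothetical polls from earlier: (pollster, Obama share, McCain share)
polls = [("Pew", 53, 36), ("IBD/TIPP", 47, 44), ("Newsweek", 53, 41)]

# Simple poll-of-polls: the unweighted mean of each candidate's share.
obama = sum(o for _, o, _ in polls) / len(polls)
mccain = sum(m for _, _, m in polls) / len(polls)
print(f"Simple average: Obama {obama:.1f} - McCain {mccain:.1f}")

# Weighted variant, in the spirit of 538: weight each poll by an
# (invented) reliability score before averaging.
weights = {"Pew": 0.9, "IBD/TIPP": 0.6, "Newsweek": 0.7}
total = sum(weights[name] for name, _, _ in polls)
obama_w = sum(weights[name] * o for name, o, _ in polls) / total
mccain_w = sum(weights[name] * m for name, _, m in polls) / total
print(f"Weighted average: Obama {obama_w:.1f} - McCain {mccain_w:.1f}")
```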
There is, however, one fundamental problem with polls-of-polls. They all assume that polling reports are directly comparable - that all polls are measuring the same construct. And while all polls purport to give an accurate approximation of voter sentiment, as we can see above, pollsters don't agree about how best to approximate it.
Under these circumstances, taking a poll-of-polls is tantamount to saying that you think the "correct" approximation of voter sentiment is the average of all approximations in use by pollsters. A meta-pollster under these conditions cannot have an opinion about the best way to approximate voter sentiment. He/she is constrained by the approximation methods used by all other pollsters.
Even Nate Silver is constrained by this problem. He weights his average with reliability scores based on each pollster's previous predictive success, but in the end this just means he trusts pollsters whose past approximations were passable to keep making decent approximations. I'll be reviewing Nate's methodology at length later this week - it's probably the best methodology out there, but Nate is basically trying to force one analytic framework onto a problem that calls for an entirely different approach.
For today, however, the upshot is this - the simple poll-of-polls methodology is designed to minimize the random measurement error in any single poll. It is not designed to combine polls that use different methods of approximating voter sentiment. Polls-of-polls tend to ignore methodological differences and blithely hope that everything comes out in the wash.
There are better methods (random-effects meta-analysis, Bayesian estimation procedures), and I'll be back soon to discuss them.
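As a preview, here's a minimal sketch of the random-effects idea (a DerSimonian-Laird-style estimate), applied to the hypothetical polls from earlier; the sampling variances are invented for illustration. The point is the tau-squared term: an explicit estimate of how much pollsters disagree beyond what sampling error can explain - exactly the systematic component a simple average sweeps under the rug.

```python
import math

# Hypothetical polls from earlier, as Obama-minus-McCain margins (points),
# with invented sampling variances for each poll.
margins = [17.0, 3.0, 12.0]     # Pew, IBD/TIPP, Newsweek
variances = [2.4, 2.1, 2.6]

# Fixed-effect pooling: weight each poll by inverse sampling variance.
w = [1 / v for v in variances]
fixed = sum(wi * m for wi, m in zip(w, margins)) / sum(w)

# DerSimonian-Laird estimate of tau^2, the between-pollster variance:
# disagreement beyond what sampling error alone would produce.
q = sum(wi * (m - fixed) ** 2 for wi, m in zip(w, margins))
k = len(margins)
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (k - 1)) / c)

# Random-effects pooling folds tau^2 into every poll's uncertainty.
w_re = [1 / (v + tau2) for v in variances]
pooled = sum(wi * m for wi, m in zip(w_re, margins)) / sum(w_re)
se = math.sqrt(1 / sum(w_re))
print(f"tau^2 = {tau2:.1f}; pooled margin = {pooled:.1f} +/- {1.96 * se:.1f}")
```

When tau-squared comes out large, as it does here, the polls disagree for reasons MoE cannot explain - which is the whole argument of this post.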