The following article is cross-posted from Overdetermined.net
This week, I'm going to spend some time discussing problems with current polling methodology. The numbers we're seeing in today's polls may not accurately represent the real shape of the electoral landscape. The stratified sampling problem has been almost completely ignored in favor of more easily digestible problems like the Bradley Effect and the undersampling of cell-phone users. Unfortunately, the stratified sampling problem may prove to be more influential than either of those concerns.
I'm going to do my best to keep this post understandable for anyone with an interest in polling, but be forewarned: I'm dealing with a deep-seated technical problem that suffuses the entire polling industry. If the stratified sampling problem were easy to understand and manage, we wouldn’t be having this conversation.
So first, what is a stratified sampling poll? We all have some idea of how election polls work: a pollster randomly chooses a number of registered or likely voters, contacts them (usually by phone), and asks who they plan to vote for.
Sound good?
Well, that depends on your definition of "random".
Purely for the sake of argument, let's assume the population is 20% black and 80% white. Let's also assume that 95% of black voters plan to vote for Obama, and 40% of white voters plan to vote for Obama. (I'll repeat this in a second, for the mathematically disinclined)
Now, let’s assume we poll 100 people and ask them if they plan to vote for Obama or McCain. We pick these 100 people "randomly". That means:
- There’s a 1 in 5 chance any individual is black.
- IF that individual is black, there’s a 95% chance they’ll vote for Obama.
- IF that individual is white, there’s a 40% chance they’ll vote for Obama.
Clearly, how many black voters we survey matters a great deal. A black voter is almost certainly a vote for Obama, while a white voter is more likely than not to favor McCain. But the chance of drawing exactly 20 black voters and 80 white voters is less than 1 in 10 (0.0993, to be precise). So let’s look at three different scenarios.
- Our sample is 30 black voters and 70 white voters. Then we expect Obama to receive (30*.95) + (70*.40) = 56.5% of the vote.
- Our sample is 20 black voters and 80 white voters (population average). Then we expect Obama to receive (20*.95) + (80*.4) = 51% of the vote.
- Our sample is 10 black voters and 90 white voters. Then we expect Obama to receive (10*.95) + (90*.40) = 45.5% of the vote.
From this example, we can see that if two different demographic groups vote in very different ways, small errors in the representation of each group within a random sample can have profound effects on the outcome of the poll.
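For anyone who wants to check the arithmetic, here is a minimal Python sketch of the example above. The 20% / 95% / 40% figures are the made-up numbers from this post, not real polling data, and the function names are my own invention. The first function treats the number of black respondents as a binomial draw; the second reproduces the three scenarios.

```python
from math import comb

# Hypothetical figures from the example in this post, not real data.
P_BLACK = 0.20      # share of the population that is black
OBAMA_BLACK = 0.95  # share of black voters planning to vote for Obama
OBAMA_WHITE = 0.40  # share of white voters planning to vote for Obama
SAMPLE_SIZE = 100

def prob_exact(n_black, n=SAMPLE_SIZE, p=P_BLACK):
    """Binomial probability that a random sample of n holds exactly n_black black voters."""
    return comb(n, n_black) * p**n_black * (1 - p)**(n - n_black)

def expected_obama_share(n_black, n=SAMPLE_SIZE):
    """Expected Obama percentage for a sample with n_black black and n - n_black white voters."""
    n_white = n - n_black
    return 100 * (n_black * OBAMA_BLACK + n_white * OBAMA_WHITE) / n

print(f"P(exactly 20 black voters) = {prob_exact(20):.4f}")
for n_black in (30, 20, 10):
    print(f"{n_black} black / {SAMPLE_SIZE - n_black} white: "
          f"{expected_obama_share(n_black):.1f}% Obama")
```

The gap between the first and last scenario is the number to keep in mind for the rest of the post.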
The solution to this problem is to sample demographic groups independently. This is called stratified sampling. Instead of randomly picking 100 voters from a population that is 80% white and 20% black, we control for demographic groups: we pick exactly 80 white voters and 20 black voters. By making sure each group’s share of the sample matches its share of the population, we can curtail the errors we get from oversampling one group relative to another.
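A rough way to see the difference is to simulate both kinds of poll many times and compare how much the topline bounces around. The sketch below is only an illustration under the made-up numbers from this post (20% black, 95% and 40% Obama support); the function names, the trial count, and the seed are arbitrary choices of mine.

```python
import random
from statistics import pstdev

def simple_random_poll(rng, n=100):
    """Each respondent's race is left to chance (20% black), then their vote is drawn."""
    votes = 0
    for _ in range(n):
        is_black = rng.random() < 0.20
        p_obama = 0.95 if is_black else 0.40
        votes += rng.random() < p_obama
    return 100 * votes / n

def stratified_poll(rng, n_black=20, n_white=80):
    """Group sizes are fixed in advance; only the votes within each group are random."""
    votes = sum(rng.random() < 0.95 for _ in range(n_black))
    votes += sum(rng.random() < 0.40 for _ in range(n_white))
    return 100 * votes / (n_black + n_white)

def spread(poll, trials=5000, seed=0):
    """Standard deviation of the Obama percentage across many simulated polls."""
    rng = random.Random(seed)
    return pstdev([poll(rng) for _ in range(trials)])

print(f"simple random sample spread: {spread(simple_random_poll):.2f} points")
print(f"stratified sample spread:    {spread(stratified_poll):.2f} points")
```

With these particular numbers, the textbook variance formulas put the two spreads at roughly 5.0 and 4.5 points. Fixing the group sizes removes the part of the noise that comes from the racial mix of the sample; the noise from individual voters within each group remains.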
Okay, that was a lot of numbers. Take a break. Breathe. What did we learn?
When we’re faced with demographic groups that vote differently, we should treat each group independently. The problem is biggest when the difference between how groups vote is biggest. If blacks and whites vote almost the same, we don’t gain a lot of information by treating them separately. But if blacks and whites vote very differently, it’s important that we distinguish groups in our polls.
For statistics nerds, the importance of distinguishing groups grows with the (tetrachoric) correlation between group membership and voting choice. If there’s a large correlation, it’s important to distinguish. If there’s a small correlation, the groups vote pretty much the same way.
---------------------------------
Stratified sampling is the modus operandi of most, if not all, modern polls. It helps eliminate error created by non-representative samples. Pollsters survey individuals in a number of demographic groups, and then assign each group a share of the electorate proportional to its expected size. For instance, a pollster would assume that the black vote accounts for a large proportion of votes in South Carolina (where 30% of the population is black), but would predict a smaller share of black voters in Iowa (where less than 3% of the population is black). By modelling the electorate instead of relying on pure random sampling, pollsters are able to partial out some of the unwanted error that would come from uneven demographic distributions in the polling sample.
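To make the weighting step concrete, here is a small Python sketch. It is not any particular pollster's method: the respondent data is invented, the function name is hypothetical, and the 20% / 80% electorate shares are the assumed figures from the earlier example. The idea is to measure support within each group, then combine the groups using their assumed shares of the electorate rather than their shares of the raw sample.

```python
from collections import defaultdict

def weighted_topline(responses, electorate_shares):
    """Combine per-group support using assumed electorate shares.

    responses: list of (group, supports_obama) pairs from the raw sample.
    electorate_shares: dict mapping group -> assumed share of the electorate.
    Every group in electorate_shares must have at least one respondent.
    """
    counts = defaultdict(int)
    supporters = defaultdict(int)
    for group, supports_obama in responses:
        counts[group] += 1
        supporters[group] += supports_obama
    return 100 * sum(share * supporters[group] / counts[group]
                     for group, share in electorate_shares.items())

# Invented raw sample that under-represents black voters:
# 10 black respondents (9 for Obama), 90 white respondents (36 for Obama).
responses = ([("black", True)] * 9 + [("black", False)] * 1
             + [("white", True)] * 36 + [("white", False)] * 54)

raw = 100 * sum(supports for _, supports in responses) / len(responses)
weighted = weighted_topline(responses, {"black": 0.20, "white": 0.80})
print(f"raw topline:      {raw:.1f}% Obama")
print(f"weighted topline: {weighted:.1f}% Obama")
```

Notice that the weighted number is only as good as the shares passed in as the second argument.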
So stratified sampling is good, right? It helps us learn more about how voters act in different demographic groups. It gives us more detailed information.
Yes, it does, but there’s a very real problem with this sort of approach: no one knows the true sizes of the demographic groups.
Sure, the US Census Bureau has race and gender data, but that’s for the population at large. For polls, we only care about the people who vote on Election Day. No one knows what voter turnout will look like beforehand. The best we can do is guess that it will look the same as it did four years ago. This may not, however, be a reasonable assumption.
Consider this election season. We know anecdotally that Barack Obama energizes young voters and black voters. We can make a reasonable assumption that turnout will be higher for both of these groups, but we don’t know how much higher. There are many, many new registered voters this year, and we’re only starting to get data on how this may change the polling landscape. Furthermore, while Sarah Palin energizes the conservative base, many conservatives have always been wary of John McCain. It’s difficult to predict how much of the Republican base will come out to support McCain on Election Day.
This is not a hopeless cause: pollsters can make educated guesses about the shape of the electorate. The problem is, there’s no way to be certain whose demographic projections are most accurate. And as we saw in the example above, even relatively minor shifts in demographic representation can have profound consequences for topline numbers when demographic groups differ sharply in which candidate they prefer.
With about 88% of Democrats supporting Obama and 88% of Republicans supporting McCain, the biggest question is, what will be the relative sizes of Democratic and Republican voting blocs on Nov. 4th? The correlation between party identification and voting intention is by far the biggest demographic split. A poll that predicts about equal turnout for Democrats and Republicans will paint a very different picture from a poll that predicts a 38% DEM / 30% GOP split.
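Here is a back-of-the-envelope sketch of how much the party split matters. Only the 88% loyalty figure and the 38% / 30% split come from the paragraph above; everything else is an assumption of mine for illustration: the remaining 12% of partisans vote for the other candidate, everyone outside the two parties is an independent who splits 50/50, and "equal turnout" means 34% / 34%.

```python
def obama_margin(dem_share, gop_share, loyalty=0.88, indep_obama=0.50):
    """Obama-minus-McCain margin in points for an assumed turnout model.

    Assumptions (hypothetical, for illustration only):
    - `loyalty` of each party backs its own nominee, the rest back the other;
    - voters outside the two parties give `indep_obama` of their votes to Obama;
    - every voter picks one of the two candidates.
    """
    indep_share = 1.0 - dem_share - gop_share
    obama = (dem_share * loyalty
             + gop_share * (1.0 - loyalty)
             + indep_share * indep_obama)
    mccain = 1.0 - obama
    return 100 * (obama - mccain)

for dem, gop in [(0.34, 0.34), (0.38, 0.30)]:
    print(f"DEM {dem:.0%} / GOP {gop:.0%}: Obama margin {obama_margin(dem, gop):+.1f} points")
```

Under these assumptions, an even split produces a tie and the 38% / 30% split produces an Obama lead of about 6 points, with no voter in any group changing their mind.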
Stratified sampling is extremely problematic this election cycle. Some demographic groups show big correlations with voting intentions. And no one is really sure what the electorate will look like on Nov. 4th, or which voters will come out to cast their ballots. Without a good model for turnout, our polls aren’t much better than unstratified random samples.
In addition, the margin of error on polls doesn’t include error from incorrect turnout predictions. Poll margins of error are based solely on the size of a sample. It’s entirely possible that two polls can disagree by 10 points, and yet represent exactly the same demographic trends. Look at the example above, again. When we predict 30% of voters will be black, Obama scores 11 points higher (56.5% versus 45.5%) than when we predict only 10% will be black. The two polls follow identical demographic trends. The only difference, the sole explanation for that 11-point discrepancy, is demographic representation within the sample.
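For reference, the usual textbook margin of error for a simple random sample at 95% confidence can be sketched in a few lines of Python. I am assuming this standard formula here; individual pollsters may compute theirs somewhat differently, and the function name is my own.

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Textbook 95% margin of error, in points, for a simple random sample of size n.

    p is the assumed proportion (0.5 is the worst case) and z is the normal
    critical value. Note that no turnout assumption appears anywhere.
    """
    return 100 * z * sqrt(p * (1 - p) / n)

for n in (100, 500, 1000):
    print(f"n = {n}: +/- {margin_of_error(n):.1f} points")
```

The only inputs are the sample size, the assumed proportion, and the confidence level. A pollster's guess about who will show up on Election Day never enters the calculation.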
The media has spent considerable time focusing on poll accuracy – on whether voters are lying to pollsters (the Bradley Effect), or on whether some groups are being adequately sampled (cell-phone users). But there’s one important question that the media keeps dodging, over and over again.
How wrong could our polls be, simply because they predicted the wrong kind of turnout on Election Day?