So in early 2011 I was pretty excited to notice that the Daily Kos/SEIU/PPP poll had changed their racial demographic question to include Asian as a category. This meant that eventually, after aggregating the data over a year or so, we would be able to see the opinions of a fast-growing and diverse segment of the population that nonetheless is too small to generate decent data in any individual poll. And in January of this year I set about aggregating that data for the 965 Asian respondents in 2011.
But then I noticed something odd.
Only 35% of Asians in the Daily Kos poll said they lived in the West. But the census shows that 55% of Asian registered voters live in the West. Meanwhile, the Daily Kos poll showed 17% of Asians saying they lived in the Midwest, compared to 8% in the census.
Uh oh.
Now, yesterday we saw that about 5-9% of respondents, depending on region, enter the wrong geographic location for various reasons. So instead of categorizing location based on what respondents said, I based it on area code, using the same regional definitions for the census and the polling data... and found 37% living in the West and 15% living in the Midwest.
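To make the reclassification concrete, here's a minimal sketch of the idea. The area-code table and region mapping below are tiny illustrative samples of my own, not the full lookup a real version of this analysis needs:

```python
import re

# Sketch: classify respondents by area code instead of self-reported region.
# The area-code-to-state table is a small illustrative sample, not the full
# NANP mapping the actual analysis would require.
AREA_CODE_TO_STATE = {
    "212": "NY", "617": "MA",   # Northeast
    "312": "IL",                # Midwest
    "404": "GA", "512": "TX",   # South
    "213": "CA", "206": "WA",   # West
}

STATE_TO_REGION = {
    "NY": "Northeast", "MA": "Northeast",
    "IL": "Midwest",
    "GA": "South", "TX": "South",
    "CA": "West", "WA": "West",
}

def region_from_phone(phone: str) -> str:
    """Census region implied by a phone number's area code."""
    digits = re.sub(r"\D", "", phone)
    if digits.startswith("1"):
        digits = digits[1:]               # drop the country code
    state = AREA_CODE_TO_STATE.get(digits[:3])
    return STATE_TO_REGION.get(state, "Unknown")

print(region_from_phone("1-213-555-0123"))  # -> West
```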
Perhaps it was just too few respondents? But now, in September 2012, with many more respondents, the numbers are... 37% living in the West, 16% in the Midwest (by area code; all further geographic numbers in this post are based on area code).
Meanwhile, among Hispanic registered voters, the poll shows 30% living in the West and the Census says 41%. The poll shows 17% living in the Midwest and the census says 8%. (American Indian numbers are also messed up but that will be a whole separate post. Eventually.)
What is going on?
We already know that around 5-9% of the poll respondents (after weighting by age) enter the incorrect geographic region, and only a small minority of those errors can be attributed to accident. What if a similar proportion enter the wrong racial code?
Looking back at the geographic data, there is a pattern to which number is selected for wrong answers. Averaged out, about 4% chose option 1, 3% chose option 2, 2% chose option 3, and 1% chose option 4. This actually seems like what you might get if people were 'randomly' choosing a button (we're not as random as we think - assuming something akin to ballot order effects occurs during a phone poll).
The Daily Kos poll starts with about 1500 completed calls, around 1200 of which claim to be white, and then around 500 mainly older, white respondents are removed at random to produce the raw polling data, which is then weighted by age. If we ignore the weighting by age for the moment, and assume an error rate for the race question similar to the geography question's, that would mean about 80 respondents in each poll are labeled as minorities but are actually white - about a third of the minorities. You may notice a problem here.
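Here's that back-of-envelope arithmetic as a sketch, assuming (both assumptions mine) that whites mis-press the race question at the geography-question rates and that 'white' is button 1:

```python
# Back-of-envelope: how many actual whites get labeled as minorities?
# Assumptions: whites mis-press the race question at the same rates observed
# on the geography question, and "white" is button 1.
claimed_white = 1200  # respondents pressing "white", used as a proxy for
                      # the (slightly larger) number of actual whites

# Wrong-answer rates by button, from the geography question:
wrong_press_rate = {1: 0.04, 2: 0.03, 3: 0.02, 4: 0.01}

# A white respondent gets a minority label by pressing any button but 1.
mislabel_rate = sum(r for button, r in wrong_press_rate.items() if button != 1)

print(round(claimed_white * mislabel_rate))  # -> 72, ballpark of the ~80 above
```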
We can now do a simplistic little simulation by region, assuming equal response rates across racial minority groups and across regions (not necessarily valid assumptions!). And what we get is... 43% of Asians in the West, 14% in the Midwest; and 30% of Hispanics in the West, 18% in the Midwest. That's damn close to what we see in the polling data.
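A minimal sketch of what such a simulation looks like, treating each minority crosstab as a mixture of true group members and mislabeled whites. The regional shares and fake fractions below are placeholders I filled in for illustration, not the figures the actual simulation used:

```python
# Sketch of the simulation: model the poll's "Asian" and "Hispanic"
# respondents as a mixture of true group members and mislabeled whites.
# All regional shares are rough placeholders standing in for census figures,
# and the fake fractions are illustrative, not derived from button rates.
REGIONS = ["Northeast", "Midwest", "South", "West"]

census_share = {
    "white":    [0.19, 0.26, 0.34, 0.21],
    "asian":    [0.20, 0.08, 0.17, 0.55],
    "hispanic": [0.15, 0.08, 0.36, 0.41],
}

def simulate(group, fake_fraction):
    """Regional mix if `fake_fraction` of a group's label-holders are
    actually whites who pressed the wrong button."""
    return [(1 - fake_fraction) * true + fake_fraction * white
            for true, white in zip(census_share[group], census_share["white"])]

for group, fake in [("asian", 0.45), ("hispanic", 0.40)]:
    mix = simulate(group, fake)
    print(group, {r: f"{s:.0%}" for r, s in zip(REGIONS, mix)})
# asian    -> West drops from 55% toward ~40%, Midwest rises toward ~16%
# hispanic -> West drops from 41% toward ~33%, Midwest rises toward ~15%
```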
Here are graphs comparing the census data, the polling data, and the simulation data:
(The geographic distributions of white and African-American voters are about the same for the census, the poll, and the simulation, so I didn't bother to show them. For the record, though, the polling results match the census slightly better than the simulation does for African-Americans.)
Remember, I made some big assumptions and ignored weighting by age, but even so we can conclude that the strange geographic distribution in the poll is consistent with about 9% of respondents pressing buttons 'randomly' on the race question in the same pattern as they do for the geography question. (Of course, other explanations are still possible too.)
This would imply that a substantial, though uncertain, proportion of minorities in PPP polling are indeed people who were just messing with the poll or pressed the wrong button by mistake. I would assume that a similar phenomenon would be seen for all automated polling firms, and perhaps for some live interviewers as well.
What proportion, exactly? The simulation gives me numbers, but with a fair amount of uncertainty, because I'm not too sure about the proportion of people incorrectly pressing each number on the race question. But I can also calculate a proportion by taking what I think are 'true' polling numbers (from the sources listed below, which I believe have good track records and/or good methods, such as multilingual polling) and comparing them to what we observe in the aggregate 2012 Daily Kos polling.
Despite all the different sources of error for these estimates, both methods give numbers in the same ballpark: somewhere around half of respondents choosing Hispanic and Asian, and (only!) about a fifth of those choosing African-American, are not the race indicated by their choice.
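For the curious, this second method reduces to a simple mixture equation. Here's a sketch with a hypothetical observed crosstab value and hypothetical white support (only the Pew benchmark is a real number from the sources below):

```python
# Sketch of method two: back out the fake fraction from topline support.
# Mixture model: observed = (1 - f) * true_group + f * mislabeled_whites
#           =>   f = (true_group - observed) / (true_group - whites)
def fake_fraction(true_support, observed, white_support):
    return (true_support - observed) / (true_support - white_support)

true_hispanic = 0.67  # Pew benchmark: Obama support among Hispanic voters
observed      = 0.54  # hypothetical Daily Kos "Hispanic" crosstab value
white_support = 0.40  # hypothetical Obama support among actual whites

print(f"{fake_fraction(true_hispanic, observed, white_support):.0%}")
# -> 48%, i.e. 'around half' with these made-up inputs
```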
Now we can say with confidence that two independent methods of estimating the percentage of 'fake' minorities in the Daily Kos poll come up with the answer of 'lots' - which, while not a very specific number, is good enough for us to know we shouldn't trust the absolute values of the crosstabs for minority demographics in PPP polls, and most likely all automated polls. And we have one hypothesis, consistent with the data, for why there are so many incorrect responses to the race question: people are doing about the same thing they did with the geography question.
I think this would be a good time to remind the reader that PPP and other (legitimate) automated pollsters still get their final numbers pretty close, despite this. Also, this analysis is only possible because of PPP's transparency in releasing all their raw data.
But for all those people who have said to beware the crosstabs... you're absolutely right.
Data Sources:
Pew. Averaged over the summer, they have it at about 67-25 for Obama among Hispanic voters and 94-3 for Obama among Black voters. They also have a lot of interesting data for different demographic groups in their Religion & Public Life series.
Latino Decisions. Latino Decisions is releasing a poll of 300 respondents every week. The error is high, but the numbers are clear: over the past four weeks, they've had it at about 66% Obama, 28% Romney. Latino Decisions has also released some state polls.
Fox News Latino. In March they had Obama up 70-14 over Romney; this week they have it at 60-30.
USA Today/Gallup Latino Poll. From the spring: Obama favored 72-19 among immigrants and 63-28 among the US-born.
Asian American Justice Center and APIAVote. A poll last spring - the first ever national poll of Asian-Americans - showed Obama ahead of Romney 59-13.
BET Poll. Frustratingly, I can't seem to find the full release of this poll anywhere, but it does say in one of the press releases that only 2% of African-Americans in battleground states support Romney.
___________
Beyond the Margin of Error is a series exploring problems in polling other than random error, which is the only type of error the margin of error deals with.
Previously:
Why Don't People Know Where They Live in the DKos Poll? A small number of respondents - around 5-9% - press the wrong button when answering the geography question on the Daily Kos poll. This is far greater than can be explained by observed rates of misunderstandings or data entry errors.
Why State Polls Look More Favorable For Obama than National Polls. In the spring and summer, lack of support in Blue States was bringing down Obama's performance in national polls, while Swing States and Red States were polling about the same as 2008.
Presidential Polls Are Almost Always Right, Even When They're Wrong. How the presidential polls in red and blue states are off, sometimes way off, and how to predict how far off they'll be.
When Polls Fail, or Why Elizabeth Warren Will Dash GOP Hopes. Why polls for close races for Governor and Senate are sometimes way off, and how to predict how far off they will be.