The New Hampshire polls were off. Way off. There are lots of theories as to why (both here on dailyKos, and at pollster.com and elsewhere). One problem that all polls share is missing data: that is, people who don't answer the questions.
How to deal with missing data?
Join me, below the fold
The most important thing to know when trying to account for missing data is why the data are missing. There are three broad reasons, called 'missing data mechanisms': data can be missing completely at random (MCAR), missing at random (MAR), or nonignorable (also known as not missing at random, NMAR).
sidebar: I didn't make up these terms. They're confusing, but the seminal book on missing data (Little and Rubin's Statistical Analysis with Missing Data) used them, and we're stuck with them.
OK, so what do these mean?
Data is MCAR if the reason for it being missing is totally random. One of the hard disks failed, or something like that. Data that is MCAR causes few problems. Essentially, you can just discard it, and it's as if you never had it. Your sample size goes down, so your margin of error goes up, but that's about it. All your conclusions are valid.
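To see why MCAR only costs you precision, here's a minimal sketch with made-up numbers: a poll of 1,000 people that loses 100 responses completely at random. The standard margin-of-error formula for a proportion shows the error bar widening slightly, and nothing else changes.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for an estimated proportion p with sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical numbers: 100 of 1,000 responses lost completely at
# random (a failed hard disk, say). We just drop them.
full = margin_of_error(1000)
reduced = margin_of_error(900)
print(f"n=1000: +/- {full:.1%}")    # about +/- 3.1%
print(f"n=900:  +/- {reduced:.1%}")  # about +/- 3.3%
```

The estimate itself is still unbiased; the only price of MCAR is that slightly wider interval.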
Data is MAR if the reasons for it being missing can be fully accounted for by things you know about. For example, it is known that people in different income groups are more or less likely to answer the phone. If, in your poll, you ask about income, and if you know the incomes of people who actually vote, then you can account for this by weighting. The problem here is that we don't know, in advance, who is going to vote. We can look at data from previous elections and guess that things will be similar this time, but just because 55% of the voters last time were women (for instance) doesn't mean 55% will be women this time. This is a source of modeling error, and pollsters have different models. For instance, a pollster might guess that more women will vote in 2008 than in 2004, since a woman is running. Another pollster might guess that that won't happen.
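Here's a minimal sketch of that kind of weighting, with made-up numbers. The sample shares, the modeled electorate shares, and the support levels within each group are all hypothetical; the point is just that each group gets counted in proportion to the electorate model, not the raw sample.

```python
# Share of poll respondents in each group (hypothetical).
sample = {"women": 0.60, "men": 0.40}
# Modeled share of actual voters, e.g. from past turnout
# (this is exactly where the modeling error can creep in).
electorate = {"women": 0.55, "men": 0.45}
# Candidate support within each group (hypothetical).
support = {"women": 0.52, "men": 0.44}

# Unweighted estimate: averages over whoever happened to respond.
unweighted = sum(sample[g] * support[g] for g in sample)

# Weighted estimate: each group counts in proportion to the
# modeled electorate, correcting for who answered the phone.
weighted = sum(electorate[g] * support[g] for g in electorate)

print(f"unweighted: {unweighted:.1%}")  # 48.8%
print(f"weighted:   {weighted:.1%}")    # 48.4%
```

If the turnout model is wrong (say, 55% women turns out to be 57%), the weighted number is wrong too, which is the modeling-error problem described above.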
The right way to deal with this (and I don't think any pollster does this) is through what's called multiple imputation. Suppose you know that Joe Shmo didn't answer the poll. You also know certain things about Joe Shmo. You might know where he lives (or have a guess at it, from his phone number). You might know his racial/ethnic group (or take a guess, from his name). You can take a good guess as to sex, and so on. Or you might know a lot more, depending on where you got his phone number and what your resources are. How does all that information help you? Well, you have a model of how people who live in certain areas, belong to certain racial/ethnic groups, and so on, are likely to vote. So, you pretend that you have Joe's answer. Let's say you know Joe is a college educated Black male. You think such people are quite likely to vote for Obama. So, you pretend that Joe answered "Obama". That's single imputation, because you've imputed one response. But that underestimates the margin of error, because, while you may be right that college educated Black men are likely to be Obama supporters, you could, of course, be wrong about Joe. So, rather than imputing one response, you impute multiple responses, creating multiple data sets.
There are then statistical techniques to deal with those multiple data sets and come up with valid answers.
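The procedure above can be sketched in a few lines. Everything here is made up for illustration: the respondent counts, and the modeled probabilities that each nonrespondent supports the candidate. The key idea is that each imputed data set draws the missing answers at random from the model, rather than plugging in the single most likely answer, and the spread across data sets then captures the extra uncertainty from not really knowing Joe's answer.

```python
import random
import statistics

random.seed(0)  # for a reproducible illustration

# Hypothetical poll: 800 respondents (1 = supports the candidate)...
respondents = [1] * 420 + [0] * 380
# ...and 200 nonrespondents, each with a modeled probability of
# support based on what we know about them (area, demographics, etc.).
nonresp_probs = [0.60] * 120 + [0.45] * 80

M = 20  # number of imputed data sets
estimates = []
for _ in range(M):
    # Draw each missing answer from its model probability, instead of
    # always imputing the most likely answer (single imputation).
    imputed = [1 if random.random() < p else 0 for p in nonresp_probs]
    data = respondents + imputed
    estimates.append(sum(data) / len(data))

# Pool across the imputed data sets: the mean is the point estimate,
# and the between-imputation spread reflects the added uncertainty.
pooled = statistics.mean(estimates)
between = statistics.stdev(estimates)
print(f"pooled estimate: {pooled:.1%} (between-imputation sd: {between:.2%})")
```

The real pooling rules also combine the within-data-set variance with this between-data-set variance, so the final margin of error is honestly wider than single imputation would suggest.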
But what about nonignorable missing data? That's when the reason for data being missing is something you do not know about. For example, suppose Clinton supporters are more likely not to respond, even after you've accounted for age, race, and so on. Then you are in trouble. There is no good way of dealing with this. Might this be the case? Sure. Why might it be the case? Well, I've seen polls showing that more of Hillary's support was 'soft', and it makes sense to me that 'soft' supporters would be less likely to stay on the phone. Also, Hillary had strong support among moms of school age kids. These women are more likely than other voters to be home - and, if you are home, you get more calls, and thus get more annoyed. They are also more likely to be busy at home. (Anyone, male or female, who's had the phone ring while they are changing a diaper or giving a kid a bath or getting a kid to sleep will know what I mean.)
Is this the reason the polls were way off?
I don't know; I can't prove it. But I think it makes sense, and there is some suggestive evidence. For one thing, the pollsters didn't get Obama's numbers wrong, only Clinton's. Look at this post by Charles Franklin. Some of the polls underestimated Obama, some overestimated him; on average, the polls got his percentage correct. But every poll underestimated Hillary's numbers.