Sample error and model error - why Kerry may be way ahead

by sck5

(This content is not subject to review by Daily Kos staff prior to publication.)

Thursday, Sep. 23, 2004 at 8:08:55am PDT

With the new ARG poll out, there are questions about the relative importance of sampling error and model error. Several commentators have mentioned that model error, rather than sampling error, is likely what is causing the wide disparity in results. This is almost certainly true, but many don't know what the difference between the two is. Here it is:

SAMPLING ERROR - usually reported as the "margin of error" - tells you how confident you can be that the value computed from the sample you picked actually represents the true value in the population as a whole, ASSUMING THAT YOU HAVE THE CORRECT MODEL FOR HOW THE CHOICE IS MADE IN THE FIRST PLACE. For example, if we look at coin tosses, the "correct model" is that there is a 50-50 chance of heads vs. tails. If I toss the coin ten times, I may or may not get this 50-50 split from my sample of tosses. The more tosses, the more likely I am to get something close to 50-50, again ASSUMING THAT I AM RIGHT THAT THE CORRECT MODEL IS 50-50.
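To make the coin example concrete, here is a minimal sketch in Python. The function name heads_share and the seed are my own choices for illustration. It tosses a simulated coin more and more times and prints the share of heads. It also shows the catch: if the coin is secretly 55-45, more tosses only get you closer to 55, never to 50.

```python
import random

def heads_share(n_tosses, p_heads=0.5, seed=0):
    """Toss a simulated coin n_tosses times; return the fraction of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(1 for _ in range(n_tosses) if rng.random() < p_heads)
    return heads / n_tosses

# Sampling error shrinks as the number of tosses grows...
for n in (10, 100, 1000, 100000):
    print(n, "tosses, fair coin:", heads_share(n))

# ...but if the model is wrong (the coin is really 55-45), a bigger
# sample just measures the wrong assumption more precisely.
print("100000 tosses, biased coin:", heads_share(100000, p_heads=0.55))
```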

The usually reported "MOEs" are correctly understood to mean the following: if I repeated my sample 20 times on the same population, then about 19 times out of 20 the "true" value would lie within the range given by that sample's estimate plus or minus the MOE.
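The "19 times out of 20" reading can be checked with a quick simulation. This is a sketch under stated assumptions: simple random sampling, the textbook formula MOE = 1.96 * sqrt(p * (1 - p) / n), and a made-up "true" value of 50%. The names margin_of_error and coverage are mine, not anything a pollster publishes.

```python
import math
import random

def margin_of_error(p, n, z=1.96):
    """Textbook 95% margin of error for a proportion p from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

def coverage(true_p=0.5, n=1000, polls=1000, seed=1):
    """Run many simulated polls; return the share whose estimate
    plus or minus its own MOE contains the true value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(polls):
        est = sum(1 for _ in range(n) if rng.random() < true_p) / n
        if abs(est - true_p) <= margin_of_error(est, n):
            hits += 1
    return hits / polls

print("MOE for n=1000 at 50%:", margin_of_error(0.5, 1000))
print("Share of polls covering the truth:", coverage())
```

For a poll of 1000 people the formula gives roughly 3 points, and the share of simulated polls that cover the truth should come out near 0.95. Note that the simulation builds in the correct model; nothing here protects you if the model is wrong.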

MODEL ERROR - is far more important when we don't really know the true nature of the model and are only guessing at it. The "true" parameter for Bush vs. Kerry is whatever percentage we would get if we could actually ask every single person who really is going to vote in November who they would vote for, and they actually told us the truth and then actually did go vote that way. Where could this go wrong in an actual poll?

Respondents could lie to the pollster - e.g. the pollster calls at night and only reaches people while their spouse is listening
The pollsters could be failing to randomly sample the whole population. Some examples:

They call during the day and only get stay-at-homes and nobody who has a day job
They never get people with caller ID because they won't pick up
They never get people with cell phones at all

An important note - these factors will only bias the result if there is some relation between the groups excluded or oversampled and support for one candidate or the other. E.g. if most people who only have cell phones support Kerry, then missing them will skew the results toward Bush.
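Here is a rough sketch of that point with entirely made-up numbers (a 90/10 split between landline and cell-only voters, and hypothetical support levels). The helper names true_share and polled_share are mine. It compares what the whole population thinks with what a poll sees when it can never reach one group.

```python
def true_share(groups):
    """Kerry support across everyone. groups holds tuples of
    (population_share, kerry_support, reachable_by_pollster)."""
    return sum(share * support for share, support, _ in groups)

def polled_share(groups):
    """Kerry support among only the people the pollster can reach."""
    reached = [(share, support) for share, support, ok in groups if ok]
    total = sum(share for share, _ in reached)
    return sum(share * support for share, support in reached) / total

# Hypothetical: cell-only voters lean Kerry and are never reached.
skewed = [(0.90, 0.49, True), (0.10, 0.60, False)]
print("truth:", true_share(skewed), "poll:", polled_share(skewed))

# Hypothetical: cell-only voters look just like everyone else.
harmless = [(0.90, 0.49, True), (0.10, 0.49, False)]
print("truth:", true_share(harmless), "poll:", polled_share(harmless))
```

In the first case the poll shows about 49% while the truth is about 50.1%. In the second case missing the group changes nothing, because the excluded people vote like the included ones.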

Much has been written about the failings of the likely voter models used by Gallup (and others). Why is this a problem?

First, it should be said that it is only a problem if their method for arriving at a "likely voter" is more likely to exclude supporters of one side than the other. Here are some reasons why this might skew the results toward Bush:

If Dems are more fired up this year than last time, then "true" likely voters are more likely to say that they didn't vote last time and/or haven't figured out yet where their polling place is. This might get them excluded from the sample.
Knowing that they have many or all of the problems above, some pollsters (e.g. Gallup) try to correct for them by weighting their poll to make sure they have the "right" proportions of Repubs. vs. Dems. This is really not statistical sampling at all, but is more in the category of imposing your own prior beliefs on the sample to get a "right" answer.

Why might this lead them astray? First, they might use the wrong percentages. The most common percentages to use are the actual percentages of party affiliation from the last election or from actual voter registration. These numbers are tilted slightly in favor of the Dems (by about 3-4 points). Gallup actually uses figures tilted THE OTHER WAY. Where did they get these? I can only say, "out of their ass."
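To see how much the choice of party percentages matters, here is a small sketch. All the numbers are hypothetical (they are not Gallup's or anyone else's actual figures), and weighted_topline is a name I made up. The same respondents, with the same answers, produce different headline numbers depending only on which party mix the pollster decides is "right".

```python
def weighted_topline(support_by_party, party_targets):
    """Overall Kerry share after weighting each party's respondents
    to the pollster's assumed share of the electorate."""
    assert abs(sum(party_targets.values()) - 1.0) < 1e-9
    return sum(party_targets[p] * support_by_party[p] for p in party_targets)

# Hypothetical Kerry support within each party group.
kerry_support = {"Dem": 0.90, "Rep": 0.08, "Ind": 0.50}

# Two hypothetical weighting targets: Dems +4 versus Repubs +4.
dem_tilt = {"Dem": 0.39, "Rep": 0.35, "Ind": 0.26}
rep_tilt = {"Dem": 0.35, "Rep": 0.39, "Ind": 0.26}

print("Dem-tilted weights:", weighted_topline(kerry_support, dem_tilt))
print("Rep-tilted weights:", weighted_topline(kerry_support, rep_tilt))
```

With these made-up inputs Kerry comes out at about 50.9% under the first set of weights and about 47.6% under the second. That is a swing of over 3 points with no change at all in what anyone told the pollster.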

Why might even those LV models based on "right" percentages be biased against Kerry? A big reason is that I (and many others too) think that a lot more Dems are going to come out and vote this year than did in 2000. So the ratio of Likely Voter Dems to Registered Dems is going to be a lot higher in fact than the historical record would suggest.
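A quick sketch of the turnout point, again with invented registration shares and turnout rates (electorate_mix is my own helper name). It shows how the party mix of actual voters shifts when only the Dem turnout rate goes up, while registration stays exactly the same.

```python
def electorate_mix(registered, turnout):
    """Party shares among actual voters, given each party's share of
    registered voters and the fraction of them who turn out."""
    voters = {party: registered[party] * turnout[party] for party in registered}
    total = sum(voters.values())
    return {party: count / total for party, count in voters.items()}

# Hypothetical registration shares and turnout rates.
registered = {"Dem": 0.38, "Rep": 0.34, "Ind": 0.28}
historical = {"Dem": 0.50, "Rep": 0.55, "Ind": 0.45}
fired_up = {"Dem": 0.60, "Rep": 0.55, "Ind": 0.45}  # only Dem turnout changes

print("historical turnout:", electorate_mix(registered, historical))
print("fired-up Dems:     ", electorate_mix(registered, fired_up))
```

With these numbers the Dem share of the electorate goes from about 38% to about 42%. A likely voter model tuned to the historical turnout rates would be weighting toward the first mix while the real electorate looks like the second.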

This is why a clever Republican will do everything possible to depress Democrats and inspire a defeatist attitude among them. The more they can do this, the more they can depress the turnout of fired up Dems.

The bottom line? ALL statistics rely on history in the sense that they are an attempt to see what people are doing based on the assumption that they are behaving the way they always did. This is what "model error" is: either people aren't behaving the way they always did, or the pollster never really captured their "true" behavior in the first place.

What do I think? I think this year is different. I don't think people are behaving like they always did. I think the Dems are very, very pissed off. I think the independents are too. This means that polls are not an accurate predictor of "truth" and the "margin of error" ain't even the half of it.