How to read a poll: Confidence intervals and all that

by plf515 for Math and Statistics Geeks

(This content is not subject to review by Daily Kos staff prior to publication.)

Tuesday, Jul. 12, 2011 at 1:06:15pm PDT

Whenever we see a poll, we see a margin of error, or confidence interval. These are always wrong. They are wrong, even if there are only two candidates, and they are even more wrong if there are more than two candidates. But they are simple.

The truth is complicated.

This complication exists even if we assume that the sample is a perfectly random sample of the population of voters. This assumption is ludicrous, but without it, things get really hairy. In fact, the truth is more complicated than this diary makes it out to be.

If you have only two candidates, then the results follow what is known as a binomial distribution. If you have more than two, they follow what is known as a multinomial distribution. "Distribution" is itself a statistical term. It means an assignment of probability to each possible outcome; in this case, to each proportion of the vote a candidate might get. In sampling, we try to estimate a population distribution from a sample distribution. Of course, our estimate isn't perfect, but, again assuming the sample is random, we can estimate how badly off it might be.
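To make the population-versus-sample idea concrete, here is a minimal sketch that draws one simulated poll from a made-up population. The candidate names, the shares, the seed and the `simulate_poll` helper are all assumptions for illustration, not anything from a real poll.

```python
import random
from collections import Counter

def simulate_poll(population_shares, n, seed=None):
    """Draw n random voters and return each option's share of the sample."""
    rng = random.Random(seed)
    options = list(population_shares)
    weights = [population_shares[o] for o in options]
    # Each respondent independently picks an option with the population
    # probabilities, so the counts follow a multinomial distribution.
    sample = rng.choices(options, weights=weights, k=n)
    counts = Counter(sample)
    return {o: counts[o] / n for o in options}

# Hypothetical population distribution (three options, so multinomial).
population = {"A": 0.50, "B": 0.40, "Other": 0.10}
sample_shares = simulate_poll(population, n=500, seed=1)

for name in population:
    print(f"{name}: population {population[name]:.0%}, sample {sample_shares[name]:.1%}")
```

The sample shares should land near the population shares without matching them exactly; changing the seed gives a different "poll" from the same population.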

There are a few problems with the way margins of error (MoE) are usually presented in polls.

First, we interpret them wrongly. Even if we used the right MoE (see below), our interpretation would be off. A confidence interval (CI) is given by the estimate plus or minus the MoE. The correct interpretation of a 95% CI is that, if the population value were X, then 95% of the time the sample value would fall within the MoE of X. What we usually assume is that, since the sample estimate is some value Y, we can be 95% sure that the population value is within the 95% CI around Y. That's wrong. This interpretation is VERY common; I've even fallen into it myself.
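The "if the population value were X" reading can be checked by simulation. The sketch below assumes a true value of 52%, a sample size of 400, and the classical MoE formula given later in this diary; the function names, the seed and the number of simulated polls are arbitrary choices for illustration.

```python
import random

def moe95(p, n):
    """Classical 95% margin of error for a proportion p and sample size n."""
    return 1.96 * (p * (1 - p) / n) ** 0.5

def share_of_samples_within_moe(p_true, n, polls, seed=0):
    """Fraction of simulated polls whose sample value is within the MoE of p_true."""
    rng = random.Random(seed)
    margin = moe95(p_true, n)
    hits = 0
    for _ in range(polls):
        # One poll: n independent respondents, each a "yes" with probability p_true.
        yes = sum(rng.random() < p_true for _ in range(n))
        if abs(yes / n - p_true) <= margin:
            hits += 1
    return hits / polls

# Assumed scenario: the true population value is 52% and each poll asks 400 people.
coverage = share_of_samples_within_moe(p_true=0.52, n=400, polls=2000)
print(f"{coverage:.1%} of simulated polls landed within the MoE of the true value")
```

The printed share should come out close to 95%, though not exactly, since the formula is an approximation and the simulation itself is random.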

A second wrong interpretation is that we assume either a) that all values within the CI are equally likely or b) that values outside the CI are impossible. Neither is correct. If our poll estimates that 52% will vote for Joe Shmo, then the most likely result is 52%; the farther you go from 52%, the less likely. The likelihood of any particular result is given by the likelihood function, and ANY result from 0% to 100% is possible; it's just that results far from 52% are very unlikely. (You COULD flip a fair coin 100 times and get 100 heads; it's not LIKELY, but it's POSSIBLE.)
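Here is a rough sketch of that likelihood function for the two-candidate (binomial) case. It assumes a hypothetical poll in which 208 of 400 respondents (52%) back Joe Shmo; the sample size and the `likelihood` helper are made up for illustration.

```python
from math import comb

def likelihood(p, k, n):
    """Binomial likelihood of population value p, given k supporters out of n polled."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical poll: 208 of 400 respondents (52%) say they will vote for Joe Shmo.
k, n = 208, 400
peak = likelihood(k / n, k, n)

# The likelihood is highest at 52% and falls off smoothly, but never reaches zero.
for p in (0.52, 0.50, 0.47, 0.42, 0.30):
    print(f"population value {p:.0%}: relative likelihood {likelihood(p, k, n) / peak:.3g}")

# The coin example: 100 heads in 100 fair flips is possible, just wildly unlikely.
print(f"P(100 heads in 100 flips) = {0.5 ** 100:.3g}")
```

Nothing special happens at the edge of the CI; the numbers just keep shrinking as you move away from 52%.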

But we also give the wrong MoE, because we give a single MoE for each poll, and that's not right. The classical formula for a 95% MoE is

1.96*(pq/n)^.5,

where p is the proportion saying something, q = 1-p and n is sample size.

This is approximately accurate, and the approximation is pretty good for polls, where n is usually pretty big and we aren't interested in very rare events. It doesn't work well for estimating very rare things, like the prevalence of rare diseases, but it's OK for polls. It does, however, give a different MoE for each candidate. When there are two candidates who get all (or almost all) of the votes, this difference doesn't matter too much. For example, if we poll 400 people and 60% say they will vote for Obama, 35% for Bachmann (should she be the Repub. nominee) and 5% for someone else, then the MoEs for the two main candidates are
Obama 4.80%
Bachmann 4.67%

But the pollsters like to give ONE MoE, so they use an even simpler formula:
0.98/n^.5; this is only exactly correct if p = .5

For the above, it would give
Obama 4.9%
Bachmann 4.9%

not far off.
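For anyone who wants to check the arithmetic, here is a small sketch that evaluates both formulas for the 400-person example. The function names are mine, not standard ones.

```python
def moe_classical(p, n):
    """Classical 95% MoE: 1.96 * (p*q/n)^.5, with q = 1 - p."""
    return 1.96 * (p * (1 - p) / n) ** 0.5

def moe_simple(n):
    """Pollsters' single MoE: 0.98 / n^.5, the classical formula evaluated at p = .5."""
    return 0.98 / n ** 0.5

n = 400
for name, p in [("Obama", 0.60), ("Bachmann", 0.35), ("Someone else", 0.05)]:
    print(f"{name}: classical {moe_classical(p, n):.2%}, simple {moe_simple(n):.2%}")
```

Because pq is largest when p = .5, the simple formula is the biggest MoE the classical formula can give for that n. It is close for candidates near 50% and too big for candidates far from it.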

But what if we are polling a primary? A recent Iowa poll of 500 Repubs gave these results:

Bachmann 25%
Romney 21%
Pawlenty 9%
Cain 9%
Paul 6%
Gingrich 4%
Santorum 2%
Huntsman 1%

The poll reported an MoE of 4.4%, which comes from the simple formula 0.98/n^.5. But the right ones, using the formula 1.96*(pq/n)^.5, are different for each candidate:

Bachmann 3.8%
Romney 3.6%
Pawlenty 2.5%
Cain 2.5%
Paul 2.1%
Gingrich 1.7%
Santorum 1.2%
Huntsman 0.9%

There are still problems with Huntsman's (the approximation behind the formula gets poor when p is this close to 0), but these are much more reasonable figures. They are asymptotically accurate.
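The per-candidate figures can be reproduced with a few lines. This sketch uses the Iowa percentages quoted above and n = 500; the `per_candidate_moe` helper is just an illustrative name.

```python
def per_candidate_moe(results, n):
    """Classical 95% MoE, 1.96 * (p*q/n)^.5, computed separately for each candidate."""
    return {name: 1.96 * (p * (1 - p) / n) ** 0.5 for name, p in results.items()}

# Iowa poll results quoted in this diary (500 respondents).
iowa = {
    "Bachmann": 0.25, "Romney": 0.21, "Pawlenty": 0.09, "Cain": 0.09,
    "Paul": 0.06, "Gingrich": 0.04, "Santorum": 0.02, "Huntsman": 0.01,
}
n = 500
published = 0.98 / n ** 0.5  # the single MoE from the simple formula
moes = per_candidate_moe(iowa, n)

for name, p in iowa.items():
    print(f"{name:<9} {p:>4.0%}  MoE {moes[name]:.1%}  "
          f"lower bound {p - moes[name]:.1%} (with the single MoE: {p - published:.1%})")
```

With the single 4.4% figure, the lower bounds for the bottom three candidates go negative, which is impossible for a vote share. With the per-candidate formula Huntsman's lower bound stays just above zero, which is why his figure is still a bit suspect.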