This is my first diary, and I thought I would use it to dig a little deeper into the polls; not their content, but their general meaning.
Today we're inundated with polls; every day we see polls from Rasmussen, Politico, Gallup and now DKos, all trying to put some perspective on the current race for president. After spending a long time glued to them, and even longer trying to avoid them, I thought I would actually pick at them a little and try to explain precisely why they shouldn't be trusted (with apologies to DKos!).
Bear with me; there's a little math and some explanation of statistics here, but I'll keep it simple. Read on...
So, what's the problem with polls? I'm not calling their methodology into question, nor will I subscribe to the view that pollsters 'fix' their polls by sifting through the samples to find a desired outcome. True, some polls (especially the more partisan ones) will tend to provide larger brackets, or assign larger proportions to one party affiliation over another, but generally, for the sake of argument, I will assume their methodology is correct.
What I will say is that, as a population, we generally do not truly understand what a poll is saying. I do not really ascribe this to ignorance on the part of the reader, but rather to a general attitude in the population that statistics, in general, do not lie. Well, they don't (usually), but we do need to explain what they actually mean.
Terms
There are several terms that need explanation here:
• Population — the people being sampled from
• Sample — the number of people who answer questions in a given poll
• Confidence level — the level of certainty you have regarding your results (generally 95% in a poll)
• Confidence interval — the "error bars" of a given poll (usually expressed as ± x%)
Statistics, Damned Statistics and More Statistics
Now, in the 2004 General Election, there were a total of 122,295,345 votes cast. For a population that size, the standard method for determining sample size at an acceptable confidence level gives a figure of around 1,250, which is within a few hundred of the sample size of the average poll published by most pollsters. Recently, we have seen the polls are incredibly close, usually within the confidence interval of ±2.5 to 2.9%. We take it as read that this means that the election is close, but is it really?
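That figure of roughly 1,250 isn't magic; it falls out of the standard sample-size formula for a proportion. Here's a minimal sketch in Python (assuming the usual z-score of 1.96 for 95% confidence and the worst-case split of p = 0.5; the function name is my own):

```python
def sample_size(moe, z=1.96, p=0.5):
    """Raw sample size needed for a margin of error `moe` (as a fraction)
    at the confidence level implied by z-score `z`, assuming proportion `p`.
    In practice you'd round the result up."""
    return z**2 * p * (1 - p) / moe**2

# A +/-2.8% margin of error at 95% confidence:
print(f"{sample_size(0.028):.0f}")  # prints 1225, i.e. the ~1,250 quoted above
```

Notice that the population size (122 million) barely appears: for any large population, the required sample depends almost entirely on the error bars you want, which is why national polls can get away with ~1,250 respondents.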
The general assumption that has to be made in any poll is that the population is relatively homogeneous, in other words, Republicans and Democrats are evenly spread and commingled throughout the population. This is patently untrue, of course; there are red strongholds in blue states and vice versa. This should immediately sound alarm bells when you read any poll, regardless of what it says.
Generally speaking, weighting is applied based on voter registrations and other polls that give the relative strengths of the parties in the country as a whole. With no access in most polls to information about where the respondents reside, any extrapolation based on raw data is meaningless. If half of the sample were in Democratic strongholds in otherwise Republican states, the weighting would effectively diminish or even wipe out those results. Although pollsters generally try to reduce that effect, there is no meaningful way of eliminating it; it can be mitigated, but not removed.
So What?
The problem of interpretation rears its ugly head in all polls, but in no case so clearly as in the current crop of opinion polls where the scores are essentially 'tied'. Let us examine an average poll, say the latest DKos poll (and I stress I am not impugning DKos' methodology!):
RESULTS FROM SINGLE DAY SAMPLE (361 respondents, MoE 5.1)
361 is a very small sample size, hence the large error bars (confidence interval). Like all polls, the confidence level is around 95%. The results (McCain 47%, Obama 47%) seem to indicate a tied race, but with a CI of 5.1, you cannot be sure that the true picture is not McCain 41.9%, Obama 52.1% (or vice versa!). This is not an indictment of polling, or of the methodology, but rather a statement of fact; there is, in fact, no method that can determine the true picture of what is happening, except to increase the sample size.
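That 5.1-point figure can be recovered from the sample size alone. A quick sketch under the same textbook assumptions (z = 1.96 for 95% confidence, worst-case p = 0.5; the tiny gap from the published 5.1 is just rounding):

```python
import math

def margin_of_error(n, z=1.96, p=0.5):
    """Margin of error (as a fraction) for a simple random sample of n people."""
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(361)
print(f"MoE: +/-{moe * 100:.1f}%")  # prints 5.2, close to the quoted 5.1
print(f"47% could really be anywhere from "
      f"{(47 - moe * 100):.1f}% to {(47 + moe * 100):.1f}%")
```

The key point is the square root in the denominator: the error bars shrink slowly as the sample grows, which is exactly why a single-day subsample of 361 is so much fuzzier than the full poll.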
But by how much? Well, to get a 99% confidence level with the current methodology, you'd need to increase your sample size to 2,169. To shrink your error bars to one percent at that confidence level, you'd need to poll 16,639 people! There's a handy tool here that shows just how large an impact small changes in confidence level or interval make on the required sample size.
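Those numbers can be sanity-checked with the same formula. A sketch, assuming a z-score of roughly 2.58 for 99% confidence (this reproduces 2,169 exactly and lands within a couple of respondents of 16,639; the small remaining gap presumably comes from a finite-population correction in the original tool):

```python
def sample_size(moe, z, p=0.5):
    """Raw sample size for margin of error `moe` (fraction) at z-score `z`."""
    return z**2 * p * (1 - p) / moe**2

Z99 = 2.58  # approximate z-score for a 99% confidence level

print(f"{sample_size(0.0277, Z99):.0f}")  # 99% confidence, ~2.77% MoE: prints 2169
print(f"{sample_size(0.01, Z99):.0f}")    # 99% confidence, 1% MoE: prints 16641
```

The lesson: tightening the error bars from ~2.8% to 1% roughly octuples the sample, because the margin of error shrinks only with the square root of n.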
Bear this in mind every time you read a poll; 95% confidence sounds good, but it still means one poll in twenty will miss even its own error bars. A 2.8% MoE sounds nice, but it's simply a statement of the limitations of all polls, a consequence of statistics, not of methodology.
If you add confidence intervals, confidence levels and the non-uniform distribution of voting populations into the mix, it becomes apparent that polls can only give you the most marginal of images of what is happening on the ground. And I won't even get started on the influx of new voters, which changes the underlying population enormously!
Anyway, that's my first diary. If it's unhelpful or pointless, I will remove it, but I thought I might give a brief discussion to help people understand that the polls are not, and never can be, a truly reliable picture of what the country is like; at best they give a very general impression. I'm not saying it's not close, but I want you to come away with the knowledge, at least, that even a crop of "panic polls" is truly no cause for alarm, but should always be a rallying call to action.