Yes, you read that right. I have just calculated that there is only a one in 9,432,472,254 probability that the exit polls were wrong purely by chance.
And that is not working from screen-shots, that is working from the data presented in the Edison Mitofsky report.
Not only that, but it happened in 1996 too! A one in 268 probability of occurring by chance!
And in 1992! A one in 5007 probability of being due to chance!
And in 1998! A one in 49,827 probability of being due to chance!
Er, that's funny, what happened in 2000? Oh, here it is - yes, polls were wrong again! Only 1 in 3 probability this time, though.
But get this - they all made the same error! Yes, that's right, they all over-estimated the Democratic vote! Even in 2000.
So what can we conclude from this?
Read on.....
What this tells us is that in every single one of the last five elections, the exit polls have consistently over-estimated the Democratic vote, significantly so (regarding 1 in 20 as "significant", an arbitrary criterion adopted by social scientists) in every year except 2000.
So the next question to ask is: was the over-estimate significantly greater in 2004?
Another preliminary way of asking the question is: does the degree of Democratic over-estimate vary significantly from year to year? We can test this using a repeated-measures ANOVA (analysis of variance), and the answer is yes: there are significant differences between years in the amount of Democratic over-estimate in the polls (probability of this being a chance finding? One in a billion, since you ask).
So we then ask: was this year significantly out of line? We perform a "planned comparison" and compare 2004 with all previous years. And yes, this year was worse. Significantly worse. There is a one in 30,000 probability that this year would have been different just by chance. However, if we look at the "least significant difference" - i.e. compare this year with the worst of the four previous years - we find that this year was not significantly worse than 1992. 1992 was, however, significantly worse than the remaining three years (2000, 1996 and 1998). The probability of 1992 being worse than those three years simply by chance? 1 in 19 million.
I hope the question you are now asking yourselves (those of you who are not acquainted with the weird and wacky quirks of parametric statistics) is:
How come the Democratic over-estimate in 2004 was significant at a probability of 1 in nine billion, but was not significantly worse than the Democratic over-estimate in 1992, which was significant at a probability of only one in five thousand?
Well, the partial answer is that probability values are a very poor proxy for effect size. The probability value tells you how confident you can be that the difference was not due to chance. It does not tell you the size of the difference. The mean "within precinct error" (WPE - the difference between the exit poll estimate of the vote and the actual vote for each precinct) in 2004 was -6 (the negative value tells you it was a Democratic over-estimate), whereas in 1992 it was -5.11. Not a big difference. However, the 2004 figure was more significant.
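To see how the same-sized bias can come out more or less "significant", here is a toy sketch in Python. The numbers are invented for illustration (they are not EM's data): two sets of WPEs with exactly the same mean, where the tighter set produces a far larger t statistic, and hence a far smaller probability value.

```python
import math
import statistics

def t_vs_zero(values):
    """t statistic for testing whether the mean of `values` differs from 0."""
    n = len(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return statistics.mean(values) / se

# Two hypothetical sets of state WPEs with the same mean (-5.5),
# but very different spreads.
noisy = [-1, -10, -3, -8, -2, -9, -4, -7]   # mean -5.5, large spread
tight = [-5, -6, -5, -6, -5, -6, -5, -6]    # mean -5.5, small spread

# Same effect size; the tighter sample gives a much more extreme t,
# and therefore a much smaller p-value.
print(round(t_vs_zero(noisy), 2), round(t_vs_zero(tight), 2))
```

Same bias, very different "significance" - which is exactly why the probability value alone tells you little about the size of the over-estimate.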
To backtrack a bit: there are two sorts of error we are concerned with. One is sampling error. Imagine you have a bag containing equal numbers of red and blue balls, and you keep selecting ten at a time, at random, and chucking them back in. Let's also say we have a scoring system whereby a red ball gives you a point, but a blue ball knocks a point off. On average you will get five of each colour, and your score will be zero. Sometimes, of course, you will get more of one colour than the other, but your average score, over several goes, will still be 0. Now say you have a touch of ESP, and you can sense the red colour through your fingertips (or someone has snuck extra red balls into the bag). Sometimes you will still get 4 red ones and 6 blue ones (unless your ESP is really brilliant). Sometimes you will get 7 red ones and 3 blue ones. But the average over your picks will be greater than zero, because you will tend to get more red balls than blue ones.
The sampling error is the error you get just because you won't always get the same score for your pick, even if you pick completely at random and there are equal numbers of each colour in the bag. Sometimes you will get a minus score and sometimes a plus score. If we knock the minus signs off the minus scores, we get the absolute or unsigned error - which just tells you how inaccurate your samples are. The more balls you pick each time (say a hundred balls per go), the smaller your unsigned error will be. However, if you keep the minus signs in and average your signed error, it will not tell you your sampling error, it will tell you your sampling bias. If you end up with an average score of 5, we know you have ESP for red balls (or that someone snuck extra red balls into the bag). If your score is -5, we know your ESP sucks (it tells you the balls are red when they are blue and vice versa), or that someone snuck extra blue balls into the bag. Another analogy is a shooting range: your average distance from the target will tell you how badly you suck at hitting the target; your average distance above the target will tell you whether you tend to shoot high or not. You might be very consistent at shooting 1 inch higher than the target; or you might be like me and shoot completely at random.
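If you want to see the difference between sampling error and sampling bias for yourself, here is a quick Python sketch of the balls-in-a-bag game (my toy numbers, nothing to do with EM's data). A fair bag gives an unsigned error well above zero but a signed error hovering near zero; a bag with extra red balls snuck in shows up immediately in the signed error.

```python
import random
import statistics

def draw_score(n_red, n_blue, picks=10):
    """Pick `picks` balls at random with replacement: red = +1, blue = -1."""
    bag = [1] * n_red + [-1] * n_blue
    return sum(random.choice(bag) for _ in range(picks))

random.seed(1)

# Fair bag: the signed error averages out, the unsigned error does not.
fair = [draw_score(50, 50) for _ in range(2000)]
signed = statistics.mean(fair)                      # hovers near 0
unsigned = statistics.mean(abs(s) for s in fair)    # clearly above 0

# Bag with extra red balls snuck in: the signed error exposes the bias.
biased_mean = statistics.mean(draw_score(60, 40) for _ in range(2000))

print(round(signed, 2), round(unsigned, 2), round(biased_mean, 2))
```

Picking a hundred balls per go instead of ten would shrink the unsigned error, but the biased bag's signed error would still sit stubbornly around +2.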
Right. Back to the Mitofsky Edison WPEs. EM have provided us with the average WPE (signed) for every state, for the last five elections. Some are positive (Republican over-estimate) and some are negative (Democratic over-estimate). However, the average of the averages (across all states), in every year, is less than 0. Remember, if there is only sampling error, and no sampling bias, each state's average should be close to 0, and the average of the state averages should be even closer to 0. However, what we find is that in all the years except 2000 the average of the states' WPEs is significantly less than 0. How significant it is depends partly on the size of the signed error (the bias) but also on the size of the unsigned error (the sampling error). The more accurate the sampling, the more significant the bias. You could even argue that the significance of this year's bias is a tribute to the accuracy of the sampling, but I'm not going to let EM get away with that.
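The test behind these probability values is, in essence, a one-sample t-test of the state mean WPEs against zero. A minimal sketch, using made-up state WPEs rather than EM's actual figures:

```python
import math
import statistics

def one_sample_t(values, mu=0.0):
    """t statistic for H0: the population mean of `values` equals mu."""
    n = len(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - mu) / se

# Hypothetical state-average signed WPEs (negative = Democratic over-estimate).
wpes = [-6.2, -4.8, -7.1, -5.5, -3.9, -6.8, -5.0, -4.4, -7.6, -5.9]

t = one_sample_t(wpes)
print(round(t, 2))  # strongly negative: bias, not just sampling noise
```

If there were only sampling error and no bias, the state averages would straddle zero and t would sit near 0; a t this far from zero is what drives those tiny probability values.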
So to summarise so far: there was a Democratic over-estimate in each of the last five elections. There was a significant Democratic over-estimate in four out of the last five. In 2004 the bias was significantly larger than you would expect, given the variability of the bias over the last five elections, though not significantly larger than in 1992. And it was massively significant.
So why was there a bias? This is the really important question, not the probability value of the bias. There are two main contenders:
- Bush voters were shy, and managed to avoid the inexperienced pollsters, leading to sampling bias.
- There was no sampling bias at all: what was wrong was the count. In other words, Democratic voters were not over-sampled at all; it was their votes that were under-counted. They thought they'd voted, but their votes never made the tally.
It strikes me there is historical evidence for both, though here I am no expert, not being an American. I think I am right in saying that spoilage rates tend to be higher in poorer and more Democratic areas. If people think they've voted for a Democrat, but their vote is thrown out, this may well be reflected in apparent "over-sampling" of Democrats at the exit polls. On the other hand, it doesn't explain the fluctuation from year to year, and especially the lack of apparent bias in 2000, where we know there were anomalies, most famously the Democrats who thought they'd voted for Gore on the butterfly ballot, but had actually voted for Buchanan. So there is an argument for a built-in Democratic sampling bias, historically.
As the mean signed WPE for most states is not zero, I decided to see whether there was a relationship between state "colour" (as measured by the margin between the Democratic and Republican candidates) and the direction of the WPE in each year. To do this I simply subtracted the percentage of votes cast in each state for the Republican candidate from the percentage cast for the Democrat. Blue states therefore have a positive margin, and Red states have a negative margin. I did this for each of the years for which EM have given us the WPE. I did it in a number of ways, and I won't bore you with the details at this stage. Suffice it to say that when the years are pooled, there is a very strong, significant negative correlation between the signed WPE and the state colour (1 in 2711 probability of occurring by chance). This means that the bluer the state, the greater the Democratic "over-estimate" in the poll. Moreover, the effect is strongest in the two years in which the sampling bias was greatest (2004 and 1992). Using Spearman's rho (a non-parametric statistic that is less subject to leverage by outlying data points), the correlation for 2004 was -0.386, and for 1992 it was -0.410. The probability (if you still care) of the 2004 correlation occurring by chance was 1 in 177; for 1992, the probability was 1 in 261.
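For the curious: Spearman's rho is just Pearson's correlation computed on ranks instead of raw values, which is what makes it resistant to outliers. A sketch with invented state margins and WPEs (not the real data, and ignoring tied ranks for simplicity):

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return num / den

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks (no tie handling)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: state margin (Dem % minus Rep %, so blue = positive)
# against mean signed WPE - bluer states showing a bigger over-estimate.
margins = [20, 15, 8, 3, -2, -7, -12, -18]
wpes = [-9, -6, -7, -5, -3, -4, -2, -1]

rho = spearman_rho(margins, wpes)
print(round(rho, 3))  # strongly negative
```

A negative rho here is the pattern described above: the bluer the state, the more negative the signed WPE.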
Moreover, according to EM, in these two years turnout was high (55% in both years, around 50% in the other years) and the percentage of voters "paying a lot of attention to the campaign" was also high (over 66%, in both years as compared with under 50% in the other years). Recall that Perot was on the ballot in 1992.
What this appears to be saying is that in years in which the election has a high profile, the signed WPE has been significantly more negative (Democratic over-estimate), and that this bias has been significantly greater the more Democratic the state. One interpretation of this is that Bush voters were more inclined to avoid being polled in these years, and that they were even more inclined to do so in Blue states than in Red states, which would make some psychological sense. Another interpretation is that where the election is seen as critical, someone stuffs the ballot boxes with Red votes.
Unfortunately, the statistics cannot tell us which of these it is, although the fact that 2000 comes up clean argues, to my mind, against the fluctuating-fraud theory. So my hunch - and this is simply hunch, not stats - is that the shy Bush voter theory has legs.
This does NOT mean it is the only factor affecting the WPE (although remember the WPE is not significantly worse this year than in 1992). Moreover, these analyses do not rule out a bit of electronic ballot stuffing getting under the statistical radar. There is plenty of unexplained WPE variance still to account for (in fact, reducing the noise due to this "colour" effect actually increases the significance of the Kerry over-estimate in 2004 relative to previous years).
But to get any closer, we have to look at individual precinct level data. Unfortunately we do not have the uncorrected weights for the precinct level data that EM have released, and we only have the results of EM's analyses (with no proper methods section) of the WPEs, which are presumably based on the uncorrected weights. However, the precinct level data may still yield some gold.
I'm working on it - I hope lots of others are too.
And as a final comment - I got into my first mini-flame war on an exit poll diary earlier today. I am not claiming special expertise, and if anyone cares to fault my stats, I am only too willing to be corrected. But I thought it was time some of this was put in perspective, particularly the significance of significance values (which are not very significant....). I am not a statistician or a pollster - my bachelors were in music and architecture, and I am now a "mature" PhD student working in the field of cognitive neuroscience, for which I use a lot of multivariate statistics. I also coach stats to undergrads, especially to dyslexic undergrads. That's all my credentials. Oh, and I would really like to see fraud proven and Bush impeached.
Boy, would I.
Update: I got my stuffed votes the wrong colour: amended. The hypothesis is stuffing of Red votes. In the UK red votes are the Labour votes, blue votes the Tories. It trips me up every so often. Apologies.