A bit over two weeks ago, a group of statistics wizards (Mark Grebner, Michael Weissman, and Jonathan Weissman) approached me with a disturbing premise -- they had been poring over the crosstabs of the weekly Research 2000 polling we had been running, and were concerned that the numbers weren't legit.
I immediately began cooperating with their investigation, which concluded late last week. Daily Kos furnished the researchers with all available and relevant information in our possession, and we made every attempt to obtain R2K's cooperation -- which, as I detail in my reaction post here, was not forthcoming. The investigators' report is below, but its conclusion speaks volumes:
We do not know exactly how the weekly R2K results were created, but we are confident they could not accurately describe random polls.
The full report follows -- kos
R2K polls: Problems in Plain Sight
Mark Grebner, Michael Weissman, and Jonathan Weissman
For the past year and a half, Daily Kos has been featuring weekly poll results from the Research 2000 (R2K) organization. These polls were often praised for their "transparency", since they included detailed cross-tabs on sub-populations and a clear description of the random dialing technique. However, on June 6, 2010, FiveThirtyEight.com rated R2K as among the least accurate pollsters in predicting election results. Daily Kos then terminated the relationship.
One of us (MG) wondered if odd patterns he had noticed in R2K's reports might be connected with R2K's mediocre track record, prompting our investigation of whether the reports could represent proper random polling. We've picked a few initial tests partly based on which ones seemed likely to be sensitive to problems and partly based on what was easy to read by eye before we automated the data download. This posting is a careful initial report of our findings, not intended to be a full formal analysis but rather to alert people not to rely on R2K's results. We've tried to make the discussion intelligible to the lay reader while maintaining a reasonable level of technical rigor.
The three features we will look at are:
- A large set of number pairs which should be independent of each other in detail, yet almost always are either both even or both odd.
- A set of polls on separate groups which track each other far too closely, given the statistical uncertainties.
- The collection of week-to-week changes, in which one particular small change (zero) occurs far too rarely. This test is particularly valuable because the reports exhibit a property known to show up when people try to make up random sequences.
1. Polls taken of different groups of people may reflect broadly similar opinions, but there should be no connection between their minor random details. Let's look at a little sample of R2K's recent results for men (M) and women (F).
6/3/10         Favorable        Unfavorable       Undecided
Question       Men    Women     Men    Women      Men    Women
Obama           43     59        54     34          3      7
Pelosi          22     52        66     38         12     10
Reid            28     36        60     54         12     10
McConnell       31     17        50     70         19     13
Boehner         26     16        51     67         33     17
Cong. (D)       28     44        64     54          8      2
Cong. (R)       31     13        58     74         11     13
Party (D)       31     45        64     46          5      9
Party (R)       38     20        57     71          5      9
A combination of random sampling error and systematic difference should make the M results differ a bit from the F results, and in almost every case they do differ. In one respect, however, the numbers for M and F do not differ: if one is even, so is the other, and likewise for odd. Given that the M and F results usually differ, knowing that, say, 43% of M were favorable (Fav) to Obama gives essentially no clue as to whether 59% or 60% of F would be. Thus knowing whether M Fav is even or odd tells us essentially nothing about whether F Fav would be even or odd.
Thus the even-odd property should match about half the time, just like the odds of getting both heads or both tails if you tossed a penny and nickel. If you were to toss the penny and the nickel 18 times (like the 18 entries in the first two columns of the table) you would expect them to show about the same number of heads, but would rightly be shocked if they each showed exactly the same random-looking pattern of heads and tails.
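As a quick sanity check, here is a minimal Python sketch that tallies the parity matches in the nine rows shown above (the numbers are simply transcribed from the 6/3/10 table; this is an illustration, not the full 778-pair tabulation):

```python
# Tally even-odd matches in the nine rows shown above (6/3/10 table).
rows = {                                  # (men, women) percentages for Fav, Unf, Und
    "Obama":     [(43, 59), (54, 34), (3, 7)],
    "Pelosi":    [(22, 52), (66, 38), (12, 10)],
    "Reid":      [(28, 36), (60, 54), (12, 10)],
    "McConnell": [(31, 17), (50, 70), (19, 13)],
    "Boehner":   [(26, 16), (51, 67), (33, 17)],
    "Cong. (D)": [(28, 44), (64, 54), (8, 2)],
    "Cong. (R)": [(31, 13), (58, 74), (11, 13)],
    "Party (D)": [(31, 45), (64, 46), (5, 9)],
    "Party (R)": [(38, 20), (57, 71), (5, 9)],
}

pairs = [pair for triple in rows.values() for pair in triple]
matches = sum(1 for men, women in pairs if men % 2 == women % 2)
print(f"{matches} of {len(pairs)} M-F pairs have matching parity")  # 27 of 27 here
```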
Were the results in our little table a fluke? The R2K weekly polls report 778 M-F pairs. For their favorable ratings (Fav), the even-odd property matched 776 times. For unfavorable (Unf) there were 777 matches.
Common sense says that that result is highly unlikely, but it helps to do a more precise calculation. Since the odds of getting a match each time are essentially 50%, the odds of getting 776/778 matches are just like those of getting 776 heads on 778 tosses of a fair coin. Results that extreme happen less than one time in 10^228. That's one followed by 228 zeros. (The number of atoms within our cosmic horizon is something like 1 followed by 80 zeros.) For the Unf, the odds are less than one in 10^231. (Having some Undecideds makes Fav and Unf nearly independent, so these are two separate wildly unlikely events.)
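For readers who want to verify that figure, here is a minimal sketch of the tail-probability calculation. (The report's own computations were done in Matlab and an online tool, per the acknowledgments; this Python version is just illustrative.)

```python
# Chance of at least 776 even-odd matches in 778 pairs if each match were really
# a 50-50 coin flip; done in log10 to sidestep any underflow concerns.
from math import comb, log10

n, k_min = 778, 776
tail_outcomes = sum(comb(n, k) for k in range(k_min, n + 1))   # ways to get >= 776 matches
log10_p = log10(tail_outcomes) - n * log10(2)                  # log10 of the tail probability
print(f"P(at least {k_min} matches out of {n}) ~ 10^{log10_p:.1f}")  # about 10^-228.7
```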
There is no remotely realistic way that a simple tabulation and subsequent rounding of the results for M's and F's could possibly show that detailed similarity. Therefore the numbers on these two separate groups were not generated just by independently polling them.
This does not tell us whether there was a minor "adjustment" to real data or something more major. For that we turn to the issue of whether the reports show the sort of random weekly variations expected to arise from sampling statistics.
2. Polls taken by sampling a small set of N respondents from a larger population show sampling error due to the randomness of who happens to be reached. (The famous "margin of error" (MOE) describes this sampling error.) If you flip a fair coin 100 times, it should give about 50 heads, but if it gives exactly 50 heads each time you try, something is wrong. In fact, if it is always in the range 49-51, something is wrong. Although unusual poll reproducibility can itself occur by accident, just as exactly 50 heads can happen occasionally, extreme reproducibility becomes extremely unlikely to happen by chance.
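To put numbers on that coin analogy, the sketch below computes how often 100 fair flips land in the narrow 49-51 band, and how unlikely it would be to land there on every one of, say, 20 independent tries (the 20 is an assumed number of repetitions, purely for illustration):

```python
# Probability of 49, 50, or 51 heads in 100 fair flips, and of hitting that
# band on 20 straight tries (illustrative numbers only).
from math import comb

p_band = sum(comb(100, k) for k in range(49, 52)) / 2**100
print(f"P(49-51 heads in 100 flips) ~ {p_band:.3f}")            # about 0.24
print(f"P(in that band 20 times in a row) ~ {p_band**20:.1e}")  # about 3e-13
```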
To see whether the results showed enough statistical variation, we use several techniques to isolate the random sampling fluctuations from changes of opinion over time. First, we focus on small demographic subsets, since smaller samples show more random variation. Second, we consider the differences between categories whose actual time course should approximately match, to remove those shared changes and make it easier to see whether there is enough statistical noise. Finally, we make use of the different time courses of opinion change and random statistical noise. The former tends to show up as slow, smooth, cumulative changes, while the latter appears as abrupt movements in poll numbers, up or down, that aren't sustained in subsequent polls. An easy way to separate the fast from the slow is to look only at the differences from week to week, in which the fast changes are undiminished but the slow changes hardly show up.
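The sketch below illustrates the differencing idea on purely synthetic data (an assumed slow 5-point drift plus Gaussian weekly noise; nothing here is taken from R2K): the week-to-week differences barely feel the slow drift, but they keep, and double, the random sampling variance.

```python
# Synthetic example: slow "true" opinion drift plus independent weekly noise.
import random
random.seed(1)

weeks = 60
noise_sd = 2.0                                                 # assumed sampling noise (std dev, points)
trend = [50 + 5 * w / weeks for w in range(weeks)]             # slow, smooth 5-point drift
reported = [t + random.gauss(0, noise_sd) for t in trend]      # drift + noise

changes = [reported[w + 1] - reported[w] for w in range(weeks - 1)]
mean_sq_change = sum(c * c for c in changes) / len(changes)
print(f"average squared weekly change ~ {mean_sq_change:.1f} "
      f"(expected ~ 2 * {noise_sd ** 2:.0f} = {2 * noise_sd ** 2:.0f}; the drift barely contributes)")
```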
At one point R2K changed its target population, sample size (N), and the categories used for crosstabs. For the first 60 weeks, R2K reported N=2400, and the set of questions changed occasionally; during the final 14 weeks, N=1200 and the same 11 three-answer questions were used each week. We analyzed the two sets of results separately since most simple statistical tests are not designed for that sort of hybrid.
We took advantage of the small numbers of respondents reporting political party affiliations of "independent" (about 600 per week) and "other" (about 120 per week) in the first 60 weeks. We tracked the difference between Obama's margin (Fav-Unf, the percent favorable minus the percent unfavorable) among independents and his margin among "others". This quantity should have a large helping of statistical noise, not obscured by much systematic change.
We quantify the fluctuations of the difference between those margins via its variance, i.e. the average of the squared deviation of that difference from its own average. (The units for variance here are squared-percent, which may take some getting used to.) The expected random variance in either margin is known from standard statistical methods:
variance = [100*(Fav + Unf) - (Fav - Unf)^2] / N     (1)
(Essentially the same calculation is used all the time to report an MOE for polls, although the MOE is reported in plain percents, not squared.)
The expected variance of the sum or difference of two independent random variables is just the sum of their expected variances. (That simple property is the reason why statisticians use variance rather than, say, standard deviation as a measure of variation.) For the Obama approval rating (the first and only one we checked) the average expected variance in this difference of margins over the first 60 weeks was 80.5, i.e. a typical random spread of about +/- 9 points around the average value.
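As a concrete check on the size of that number, here is a sketch that plugs the first reported week (1/8/09, taken from the table below) into formula (1), using the approximate sub-sample sizes quoted above (~600 independents, ~120 "other"); a single week comes out in the same ballpark as the 80.5 average.

```python
# Formula (1) applied to the 1/8/09 Obama numbers for independents and "other".
def margin_variance(fav, unf, n):
    """Expected sampling variance, in squared percent, of the margin Fav - Unf."""
    return (100 * (fav + unf) - (fav - unf) ** 2) / n

var_ind = margin_variance(68, 25, 600)      # margin among independents, N ~ 600
var_oth = margin_variance(69, 23, 120)      # margin among "other", N ~ 120
var_diff = var_ind + var_oth                # variances add for independent samples
print(f"{var_ind:.1f} + {var_oth:.1f} = {var_diff:.1f} squared percent")  # ~12 + ~59 = ~71
```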
With that in mind, consider what R2K reported in their first 8 weekly polls:
Attitude toward Obama among "Independents" and "Other"

Week ended     Independents            Other                  Diff
               Fav   Unfav   Marg      Fav   Unfav   Marg
1/ 8/09        68  -  25  =  43        69  -  23  =  46        -3
1/15/09        69  -  24  =  45        70  -  22  =  48        -3
1/22/09        82  -  15  =  67        84  -  13  =  71        -4
1/29/09        80  -  16  =  64        82  -  16  =  66        -2
2/ 5/09        73  -  21  =  52        72  -  23  =  49        +3
2/12/09        72  -  20  =  52        71  -  22  =  49        +3
2/19/09        71  -  21  =  53        72  -  22  =  50        +3
2/26/09        70  -  20  =  57        75  -  21  =  54        +3
There's far less noise than the minimum predicted from polling statistics alone.
Looking over the full 60-week set, the variance in the reports was only 9.947. To calculate how unlikely that is, we need to use a standard tool, a chi-squared distribution, in this case one with 59 degrees-of-freedom. The probability of getting such a low variance via regular polling is less than one in 10^16, i.e. one in ten million billion.
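Here is a rough reconstruction of that tail probability (the inputs are the figures quoted in the text; the report's own calculation may have used week-by-week expected variances rather than the 80.5 average, so treat this as an order-of-magnitude check):

```python
# Chi-squared tail check with assumed inputs from the text.
from scipy.stats import chi2

expected_var = 80.5       # average expected sampling variance (squared percent)
observed_var = 9.947      # variance actually seen in the reports
dof = 59                  # 60 weeks - 1

stat = dof * observed_var / expected_var    # scaled sample variance ~ chi-squared(dof) under H0
p_low = chi2.cdf(stat, dof)                 # chance of a variance at least this low
print(f"chi2 statistic = {stat:.1f} on {dof} dof, P ~ {p_low:.0e}")   # ~2e-17, under one in 10^16
```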
What little variation there was in the difference of those cross-tab margins seemed to happen slowly over many weeks, not like the week-to-week random jitter expected for real statistics. Since the weekly random error in each result should be independent of the previous week, the squared random weekly changes should average twice the variance. (Remember, for independent variables the variance of the sum or difference is just the sum of the variances.) That's 2*80.5= 161 in this case. The actual average of the square of the weekly change in the difference between these reported margins was 1.475. It is hardly necessary even to do statistical analysis to see that something is drastically wrong there, much worse even than in the reported variances, which were already essentially impossible.
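To get a feel for those numbers, the sketch below applies the twice-the-variance rule to just the eight "Diff" values shown in the table above (the 1.475 figure in the text is averaged over all 60 weeks; this is only an illustration):

```python
# Average squared week-to-week change of the eight Diff values shown above.
diffs = [-3, -3, -4, -2, +3, +3, +3, +3]                  # Diff column, first 8 weeks
changes = [b - a for a, b in zip(diffs, diffs[1:])]       # week-to-week changes
mean_sq = sum(c * c for c in changes) / len(changes)
print(changes, f"-> average squared change = {mean_sq:.1f}")   # about 4.3, nowhere near 161
```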
So far we have described extreme anomalies in the cross-tabs. We have not yet directly described the top-lines, the main results representing the overall population averages. Could these have been obtained by standard methods, regardless of how the cross-tabs for sub-populations were generated? The top-line statistics require more care, because they are expected to have less statistical jitter and because there is no matching subset to use to approximately cancel the non-random changes over time.
For the data from the first 60 weeks, before the change in N, we found no obvious lack of jitter in the top-lines. For the next 14 weeks, the top-line margins give the immediate impression that they don't change as much in single week steps as would be expected just from random statistics. A detailed analysis does show very suspiciously low weekly changes in those margins, but the analysis is too complex for this short note.
3. We now turn instead to another oddity in the week-to-week changes in the top-lines. For background, let’s look at the changes in the Obama Fav from Gallup's tracking poll, with 162 changes from one independent 3-day sample of N=1500 to the next. There is a smooth distribution of changes around zero, with no particular structure. That’s just as expected.
Now let’s look at the same for the weekly changes in R2K's first 60 weeks. There are many changes of 1% or -1%, but very few of 0%. It's as if the coin wanted to change on each flip, rarely giving heads or tails twice in a row. That looks very peculiar, but with only 59 numbers it's not so extremely far outside the range of what could happen by accident, especially since any accidental change in week 2 shows up in both the change from week 1 and the change to week 3, complicating the statistics.
If we now look at all the top-line changes in favorability (or the corresponding first answer of each three-answer question) for the last 14 weeks, we see a similar pattern.
A very similar missing-zeros pattern also appears in the complete first 60-week collection, as we see here for the nine 3-answer questions asked in each of those weeks. (Here, just to be extra cautious, we only show every other weekly change, so each weekly result shows up in only one change.)
How do we know that the real data couldn't possibly have many changes of +1% or -1% but few changes of 0%? Let's make an imaginative leap and say that, for some inexplicable reason, the actual changes in the population's opinion were always exactly +1% or -1%, equally likely. Since real polls would have substantial sampling error (about +/-2% in the week-to-week numbers even in the first 60 weeks, and more in the later weeks with their smaller samples), the distribution of weekly changes in the poll results would be smeared out, with slightly more ending up rounding to 0% than to -1% or +1%. No real results could show a sharp hole at 0%, barring yet another wildly unlikely accident.
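A small Monte Carlo sketch makes the point concrete. The noise level here is an assumed value of roughly the size quoted above, not R2K's actual sampling error:

```python
# Even if the true weekly change were always exactly +1 or -1 point, sampling
# noise (assumed Gaussian, std dev 1 point) smears the reported, rounded changes.
import random
random.seed(0)

counts = {-1: 0, 0: 0, +1: 0}
trials = 100_000
for _ in range(trials):
    true_change = random.choice([-1, +1])                      # hypothetical real change
    reported = round(true_change + random.gauss(0, 1.0))       # add noise, round to whole percent
    if reported in counts:
        counts[reported] += 1

small = sum(counts.values())
print({k: f"{v / small:.0%}" for k, v in counts.items()})
# roughly 35% of the small reported changes land on 0% -- no "hole" at zero
```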
By "unlikely" we mean that, just looking at the last 14 weeks alone, and counting only differences between non-overlapping pairs of weeks just to make sure all the random changes are independent, the chances are about one in a million that so few would be 0% out of just the results that came out +1%, 0%, or -1%. When those first 60 weeks of data are also included, the chances become less than 1 in 1016. (There are some minor approximations used here, but not ones of practical import.)
The missing zeros anomaly helps us guess how the results were generated. Some fancy ways to estimate population percentages based on polling results and prior knowledge give more stable results than do simple polls. So far as we are aware, no such algorithm shows too few changes of zero, i.e. none has an aversion to outputting the same whole number percent in two successive weeks. On the other hand, it has long been known that when people write down imagined random sequences they typically avoid repetition, i.e. show too few changes of zero. [Paul Bakan, "Response-Tendencies in Attempts to Generate Random Binary Series," The American Journal of Psychology, Vol. 73, No. 1 (Mar. 1960), pp. 127-131.]
People who have been trusting the R2K reports should know about these extreme anomalies. We do not know exactly how the weekly R2K results were created, but we are confident they could not accurately describe random polls.
We thank J. M. Robins and J. I. Marden for expert advice on some technical issues and Markos Moulitsas for his gracious assistance under very trying circumstances. Calculations were done in Matlab, except for calculations of very low probabilities, which used an online tool.
Postscript: Del Ali, the head of R2K, was contacted by Markos concerning some of these anomalies on 6/14/2010. Ali responded promptly but at the time of this writing has not yet provided either any explanations or any raw data. We sent a draft of this paper to him very early on 6/28/2010, with an urgent request for explanations or factual corrections. Ali sent a prompt response but, as yet, no information.
Mark Grebner is a political consultant.
Michael Weissman is a retired physicist.
Jonathan Weissman, a wildlife research technician, is opening a blog with Michael for more accessible in-depth explanations, Q&A, arguments, etc. on the technical side of these poll forensics.