Barack Obama lost South Carolina to John McCain in 2008 by a wide margin, about 862K votes to 1.03M. It is straightforward to show that Obama, although he lost the state, benefited from disproportionate support from black voters.
It goes like this: some counties in SC have as few as 8.6% non-white (mostly black) registered voters, while others have as many as 71%. There's a simple statistical test (called a Pearson test) which takes two columns of numbers and determines if they are likely to be correlated (or, rather, if they are unlikely to be uncorrelated, which is almost the same thing).
So I checked this for the Presidential General Election of 2008 in South Carolina, and found that Obama received disproportionate support from non-white voters (surprising, probably, nobody). But I also checked this for the Democratic Primary Election of 2010 in South Carolina, where Alvin Greene inexplicably and unexpectedly won. And guess what the results show?
I searched for correlation between the percentage of non-white voters in each county in South Carolina, and the percentage of votes Obama received in each county. The test uses a formula to combine numbers in column 1 with their partner in column 2, and then adds up that combination, producing a "figure of merit". If the figure of merit is high, it means the numbers are more likely to be correlated.
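The "figure of merit" described above can be sketched in a few lines. This assumes the figure of merit is the standard Pearson correlation coefficient r, which is a guess at what the test used here computes; the county numbers are invented for illustration and are not the South Carolina data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two columns of equal length, at least 2 rows")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Combine each number with its partner (product of deviations from the
    # column means), then add up those combinations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a constant column has no defined correlation")
    # Scale so the result always lies between -1 and +1.
    return cross / (spread_x * spread_y)

# Hypothetical county figures, invented for illustration only.
pct_nonwhite = [10.0, 25.0, 40.0, 55.0, 70.0]
pct_candidate = [30.0, 38.0, 47.0, 52.0, 66.0]
r = pearson_r(pct_nonwhite, pct_candidate)
print(f"r = {r:.3f}")
```

A value of r near +1 or -1 means the two columns rise and fall together (or in opposition); a value near 0 means the pairing carries little linear signal.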
And guess what? The analysis shows that Obama unquestionably was disproportionately supported by black voters in South Carolina. More importantly, it shows that Alvin Greene was not.
The Pearson Test
The Pearson test asks a simple question: assuming there's no relationship between these two columns of numbers, what's the probability that a correlation as strong (or stronger) would be observed between them? When the probability is very small, it means it is unlikely that the two columns of numbers are uncorrelated -- and more likely that there's a relationship between the two.
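One way to make that question concrete is a permutation test: shuffle one column many times, which destroys any real relationship, and count how often the shuffled data correlate at least as strongly as the real pairing. This is a swapped-in technique chosen because it needs only the standard library; it is not necessarily how the probabilities quoted in this diary were computed, and with a few thousand shuffles it cannot resolve extremely small probabilities. The data and the function names are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Estimate P(|r| at least as large as observed) if the columns are unrelated."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real pairing between the columns
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction: the estimate can never be exactly zero.
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical, strongly related columns (invented for illustration).
xs = [9, 15, 22, 30, 41, 50, 62, 71]
ys = [31, 35, 38, 45, 49, 58, 61, 70]
p_strong = permutation_p_value(xs, ys)
print(f"estimated probability under 'no relationship': {p_strong:.4f}")
```

The smallest value this estimate can report is 1/(n_shuffles + 1), so a result at that floor should be read as "smaller than this", not as the true probability.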
Sure enough: that probability for the 2008 General Election -- testing whether Obama received disproportionate support from non-white voters in South Carolina -- is 5e-24. That is roughly ten quadrillion (10^16) times smaller than the roughly 1-in-14-million chance of winning the California Lottery 6/49 with a single ticket. With a high degree of certainty, Barack Obama received disproportionately many votes from non-white voters.
However, what about the Democratic Primary Election of 2010? The same analysis, for Alvin Greene, finds a probability of only 0.17 -- basically, a 1 in 6 shot, meaning it's no more unexpected than correctly calling the roll of a die. This means there is no evidence that Alvin Greene received disproportionate support from black voters in South Carolina.
Yesterday, I wrote this diary, which describes a statistical analysis I performed to investigate the suggestion that Alvin Greene received votes from black voters because he was black. The conclusion of that analysis was that there's absolutely no evidence for this in the election results published by county, in comparison with the 2000 US Census figures for blacks by county.
In the discussion which followed, someone suggested that the 2000 US Census figures were not accurate enough to establish the percentage of black Democratic voters by county, and that I should use the published breakdown of Democratic non-white voters by county instead. South Carolina does not publish a breakdown of voters by race, county and party simultaneously; so, as a proxy, I used the published numbers of voters by race and county.
I then did the following simulation, identical to what was described in my previous diary. I assumed the total number of votes in each county was given by the total votes in the two-man contest (Rawl+Greene). I excluded Charleston County from this simulation: it is Rawl's home county, Rawl won it, and it is safe to assume he was best known there. I assumed that the percentages of white and non-white voters given by the South Carolina Election Commission, by county, applied to the Democratic primary. I then assumed that white voters had a 50% chance of voting for Greene, and that black voters had some probability of voting for Greene which started at 50%.
I ran that simulation 100 times, generating 100 realizations of the election, using a Poisson deviate of the expected number of votes for Greene in each county, with Rawl taking the remaining votes. Then I increased the probability that black voters supported Greene over Rawl to 51%, then 52%, 53%, and so on, producing 100 realizations for each percentage. For each realization, I calculated the Pearson test probability, and I plotted that against the total statewide number of votes that Greene would have received. This plot is shown below.
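A minimal sketch of that kind of simulation follows. Everything specific in it is an assumption made for illustration: the county list is invented, the function names are hypothetical, and it reports the correlation coefficient r for each realization instead of the Pearson probability. The Poisson draw uses Knuth's method applied in chunks, which is one standard-library way to get a Poisson deviate, not necessarily the one used for the plot.

```python
import math
import random

def poisson(lam, rng, chunk=500.0):
    """Poisson deviate with mean lam, using Knuth's method in chunks.

    A Poisson(lam) draw equals a sum of independent Poisson draws whose means
    add up to lam; chunking keeps exp(-mean) from underflowing to zero.
    """
    total, remaining = 0, lam
    while remaining > 0:
        piece = min(remaining, chunk)
        remaining -= piece
        limit = math.exp(-piece)
        product = rng.random()
        while product > limit:
            total += 1
            product *= rng.random()
    return total

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cross / (sx * sy)

def simulate_election(counties, p_nonwhite, rng, p_white=0.5):
    """One realization: returns (statewide Greene votes, Pearson r by county)."""
    greene_total, nonwhite_pct, greene_pct = 0, [], []
    for total_votes, nonwhite_frac in counties:
        expected = total_votes * ((1 - nonwhite_frac) * p_white
                                  + nonwhite_frac * p_nonwhite)
        # Greene gets the Poisson draw (capped); Rawl takes the remainder.
        greene = min(poisson(expected, rng), total_votes)
        greene_total += greene
        nonwhite_pct.append(100 * nonwhite_frac)
        greene_pct.append(100 * greene / total_votes)
    return greene_total, pearson_r(nonwhite_pct, greene_pct)

# Hypothetical counties: (two-candidate votes, non-white share). Invented numbers.
COUNTIES = [(3000, 0.10), (5000, 0.25), (2000, 0.40),
            (4000, 0.55), (2500, 0.70), (3500, 0.33)]

rng = random.Random(42)
for pct in (50, 60, 72):
    runs = [simulate_election(COUNTIES, pct / 100, rng) for _ in range(20)]
    mean_votes = sum(v for v, _ in runs) / len(runs)
    mean_r = sum(r for _, r in runs) / len(runs)
    print(f"non-white support {pct}%: mean Greene votes {mean_votes:.0f}, "
          f"mean r {mean_r:+.2f}")
```

With these made-up counties, raising the non-white support level pushes both the statewide Greene total and the county-level correlation upward together, which is the qualitative behavior the simulation is meant to expose.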
What you see in this plot is the following: the X-axis is the number of votes the simulation produced for Greene, and the Y-axis is the corresponding Pearson probability (actually, the logarithm of that probability). In the upper left hand corner are realizations of the simulation where we assumed non-white voters had a 50% chance of voting for Greene. There, Greene receives about half the total vote (88,000 votes), and the Pearson probability is comparatively large (about 0.01--0.1), indicating that the simulation finds no evidence of a correlation between the percentage of non-white voters in a county and the percentage of votes Greene received in that county.
Then, the simulation increased the likelihood for non-white voters to select Greene over Rawl: 51%, 52%, and so on, up to 100%. As this percentage increases, the simulated number of votes for Greene also increases, as we would expect. But, importantly, the Pearson probability also begins to plummet. By the time the simulation reaches 60% (about 94,000 votes for Greene statewide), the typical Pearson probability has dropped below 0.0001, which is a level at which most statisticians would be comfortable declaring a "significant correlation" between the percentage of non-white voters in a county and the percentage of votes Greene received in that county.
However, when we look at the plot, we notice that in the actual election Greene received about 100,300 votes. In order to get to that many statewide votes, the non-white voters would have to have preferred Greene over Rawl by about 72% to 28%, on average. And, as you can see on the graph, that would typically produce a Pearson probability of about 1e-19, which is several hundred billion times less likely than your winning the California State Lottery 6/49 with a single ticket.
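As a rough consistency check on those figures, the statewide arithmetic can be worked backwards. This sketch assumes a single statewide non-white share n (a simplification; the simulation itself works county by county) and takes the total vote as twice the 88,000 quoted for an even split. The implied share is inferred here, not taken from the published data.

```python
total_votes = 2 * 88_000   # "about half the total vote (88,000 votes)"
greene_actual = 100_300    # Greene's votes, excluding Charleston County
p_white = 0.50             # assumed white support for Greene
p_nonwhite = 0.72          # non-white support needed, per the text above

# Greene's share = p_white * (1 - n) + p_nonwhite * n; solve for n.
greene_share = greene_actual / total_votes
implied_nonwhite_share = (greene_share - p_white) / (p_nonwhite - p_white)

def predicted_votes(p):
    """Statewide Greene votes if non-white voters back him with probability p."""
    return total_votes * (p_white * (1 - implied_nonwhite_share)
                          + p * implied_nonwhite_share)

print(f"implied statewide non-white share: {implied_nonwhite_share:.3f}")
print(f"predicted Greene votes at 60% support: {predicted_votes(0.60):.0f}")
```

Under these assumptions the implied non-white share comes out a little under one third, and the predicted total at 60% support lands close to the roughly 94,000 votes quoted for that support level.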
In other words, what this graph shows is, for Greene to have won solely on the preferential support of black voters, we should see a significant correlation between the percentage of non-white voters in a county and the percentage of vote that Greene received in that county.
But look up at that blue dot. That blue dot is the actual election result: Greene with about 100,300 votes (again, excluding Charleston County), but with a Pearson probability of only 0.17 -- that is, with no detectable correlation between the percentage of non-white voters in a county and the percentage of votes Greene received. It is simply not possible -- or rather, not within the realm of credible likelihood -- that black voters provided the disproportionate support to get Greene to 100,000 votes without producing a correlation between the percentage of non-white voters by county and the percentage of votes Greene received in that county.
What this figure is saying makes sense when you think about it: if black voters disproportionately voted for Greene then, the more they supported him, the more detectable a correlation between the percentage of black voters in a county and the percentage of the votes Greene received in that county should be. But, what this figure also shows is, there is no such correlation between the percentage of black voters in a county and the percentage of the votes Greene received in that county, even though the statistical test applied is sensitive enough to detect it.
Based on this result, I conclude that the votes that Greene received are not due to his being disproportionately supported by black voters.
What does this mean? It means that one of the reasons that commentators have thrown out there -- that people noticed the spelling of his name, concluded he was black, and voted accordingly -- has absolutely no support in the election results. Greene therefore received support in equal proportion from non-white and white voters alike.
The reason for Greene's nomination must lie elsewhere.
More analysis to come.
Update Monday June 14 12:09 EDT: I've been asked to comment on a similar analysis made by jeffmd over at swingstateproject.com.
There are (at least) two steps in correlation characterization. The first is determining that a correlation exists in the underlying population. This can be called "detection". This is, essentially, answering the question that I answer in the diary above: "Assuming no correlation exists in the underlying population, what is the probability of producing a figure of merit as extreme as, or more extreme than, the one observed in these data?" The answer to that question here is about 0.17, a value so high that no statistician would consider it a statistically significant detection. The conclusion one draws from that is that there is no evidence for correlation between the percentage of non-white voters in a county and the percentage of the vote Greene received in that county.
The second step in characterizing a correlation is measuring the magnitude of the correlation, what is called "characterization of the signal". Doing so answers the question, "If I were to continue taking data, at what rate will the magnitude of the signal increase?" This question assumes the existence of a correlation in the underlying population. This is the question answered in that previous work, which quotes an R^2 value from a linear regression analysis. That is a logical error: (almost) any two columns of real-world data will produce a non-zero R^2 value. The author of the previous work (jeffmd) mistakes a non-zero R^2 value for a demonstration that the correlation is "detected" (as I define it above). But it's not the same thing at all.
Thus, those previous results do not demonstrate a significance of detection. Without a significance of detection, the magnitude of the R^2 value has no meaning, except as a limit on the possible correlation which is undetected in the present sample size, but which may be detectable in a larger sample.
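The distinction between a non-zero R^2 and a detection can be illustrated with two columns that are unrelated by construction. This is a sketch under stated assumptions: the columns are random numbers, not election data; the row count of 45 is merely illustrative; and the detection probability is estimated with a permutation test, a swapped-in technique that may differ from the test used in the diary above.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cross / (sx * sy)

def r_squared_and_p(seed, n_rows=45, n_shuffles=500):
    """R^2 and a permutation p-value for two independent random columns."""
    rng = random.Random(seed)
    col_a = [rng.random() for _ in range(n_rows)]
    col_b = [rng.random() for _ in range(n_rows)]  # independent of col_a
    r = pearson_r(col_a, col_b)
    observed = abs(r)
    shuffled = list(col_b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson_r(col_a, shuffled)) >= observed:
            hits += 1
    return r * r, (hits + 1) / (n_shuffles + 1)

for seed in range(5):
    r2, p = r_squared_and_p(seed)
    print(f"seed {seed}: R^2 = {r2:.4f}, detection probability ~ {p:.3f}")
```

Every run reports some positive R^2 even though nothing connects the columns; it is the probability, not the size of R^2 on its own, that says whether a correlation has been detected.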