Malark-O-Meter's mission is to statistically analyze fact checker rulings to make comparative judgments about the factuality of politicians, and to measure our uncertainty in those judgments. Malark-O-Meter's methods, however, have a serious problem. To borrow terms made popular by Nate Silver's new book, Malark-O-Meter isn't yet good at distinguishing the signal from the noise. Moreover, we can't even distinguish one signal from another. I know. It sucks. But I'm just being honest. Without honestly appraising how well Malark-O-Meter fulfills its mission, there's no way to improve its methods.
Note: if you aren't familiar with how Malark-O-Meter works, I suggest you visit the Methods section.
The signals that we can't distinguish from one another are the real differences in factuality between individuals and groups, versus the potential ideological biases of fact checkers. For example, I've shown in a previous post that Malark-O-Meter's analsis of the 2012 presidential election could lead you to believe either that Romney is between four and 14 percent more full of malarkey than Obama, or that PolitiFact and The Fact Checker have on average a liberal bias that gives Obama between a four and 14 percentage point advantage in truthfulness, or that the fact checkers have a centrist bias that shrinks the difference between the two fact checkers to just six percent of what frothy-mouthed partisans believe it truly is. Although I've verbally argued that fact checker bias is probably not as strong as either conservatives or liberals believe, no one...NO ONE...has adequately measured the influence of political bias on fact checker rulings.
In a previous post on Malark-O-blog, I briefly considered some methods to measure, adjust, and reduce political bias in fact checking. Today, let's discuss the problem with Malark-O-Meter's methods that we can't tell signal from noise. The problem is a bit different than the one Silver describes in his book, which is that people have a tendency to see patterns and trends when there aren't any. Instead, the problem is how a signal might influence the amount of noise that we estimate.
Again, the signal is potential partisan or centrist bias. The noise comes from sampling error, which occurs when you take an incomplete sample of all the falsifiable statements that a politician makes. Malark-O-Meter estimates the sampling error of a fact checker report card by randomly drawing report cards from a Dirichlet distribution, which describes the probability distribution of the proportion of statements in each report card category. Sampling error is higher the smaller your sample of statements. The greater your sampling error, the less certain you will be in the differences you observe among individuals' malarkey scores.
To illustrate the sample size effect, I've reproduced a plot of the simulated malarkey score distributions for Obama, Romney, Biden, and Ryan, as of November 11th, 2012. Obama and Romney average 272 and ~140 rated statements per fact checker, respectively. Biden and Ryan average ~37 and ~21 statements per fact checker, respectively. The difference in the spread of their probability distributions is clear from the histograms and the differences between the upper and lower bounds of the labeled 95% confidence intervals.
(Click here to view the image.)
The trouble is that Malark-O-Meter's sampling distribution assumes that the report card of all the falsifiable statements an individual ever made would have similar proportions in each category as the sample report card. And that assumption implies another one: that the ideological biases of fact checkers, whether liberal or centrist, do not influence the probability that a given statement of a given truthfulness category is sampled.
In statistical analysis, this is called selection bias. The conservative ideologues at PolitiFactBias.com (and Zebra FactCheck, and Sublime Bloviations; they're all written by at least one of the same two guys, really) suggest that fact checkers could bias the selection of their statements toward more false ones made by Republicans, and more true ones made by Democrats. Fact checkers might also be biased toward selecting some statements that make them appear more left-center so that they don't seem too partisan. I'm pretty sure there are some liberals out there who would agree that fact checkers purposefully choose a roughly equal number of true and false statements by conservative and liberal politicians so that they don't seem partisan. In fact, that's a common practice for at least one fact checker, FactCheck.org. The case for centrist bias isn't as clear for PolitiFact or The Fact Checker.
I think it will turn out that fact checkers' partisan or centrist biases, whether in rating or sampling statements, are too weak to swamp the true differences between individuals or groups. It is, however, instructive to examine the possible effects of selection bias on malarkey scores and their sampling errors. (In contrast, the possible effects of ideological bias on the observed malarkey scores are fairly obvious.)
My previous analysis of the possible liberal and centrist biases of fact checkers was pretty simple. To estimate the possible partisan bias, I simply compared the probability distribution of the observed differences between the Democratic and Republican candidates to ones in which the entire distribution was shifted so that the mean difference was zero, or so that the difference between the parties was reversed. To estimate possible centrist bias, I simply divided the probability distribution that I simulated by the size of the difference that frothy-mouthed partisans would expected, which is large. That analysis assumed that the width of the margin of error in the malarkey score, which is determined by the sampling error, remained constant after accounting for fact checker bias. But that isn't true.
There are at least two ways that selection bias can influence the simulated margin of error of a malarkey score. One way is that selection bias can diminish the efficiency of a fact checkers' search for statements to fact check, leading to a smaller sample size of statements on each report card. Again, the smaller the sample size, the wider the margin of error. The wider the margin of error, the more difficult it is to distinguish among individuals, holding the difference in their malarkey scores constant. So the efficiency effect of selection bias causes us to underestimate, not overestimate, our certainty in the differences in factuality that we observe. So the only reason why we should worry about this effect is that it would diminish our confidence in observed differences in malarkey scores, which might be real even though we don't know the reason (bias versus real differences in factuality) that those differences exist.
The bigger problem, of course, is that selection bias influences the probability that statements of a given truthfulness category are selected into an individual report card. Specifically, selection bias might increase the probability that more true statements are chosen over less true statements, or vice versa, depending on the partisan bias of the fact checker. Centrist selection bias might increase the probability that more half true statements are chosen, or that more equal numbers of true and false statements are chosen.
The distribution of statements in a report card definitely influences the width of the simulated margin of error. Holding sample size constant, the more even the statements are distributed among the categories, the greater the margin of error. Conversely, when statements are clumped into only a few of the categories, the margin of error is smaller. To illustrate, let's look at some extreme examples.
Suppose I have an individual's report card that rates 50 statements. Let's see what happens to the spread of the simulated malarkey score distribution when we change the spread of the statements across the categories from more even to more clumped. We'll measure how clumped the statements are with something called the Shannon entropy. The Shannon entropy is a measure of uncertainty, typically measured in bits (binary digits that can be 0 or 1). In our case, entropy measures our uncertainty in the truthfulness category of a single statement sampled from all the statements that an individual has made. The higher the entropy score, the greater the uncertainty. Entropy (thus uncertainty) is greatest when the probabilities of all possible events are equal to one another.
We'll measure the spread of the simulated malarkey score distributed by the width of its 95% confidence interval. The 95% confidence interval is the range of malarkey scores that we can be 95% certain would result from another report card with the same number of statements sampled from the same person, given our beliefs about the probabilities of each statement.
We'll compare six cases. First is the case when the true probability of each category is the same. The other five cases are when the the true probability of one category is 51 times greater than the probabilities of the other categories, which would define our beliefs of the category probabilities if we observed (or forced through selection bias) that all 50 statements were in one of the categories. Below is a table that collects the entropy and confidence interval width from each of the six cases, and compares them to the equal statement probability case, for which the entropy is greatest the confidence intervals are widest. Entropies and are rounded to the nearest tenth, confidence interval widths to the nearest whole number, and comparisons to the nearest tenth. Here are the meanings of the column headers.
- Case: self explanatory
- Ent.: Absolute entropy of assumed category probabilities
- Comp. ent.: Entropy of assumed category probabilities compared to the case when the probabilities are all equal, expressed as a ratio
- CI width: Width of 95% confidence interval
- Comp. CI width: Width of 95% confidence interval compared to the case when the probabilities are all equal, expressed as a ratio
And here is the table:
Case |
Ent. |
Comp. ent. |
CI width |
Comp. CI width |
Equal |
2.3 |
1.0 |
18 |
1.0 |
All true |
0.5 |
0.2 |
12 |
0.66 |
All mostly true |
0.5 |
0.2 |
9 |
0.5 |
All half true |
0.5 |
0.2 |
7 |
0.4 |
All mostly false |
0.5 |
0.2 |
9 |
0.5 |
All false |
0.5 |
0.2 |
12 |
0.66 |
For all the clumped cases, the entropy is 20% of the entropy for the evenly distributed case. In fact, the entropy of all the clumped cases are the same because the calculation of entropy doesn't care about which categories are more likely than others. It only cares whether some categories are more likely than others.
The lower entropy in the clumped cases corresponds to small confidence intervals relative to the even case, which makes sense. The more certain we think we are in the probability that any one statement will be in a given report card category, the more certain we should be in the malarkey score.
This finding suggests that if fact checker bias causes oversampling of statements in certain categories, Malark-O-Meter will overestimate our certainty in the observed differences if the true probabilities within each category are more even. This logic could apply to partisan biases that lead to oversampling of truer or more false statements, or to centrist biases that oversample half true statements. The finding also suggests that a centrist bias that leads to artificially equivalent probabilities in each category will cause Malark-O-Meter to underestimate the level of certainty in the observed statements.
Another interesting finding is that the confidence interval widths that we've explored follow a predictable pattern. Here's a bar plot of the comparative CI widths from the table above.
Click her for the image.
The confidence interval is widest in the equal probability case. From there, we see a u-shaped pattern, with the narrowest confidence intervals occurring when we oversample half true statements. The confidence intervals get wider for the cases when we oversample mostly true or mostly false statements, and wider still for the cases when we oversample true or false statements. The confidence interval widths are equivaelent between the all true and all false cases, and the all mostly true and all mostly false cases.
What's going on? I don't really know yet. We'll have to wait for another day, and a more detailed analysis. I suspect it has something to do with how the malarkey score is calculated, which results in fewer malarkey score possibilities when the probabilities are more closely centered on half true statements.
Anyway, we're approaching a better understanding of how the selection bias among fact checkers can influence our comparative judgments of the factuality of politicians. Usefully, the same logic applies to the effects of fact checkers' rating biases in the absence of selection bias. You can expect Malark-O-Meter's honesty to continue. We're not here to prove any point that can't be proven. We're here to give an honest appraisal of how well we can compare the factuality of individuals using fact checker data. Stay tuned.