Last year, while living away from my family to do ethnographic fieldwork in a remote village on a tiny Lesser Antillean island, I kept myself sane and connected to the political news in my home country by creating a new hobby. I applied my knowledge of inferential statistics and computational simulation to use fact checker reports from PolitiFact and The Fact Checker at the Washington Post to comparatively judge the truthfulness of the 2012 presidential and vice presidential candidates, and (more importantly) to measure our uncertainty in those judgments.

The site (and its syndication on the Daily Kos) generated some good discussion, some respectable traffic, and (I hope) showed its followers the potential for a new kind of inference-driven fact checking journalism. My main conclusions from the 2012 election analysis were:

(1) The candidates aren't as different as partisans left or right would have us believe.

(2) But the Democratic ticket was somewhat more truthful than the Republican ticket, both overall, and during the debates.

(3) It's quite likely that the 2012 Republican ticket was less truthful than the 2008 Republican ticket, and somewhat likely that the 2012 Democratic ticket was less truthful than the 2008 Democratic ticket.

Throughout, I tempered these conclusions with the recognition that my analyses did not account for the possible biases of fact checkers, including biases toward fairness, newsworthiness, and, yes, political beliefs. Meanwhile, I discussed ways to work toward measuring these biases and adjusting measures of truthfulness for them. I also suggested that fact checkers should begin in earnest to acknowledge that they aren't just checking facts, but the logical validity of politicians' arguments, as well. That is, fact checkers should also become fallacy checkers who gauge the soundness of an argument, not simply the truth of its premises.

Now, it's time to close up shop. Not because I don't plan on moving forward with what I'm proud to have done here. I'm closing up shop because I have much bigger ideas.

I've started writing up a master plan for a research institute and social media platform that will revolutionize fact checking journalism. For now, I'm calling the project Sound Check. I might have to change the name because that domain name is taken. Whatever its eventual name, Sound Check will be like FiveThirtyEight meets YouGov meets PolitiFact meets RapGenius: data-driven soundness checking journalism and research on an annotated social web. You can read more about the idea from this draft executive summary.

Anyway, over the next three years (and beyond!), I hope you're going to hear a lot about this project. Already, I've started searching for funding so that I can, once I obtain my PhD in June 2014, start working full time on Sound Check.

One plan is to become an "Upstart". Upstart is a new idea from some ex-Googlers. At Upstart, individual graduates hedge their personal risk by looking for investor/mentors, who gain returns from the Upstart's future income (which is predicted from a proprietary algorithm owned by Upstart). Think of it as a capitalist, mentoring-focused sort of patronage. Unlike Kickstarter or other crowd-funding mechanisms, where patrons get feel-good vibes and rewards, Upstart investors are investing in a person like they would invest in a company.

Another plan is, of course, to go the now almost traditional crowd-funding route, but only for clearly defined milestones of the project. For example, first I'd want to get funding to organize a meet-up of potential collaborators and investors. Next I'd want to get funding for the beta-testing of the sound checking algorithm. After that I'd get funding for a beta-test of the social network aspect of Sound Check. Perhaps these (hopefully successfully) crowd-funded projects would create interest among heavy-hitting investors.

Yet another idea is to entice some university (UW?) and some wealthy person or group of people interested in civic engagement and political fact checking to partner with Sound Check in a way similar to how FactCheck.org grew out of the Annenberg Public Policy Center at the University of Pennsylvania.

Sound Check is a highly ambitious idea. It will need startup funding for servers, programmers, administrative staff, as well as training and maintaining Sound Checkers (that's fact checkers who also fallacy check). So I've got my work cut out for me. I'm open to advice and new mentors. And soon, I'll be open, along with Sound Check, to investors and donors.


Science journalist Chris Mooney recently wrote about "The Science of Why Comment Trolls Suck" in Mother Jones magazine. His article covers a study by researchers at the George Mason University Center for Climate Change Communication, who asked over a thousand study participants to read the same blog article about the benefits and risks of nanotechnology. The comment section that subjects experienced varied from civil discussion to name-calling flame war. The researchers found that witnessing flame wars caused readers' perceptions of nanotechnology risks to become more extreme.

Mooney argues that these findings don't bode well for the public understanding of climate science. He also argues that this... not your father's media environment any longer. In the golden oldie days of media, newspaper articles were consumed in the context of…other newspaper articles. But now, adds Scheufele, it's like "reading the news article in the middle of the town square, with people screaming in my ear what I should believe about it."
Finally, based on his interpretation of the evidence, Mooney advocates that we ignore the comments section.

I agree with Mooney that flame wars are detrimental to rational discourse, and that the George Mason study highlights the pitfalls of what Daniel Kahneman calls "System 1" thinking. Yet I counter that a purely civil discussion about climate change may also be counter-productive. Furthermore, the comments section doesn't mark a huge departure from "your father's media environment". Finally, the George Mason University study demonstrates that there is good reason to pay very close attention to the comments section, even if it is littered with bridges beneath which trolls dwell.

Read more below the fold.


This week, two political science blog posts about the difference between political engagement and factual understanding stood out to Malark-O-Meter. (Thanks to The Monkey Cage for Tweeting their links.) First, there's Brendan Nyhan's article at YouGov about how political knowledge doesn't guard against belief in conspiracy theories. Second, there's voteview's article about issues in the 2012 election. (Side note: This could be the Golden Era of political science blogging.) These posts stand out both as cautionary tales about what it means to be politically engaged versus factual, and as promising clues about how to assess the potential biases of professional fact checkers in order to facilitate the creation of better factuality metrics (what Malark-O-Meter is all about).

Read more below the fizzold.


Recently, the Nieman Journalism Lab reported on OpenCaptions, the creation of Dan "Truth Goggles" Schultz. OpenCaptions prepares a live television transcript from closed captions, which can then be analyzed. I came across OpenCaptions back in October, when I learned about Schultz's work on Truth Goggles, which highlights web content that has been fact checked by PolitiFact. Reading about it this time reminded me of something I'd written in my critique of the fact checking industry's opinions about factuality comparison among individual politicians.

At the end of that post, I commented on a suggestion made by Kathleen Hall Jamieson of the Annenberg Public Policy Center about how to measure the volume of factuality that a politician pumps into the mediasphere. Jamieson's suggestion was to weight the claims that a politician makes by the size of their audience. I pointed out some weaknesses of this factuality metric. I also recognized that it is still useful, and described the data infrastructure necessary to calculate the metric. Basically, you need to know the size of the audience of a political broadcast (say, a political advertisement), the content of the broadcast, and the soundness of the arguments made during the broadcast.
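To make the data requirements concrete, here is a minimal sketch of one way to read Jamieson's suggestion: an average of soundness ratings, weighted by audience size. The `Broadcast` record, its fields, the 0-to-1 soundness scale, and the numbers are all hypothetical placeholders of mine, not anything Jamieson or a fact checker has published.

```python
from dataclasses import dataclass


@dataclass
class Broadcast:
    """One political broadcast. Every field is a hypothetical placeholder."""
    audience: int     # estimated number of people reached
    soundness: float  # assumed scale: 0.0 (unsound) to 1.0 (sound)


def audience_weighted_soundness(broadcasts):
    """Average soundness, weighting each broadcast by its audience size."""
    total_audience = sum(b.audience for b in broadcasts)
    if total_audience == 0:
        raise ValueError("need at least one broadcast with a nonzero audience")
    weighted_sum = sum(b.audience * b.soundness for b in broadcasts)
    return weighted_sum / total_audience


# Made-up numbers, purely for illustration.
ads = [
    Broadcast(audience=1_000_000, soundness=0.25),
    Broadcast(audience=250_000, soundness=1.0),
]
print(audience_weighted_soundness(ads))  # prints 0.4
```

The widely seen, mostly unsound ad drags the average well below the simple mean of the two ratings, which is the point of weighting by audience.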

OpenCaptions shows promise as a way to collect the content of political broadcasts and publish it to the web for shared analysis. Cheers to Dan Schultz for creating yet another application that will probably be part of the future of journalism...and fact checking...and factuality metrics.


Yesterday, I argued that fact checkers who rate their rulings on a scale should incorporate the number and type of logical fallacies into their ratings. I also argued that the rating scales of fact checkers like PolitiFact and The Fact Checker are valuable, but they conflate soundness and validity, which causes their ratings to be vague. As usual, I syndicated the post on the Daily Kos. Kossack Ima Pseudoynm provided valuable constructive criticism, which we'll consider today below the fold.


Should fact checkers like PolitiFact and The Fact Checker explicitly incorporate logic into their rating systems?


Fact checkers perform a vital public service. The truth, however, is contentious. So fact checkers take criticism from all sides. Sometimes, they deserve it. For example, Greg Marx wrote in the Columbia Journalism Review,

But here’s where the fact-checkers find themselves in a box. They’ve reached for the clear language of truth and falsehood as a moral weapon, a way to invoke ideas of journalists as almost scientific fact-finders. And for some of the statements they scrutinize, those bright-line categories work fine.

A project that involves patrolling public discourse, though, will inevitably involve judgments not only about truth, but about what attacks are fair, what arguments are reasonable, what language is appropriate. And one of the maddening things about the fact-checkers is their unwillingness to acknowledge that many of these decisions—including just what constitutes “civil discourse”—are contestable and, at times, irresolvable.

Whether or not fact checkers wield it as a "moral weapon", they certainly use the "language of truth and falsehood", and some of them attempt to define "bright-line categories". This is most true for PolitiFact and The Fact Checker, which give clear cut, categorical rulings to the statements that they cover, and whose rulings currently form the basis of the malarkey score here at Malark-O-Meter, which rates the average factuality of individuals and groups.

The language of truth and falsehood does "invoke ideas of journalists as almost scientific fact-finders." But it isn't just the language of truth and falsehood that bestows upon the art of fact checking an air of science. Journalists who specialize in fact checking do many things that scientists do (but not always). They usually cover falsifiable claims, flicking a wink into Karl Popper's posthumous cup of tiddlies. They always formulate questions and hypotheses about the factuality of the claims that they cover. They usually test their hypotheses against empirical evidence rather than unsubstantiated opinion.

Yet fact checkers ignore a lot of the scientific method. For instance, they don't replicate (then again, neither do many scientists). Moreover, fact checkers like PolitiFact and The Fact Checker use rating scales that link only indirectly and quite incompletely to the logic of a claim. To illustrate, observe PolitiFact's description of its Truth-O-Meter scale.

True – The statement is accurate and there’s nothing significant missing.

Mostly True – The statement is accurate but needs clarification or additional information.

Half True – The statement is partially accurate but leaves out important details or takes things out of context.

Mostly False – The statement contains some element of truth but ignores critical facts that would give a different impression.

False – The statement is not accurate.

Pants on Fire – The statement is not accurate and makes a ridiculous claim. [Malark-O-Meter note: Remember that the malarkey score treats "False" and "Pants on Fire" statements the same.]

Sometimes, fact checkers specify in the essay component of their coverage the logical fallacies that a claim perpetrates. Yet neither the Truth-O-Meter scale nor The Fact Checker's Pinocchio scale specifies which logical fallacies were committed or how many. Instead, PolitiFact and The Fact Checker use a discrete, ordinal scale that combines accuracy in the sense of correctness with completeness in the sense of clarity.

By obscuring the reasons why something is false, these ruling scales make it easy to derive factuality metrics like the malarkey score, but difficult to interpret what those metrics mean. More importantly, PolitiFact and The Fact Checker make themselves vulnerable to the criticism that their truth ratings are subject to ideological biases because...well...because they are. Their apparent vagueness makes them so. Does this make the Truth-O-Meter and Pinocchio scales worthless? Probably not. But we can do better. Here's how.

When evaluating an argument (all claims are arguments, even if they are political sound bites), determine if it is sound. To be sound, all of an argument's premises must be true, and the argument must be valid. To be true, a premise must adhere to the empirical evidence. To be valid, an argument must commit no logical fallacies. The problem is that the ruling scales of fact checkers conflate soundness and validity. The solution is to stop doing that.
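As a rough sketch of what keeping those two judgments separate could look like in data, the snippet below records the truth of each premise and the list of fallacies independently, then derives soundness from both. The class names, fields, and example claim are hypothetical illustrations, not a description of any fact checker's actual system.

```python
from dataclasses import dataclass, field


@dataclass
class Premise:
    text: str
    is_true: bool  # does the premise agree with the empirical evidence?


@dataclass
class Argument:
    premises: list
    fallacies: list = field(default_factory=list)  # names of fallacies found

    def is_valid(self):
        # Using the definition above: valid means no logical fallacies.
        return len(self.fallacies) == 0

    def is_sound(self):
        # Sound means valid AND every premise is true.
        return self.is_valid() and all(p.is_true for p in self.premises)


# Hypothetical claim: true premises, but a fallacious inference.
claim = Argument(
    premises=[Premise("Unemployment fell after the bill passed.", True)],
    fallacies=["post hoc ergo propter hoc"],
)
print(claim.is_valid(), claim.is_sound())  # prints: False False
```

A rating built this way can say why a claim fails (false premise, named fallacy, or both) instead of folding everything into one ordinal label.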

When and if Malark-O-Meter grows into a fact checking entity, it will experiment with rating scales that specify and enumerate logical fallacies. It will assess both the soundness and the validity of an argument. I have an idea of how to implement this on the web that is so good, I don't want to give it away just yet.

There are thousands of years of formal logic research that stretch into the modern age. Hell, philosophy PhD Gary N. Curtis publishes an annotated and interactive taxonomic tree of logical fallacies on the web.

Stay tuned to Malark-O-Meter, where I'm staging a fact check revolution.


This post was originally published at Malark-O-Meter, which statistically analyzes fact checker rulings to make comparative judgments about the factuality of politicians, and measures our uncertainty in those judgments.

There's a lot of talk this week about Marco Rubio, who is already being vetted as a possible front runner in the 2016 presidential campaign...in 2012...right after the 2012 presidential campaign. In answer to the conservatives' giddiness about the Senator from Florida, liberals have been looking for ways to steal Rubio's...thunder? Clouds on the horizon that could lead to potential thunder maybe in a few years? I dunno. Anyway, one example of this odd little skirmish involves a comment that Senator Rubio made in answer to a GQ interviewer's question about the age of the Earth:

GQ: How old do you think the Earth is?

Marco Rubio: I'm not a scientist, man. I can tell you what recorded history says, I can tell you what the Bible says, but I think that's a dispute amongst theologians and I think it has nothing to do with the gross domestic product or economic growth of the United States. I think the age of the universe has zero to do with how our economy is going to grow. I'm not a scientist. I don't think I'm qualified to answer a question like that. At the end of the day, I think there are multiple theories out there on how the universe was created and I think this is a country where people should have the opportunity to teach them all. I think parents should be able to teach their kids what their faith says, what science says. Whether the Earth was created in 7 days, or 7 actual eras, I'm not sure we'll ever be able to answer that. It's one of the great mysteries. [emphasis added]

"Gotcha!" say my fellow liberals (and I). Ross Douthat, conservative blogger at the New York Times (among other places), argues convincingly that it was a "politician's answer" to a politically contentious question, but rightly asks why Rubio answered in a way that fuels the "conservatives vs. science" trope that Douthat admits has basis in reality. Douthat writes that Rubio could have said instead:
I’m not a scientist, but I respect the scientific consensus that says that the earth is — what, something like a few billions of years old, right? I don’t have any trouble reconciling that consensus with my faith. I don’t think the 7 days in Genesis have to be literal 24-hour days. I don’t have strong opinions about the specifics of how to teach these issues — that’s for school boards to decide, and I’m not running for school board — but I think religion and science can be conversation partners, and I think kids can benefit from that conversation.

So why didn't Rubio say that instead of suggesting wrongly, and at odds with overwhelming scientific consensus, that the age of the Earth is one of the greatest mysteries?

An issue more relevant to the fact checking industry that Malark-O-Meter studies and draws on to measure politicians' factuality is this: Why aren't statements like this featured in fact checking reports? The answer probably has something to do with one issue Rubio raised in his answer to GQ, and something that pops up in Douthat's wishful revision.

  • "I think the age of the universe has zero to do with how our economy is going to grow." (Rubio)
  • "...I'm not running for school board..." (Douthat)

You can easily associate these statements with a key constraint of the fact checking industry. As Glenn Kessler stated in a recent panel discussion about the fact checking industry, fact checkers are biased toward newsworthy claims that have broad appeal (PolitiFact's growing state-level fact checking effort notwithstanding). Most Americans care about the economy right now, and few Americans have ever thought scientific literacy was the most important political issue. Fact checkers play to the audience on what most people think are the most important issues of the day. I could not find one fact checked statement that a politician made about evolution or climate change that wasn't either a track record of Obama's campaign promises, or an assessment of how well a politician's statements and actions adhere to their previous positions on these issues.

What does the fact checker bias toward newsworthiness mean for Malark-O-Meter's statistical analyses of politicians' factuality? Because fact checkers aren't that interested in politicians' statements about things like biology and cosmology, the malarkey score isn't going to tell you much about how well politicians adhere to the facts on those issues. Does that mean biology, cosmology, and other sciences aren't important? Does that mean that a politician's scientific literacy doesn't impact the soundness of their legislation?


The scientific literacy of politicians is salient to whether they support particular policies on greenhouse gas reduction, or stem cell research, or education, or, yes, the economy. After all, although economics is a soft science, it's still a science. And if you watched the recent extended debate between Rubio and Jon Stewart on the Daily Show, and you also read the Congressional Research Report that debunks the trickle down hypothesis, and you've read the evidence that we'd need a lot of economic growth to solve the debt problem, you'd recognize that some of Rubio's positions on how to solve our country's economic problems do not align well with the empirical evidence.

But does that mean that Rubio is full of malarkey? According to his Truth-O-Meter report card alone, no. The mean of his simulated malarkey score distribution is 45, and we can be 95% certain that, if we sampled another incomplete report card with the same number of Marco Rubio's statements, his measured malarkey score would be between 35 and 56. Not bad. By comparison, Obama, the least full of malarkey among the 2012 presidential candidates, has a simulated malarkey score based on his Truth-O-Meter report card of 44 and is 95% likely to fall between 41 and 47. The odds that Rubio's malarkey score is greater than Obama's are only 3 to 2, and the difference between their malarkey score distributions averages only one percentage point.

How would a more exhaustive fact checking of Rubio's scientifically relevant statements influence his malarkey score? I don't know. Is this an indictment of truthfulness metrics like the ones that Malark-O-Meter calculates? Not necessarily. It does suggest, however, that Malark-O-Meter should look for ways to modify its methods to account for the newsworthiness bias of fact checkers.

If my dreams for Malark-O-Meter ever come to fruition, I'd like it to be at the forefront of the following changes to the fact checker industry:

  1. Measure the size and direction of association between the topics that fact checkers cover, the issues that Americans currently think are most important, and the stuff that politicians say.
  2. Develop a factuality metric for each topic (this would require us to identify the topic(s) relevant to a particular statement).
  3. Incorporate (and create) more fact checker sites that provide information about a politician's positions on topics that are underrepresented by the fact checker industry. For example, one could use a Truth-O-Meter-like scale to rate the positions that individuals have on scientific topics, which are often available at sites like

So it isn't that problems like these bring the whole idea of factuality metrics into question. It's just that the limitations of the fact checker data instruct us about how we might correct for them with statistical methods, and with new fact checking methods. Follow Malark-O-Meter and tell all your friends about it so that maybe we can one day aid that process.


Malark-O-Meter's mission is to statistically analyze fact checker rulings to make comparative judgments about the factuality of politicians, and to measure our uncertainty in those judgments. Malark-O-Meter's methods, however, have a serious problem. To borrow terms made popular by Nate Silver's new book, Malark-O-Meter isn't yet good at distinguishing the signal from the noise. Moreover, we can't even distinguish one signal from another. I know. It sucks. But I'm just being honest. Without honestly appraising how well Malark-O-Meter fulfills its mission, there's no way to improve its methods.

Note: if you aren't familiar with how Malark-O-Meter works, I suggest you visit the Methods section.

The signals that we can't distinguish from one another are the real differences in factuality between individuals and groups, versus the potential ideological biases of fact checkers. For example, I've shown in a previous post that Malark-O-Meter's analysis of the 2012 presidential election could lead you to believe either that Romney is between four and 14 percent more full of malarkey than Obama, or that PolitiFact and The Fact Checker have on average a liberal bias that gives Obama between a four and 14 percentage point advantage in truthfulness, or that the fact checkers have a centrist bias that shrinks the difference between the two candidates to just six percent of what frothy-mouthed partisans believe it truly is. Although I've verbally argued that fact checker bias is probably not as strong as either conservatives or liberals believe, no one...NO ONE...has adequately measured the influence of political bias on fact checker rulings.

In a previous post on Malark-O-blog, I briefly considered some methods to measure, adjust, and reduce political bias in fact checking. Today, let's discuss the problem with Malark-O-Meter's methods that we can't tell signal from noise. The problem is a bit different than the one Silver describes in his book, which is that people have a tendency to see patterns and trends when there aren't any. Instead, the problem is how a signal might influence the amount of noise that we estimate.

Again, the signal is potential partisan or centrist bias. The noise comes from sampling error, which occurs when you take an incomplete sample of all the falsifiable statements that a politician makes. Malark-O-Meter estimates the sampling error of a fact checker report card by randomly drawing report cards from a Dirichlet distribution, which describes the probability distribution of the proportion of statements in each report card category. Sampling error is higher the smaller your sample of statements. The greater your sampling error, the less certain you will be in the differences you observe among individuals' malarkey scores.
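Here is a minimal sketch of that kind of simulation, using only Python's standard library. It rests on several assumptions of mine rather than Malark-O-Meter's published method: the category weights (0 for True up to 100 for False), a uniform prior that adds one to each category count, and the two made-up report cards. A Dirichlet draw is built by normalizing independent gamma draws.

```python
import random

# ASSUMED weights: malarkey contributed by each category, in the order
# True, Mostly True, Half True, Mostly False, False (with Pants on Fire).
WEIGHTS = [0, 25, 50, 75, 100]


def draw_dirichlet(alphas, rng):
    """One draw of category proportions from Dirichlet(alphas)."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def simulate_scores(counts, n_draws=10000, seed=0):
    """Simulated score distribution for a report card of category counts."""
    rng = random.Random(seed)
    alphas = [c + 1 for c in counts]  # ASSUMED uniform prior
    scores = []
    for _ in range(n_draws):
        proportions = draw_dirichlet(alphas, rng)
        scores.append(sum(w * p for w, p in zip(WEIGHTS, proportions)))
    return scores


def interval_width(scores, level=0.95):
    """Width of the central interval that holds `level` of the scores."""
    ordered = sorted(scores)
    low = int(len(ordered) * (1 - level) / 2)
    high = int(len(ordered) * (1 + level) / 2) - 1
    return ordered[high] - ordered[low]


# Hypothetical report cards with identical proportions but different sizes.
small_card = [4, 4, 4, 4, 4]       # 20 rated statements
large_card = [40, 40, 40, 40, 40]  # 200 rated statements
print(interval_width(simulate_scores(small_card)))
print(interval_width(simulate_scores(large_card)))
```

The smaller report card should give the wider interval, which is the sample size effect described above.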

To illustrate the sample size effect, I've reproduced a plot of the simulated malarkey score distributions for Obama, Romney, Biden, and Ryan, as of November 11th, 2012. Obama and Romney average 272 and ~140 rated statements per fact checker, respectively. Biden and Ryan average ~37 and ~21 statements per fact checker, respectively. The difference in the spread of their probability distributions is clear from the histograms and the differences between the upper and lower bounds of the labeled 95% confidence intervals.

(Click here to view the image.)

The trouble is that Malark-O-Meter's sampling distribution assumes that the report card of all the falsifiable statements an individual ever made would have similar proportions in each category as the sample report card. And that assumption implies another one: that the ideological biases of fact checkers, whether liberal or centrist, do not influence the probability that a given statement of a given truthfulness category is sampled.

In statistical analysis, this is called selection bias. The conservative ideologues at (and Zebra FactCheck, and Sublime Bloviations; they're all written by at least one of the same two guys, really) suggest that fact checkers could bias the selection of their statements toward more false ones made by Republicans, and more true ones made by Democrats. Fact checkers might also be biased toward selecting some statements that make them appear more left-center so that they don't seem too partisan. I'm pretty sure there are some liberals out there who would agree that fact checkers purposefully choose a roughly equal number of true and false statements by conservative and liberal politicians so that they don't seem partisan. In fact, that's a common practice for at least one fact checker. The case for centrist bias isn't as clear for PolitiFact or The Fact Checker.

I think it will turn out that fact checkers' partisan or centrist biases, whether in rating or sampling statements, are too weak to swamp the true differences between individuals or groups. It is, however, instructive to examine the possible effects of selection bias on malarkey scores and their sampling errors. (In contrast, the possible effects of ideological bias on the observed malarkey scores are fairly obvious.)

My previous analysis of the possible liberal and centrist biases of fact checkers was pretty simple. To estimate the possible partisan bias, I simply compared the probability distribution of the observed differences between the Democratic and Republican candidates to ones in which the entire distribution was shifted so that the mean difference was zero, or so that the difference between the parties was reversed. To estimate possible centrist bias, I simply divided the probability distribution that I simulated by the size of the difference that frothy-mouthed partisans would expect, which is large. That analysis assumed that the width of the margin of error in the malarkey score, which is determined by the sampling error, remained constant after accounting for fact checker bias. But that isn't true.

There are at least two ways that selection bias can influence the simulated margin of error of a malarkey score. One way is that selection bias can diminish the efficiency of a fact checker's search for statements to fact check, leading to a smaller sample size of statements on each report card. Again, the smaller the sample size, the wider the margin of error. The wider the margin of error, the more difficult it is to distinguish among individuals, holding the difference in their malarkey scores constant. So the efficiency effect of selection bias causes us to underestimate, not overestimate, our certainty in the differences in factuality that we observe. So the only reason why we should worry about this effect is that it would diminish our confidence in observed differences in malarkey scores, which might be real even though we don't know the reason (bias versus real differences in factuality) that those differences exist.

The bigger problem, of course, is that selection bias influences the probability that statements of a given truthfulness category are selected into an individual report card. Specifically, selection bias might increase the probability that more true statements are chosen over less true statements, or vice versa, depending on the partisan bias of the fact checker. Centrist selection bias might increase the probability that more half true statements are chosen, or that more equal numbers of true and false statements are chosen.

The distribution of statements in a report card definitely influences the width of the simulated margin of error. Holding sample size constant, the more evenly the statements are distributed among the categories, the greater the margin of error. Conversely, when statements are clumped into only a few of the categories, the margin of error is smaller. To illustrate, let's look at some extreme examples.

Suppose I have an individual's report card that rates 50 statements. Let's see what happens to the spread of the simulated malarkey score distribution when we change the spread of the statements across the categories from more even to more clumped. We'll measure how clumped the statements are with something called the Shannon entropy. The Shannon entropy is a measure of uncertainty, typically measured in bits (binary digits that can be 0 or 1). In our case, entropy measures our uncertainty in the truthfulness category of a single statement sampled from all the statements that an individual has made. The higher the entropy score, the greater the uncertainty. Entropy (thus uncertainty) is greatest when the probabilities of all possible events are equal to one another.

We'll measure the spread of the simulated malarkey score distribution by the width of its 95% confidence interval. The 95% confidence interval is the range of malarkey scores that we can be 95% certain would result from another report card with the same number of statements sampled from the same person, given our beliefs about the probabilities of each statement.

We'll compare six cases. First is the case when the true probability of each category is the same. The other five cases are when the true probability of one category is 51 times greater than the probabilities of the other categories, which would define our beliefs of the category probabilities if we observed (or forced through selection bias) that all 50 statements were in one of the categories. Below is a table that collects the entropy and confidence interval width from each of the six cases, and compares them to the equal statement probability case, for which the entropy is greatest and the confidence intervals are widest. Entropies are rounded to the nearest tenth, confidence interval widths to the nearest whole number, and comparisons to the nearest tenth. Here are the meanings of the column headers.

  • Case: self explanatory
  • Ent.: Absolute entropy of assumed category probabilities
  • Comp. ent.: Entropy of assumed category probabilities compared to the case when the probabilities are all equal, expressed as a ratio
  • CI width: Width of 95% confidence interval
  • Comp. CI width: Width of 95% confidence interval compared to the case when the probabilities are all equal, expressed as a ratio

And here is the table:
Case               Ent.   Comp. ent.   CI width   Comp. CI width
Equal              2.3    1.0          18         1.0
All true           0.5    0.2          12         0.66
All mostly true    0.5    0.2          9          0.5
All half true      0.5    0.2          7          0.4
All mostly false   0.5    0.2          9          0.5
All false          0.5    0.2          12         0.66

For all the clumped cases, the entropy is 20% of the entropy for the evenly distributed case. In fact, the entropy is the same in all the clumped cases because the calculation of entropy doesn't care about which categories are more likely than others. It only cares whether some categories are more likely than others.
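The entropy figures in the table can be checked with a few lines of Python. This is a minimal sketch, assuming the clumped cases use category probabilities of 51/55 for the favored category and 1/55 for each of the other four (which follows from one category being 51 times more likely than each of the others). The function and variable names are illustrative, not part of Malark-O-Meter.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Five report card categories, all equally likely.
equal = [1 / 5] * 5

# One category 51 times more likely than each of the other four (assumed setup).
clumped = [51 / 55] + [1 / 55] * 4

h_equal = shannon_entropy(equal)
h_clumped = shannon_entropy(clumped)

print(round(h_equal, 1), round(h_clumped, 1), round(h_clumped / h_equal, 1))
# prints: 2.3 0.5 0.2
```

Moving the favored category to a different position in the list leaves the result unchanged, which is why every clumped row of the table has the same entropy.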

The lower entropy in the clumped cases corresponds to narrower confidence intervals relative to the even case, which makes sense. The more certain we think we are in the probability that any one statement will be in a given report card category, the more certain we should be in the malarkey score.
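Here is a rough sketch of how such a comparison could be simulated. It rests on assumptions that are not spelled out in this post: the five categories are scored 0, 25, 50, 75, and 100, the score of a report card is the average over its statements, and the category probabilities are held fixed rather than drawn from a distribution of beliefs. It will not reproduce the table's numbers exactly. It only illustrates the qualitative link between clumping and interval width.

```python
import random

# Hypothetical scoring, assumed for illustration: true=0 ... false=100.
CATEGORY_SCORES = [0, 25, 50, 75, 100]

def simulate_ci_width(probs, n_statements=50, n_sims=5000, seed=1):
    """Width of the central 95% interval of the simulated average score."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_sims):
        # Draw one simulated report card and average its category scores.
        draws = rng.choices(CATEGORY_SCORES, weights=probs, k=n_statements)
        means.append(sum(draws) / n_statements)
    means.sort()
    lower = means[n_sims * 25 // 1000]        # 2.5th percentile
    upper = means[n_sims * 975 // 1000 - 1]   # 97.5th percentile
    return upper - lower

equal = [1 / 5] * 5
all_true = [51 / 55] + [1 / 55] * 4
all_half_true = [1 / 55, 1 / 55, 51 / 55, 1 / 55, 1 / 55]

for name, probs in [("Equal", equal), ("All true", all_true),
                    ("All half true", all_half_true)]:
    print(name, simulate_ci_width(probs))
```

Under these assumptions the equal case gives the widest interval and the clumped cases give narrower ones.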

This finding suggests that if fact checker bias causes oversampling of statements in certain categories, Malark-O-Meter will overestimate our certainty in the observed differences if the true probabilities within each category are more even. This logic could apply to partisan biases that lead to oversampling of truer or more false statements, or to centrist biases that oversample half true statements. The finding also suggests that a centrist bias that leads to artificially equivalent probabilities in each category will cause Malark-O-Meter to underestimate the level of certainty in the observed statements.

Another interesting finding is that the confidence interval widths that we've explored follow a predictable pattern. Here's a bar plot of the comparative CI widths from the table above.

Click here for the image.

The confidence interval is widest in the equal probability case. From there, we see a u-shaped pattern, with the narrowest confidence intervals occurring when we oversample half true statements. The confidence intervals get wider for the cases when we oversample mostly true or mostly false statements, and wider still for the cases when we oversample true or false statements. The confidence interval widths are equivalent between the all true and all false cases, and between the all mostly true and all mostly false cases.

What's going on? I don't really know yet. We'll have to wait for another day, and a more detailed analysis. I suspect it has something to do with how the malarkey score is calculated, which results in fewer malarkey score possibilities when the probabilities are more closely centered on half true statements.

Anyway, we're approaching a better understanding of how the selection bias among fact checkers can influence our comparative judgments of the factuality of politicians. Usefully, the same logic applies to the effects of fact checkers' rating biases in the absence of selection bias. You can expect Malark-O-Meter's honesty to continue. We're not here to prove any point that can't be proven. We're here to give an honest appraisal of how well we can compare the factuality of individuals using fact checker data. Stay tuned.


If you've seen my UID lately, you'll notice that I haven't been here very long. Boy do I regret not coming here sooner. I could have seen the beginnings of and the Princeton Election Consortium. Is this where Drew Linzer started out, too?

Before I became a member, I read my fair share of DK. Yet as recently as two days before I signed up, I didn't even know that any regular Joe like me could write a DK Diary. I thought you were all paid professionals. That's how good the DK community is! Anyway, I have a confession to make. Once I learned that I could have a DK Diary of my very own, what prompted me to become a member is that I had started a blog, and I wanted to syndicate that blog here to help spread my ideas. Obviously, I seek a bigger audience because I think I have something important to say about my two current obsessions, which are fact checker data analysis and election forecast model averaging (soon to be added: redistricting algorithms and alternative voting mechanisms).

So far, some of my diary entries have led to meaningful discussion in the comments section. Meaningful discussion is another thing that attracted me to DK. Not only could I increase my audience, but I could talk with smart people about topics that interest me. In fact, that's the reason why I blog in the first place: to generate and participate in intelligent, rational discussions.

Unfortunately, a cadre of my fellow Kossacks are offended enough by my syndication practices that a few of the comment sections of my Diary entries are littered with a debate about something that is unimportant: whether or not my syndication constitutes "spam". Clearly, my posts aren't spam, which is the mass communication of a commercial message to a very large number of addresses. I don't get how you can apply the term "spammer" to describe a guy who shares data-driven essays about whether incumbents are more factual than challengers, or about whether we should employ Bayesian model averaging to make better election predictions. I mean let's step back for a minute here. When Nate Silver was just poblano, was he not promoting himself? The only difference is that he developed a following here before he published the fivethirtyeight domain to the web. That difference doesn't matter at all. The only thing that matters is the work that Nate Silver does. That's what we should be judging and discussing. Not where or how he publishes it.

Now, I would appreciate it and respect the wishes of these Kossacks had they requested politely that I post the entire blog entry here at DK rather than posting a teaser. In fact, I've already adopted that practice because it seems fair to me. Yet what I can't get behind is the allegation that I am somehow "spamming" DK. Someone actually accused me of unconscionable spamming just today because I failed to back-link to a piece that I referenced in enough detail that anyone could find it with a few taps on their keyboard into a Google search bar.

What I also can't get behind is the possibility that Diary policing is cluttering the discussion section of the DK Diaries of people like me. Shouldn't we be discussing substantive issues here at Daily Kos? I think that's what kos would say.

I also can't get behind a knee-jerk response that includes threatening, putdown comments like, "You'll find that if you don't conform to our standards, you will be banned." I mean seriously? What am I, a fucking Amway salesman? Speak to people you don't even know, and who have spent several hours on original research that they now report to you because they honestly think it will inform you, with some goddamn respect. Honestly. Conform to your standards? Like I'm writing some kind of product description marketing tripe? Oh, wait, that's half the breaking news at TechCrunch or c|net. Come off your anonymity-inflated high horse.

Anyway, I propose a solution that will make the self-styled Diary police happy, and diminish the clutter of Diaries with unnecessary Diary policing. First, we make a new tag called "Syndicated". Every time someone like me syndicates a blog post here on DK, we add that tag. No exceptions. I'm going to adopt that policy starting today. Second, we create an easy-to-use mechanism to block certain tags, such as "Syndicated". That way, the Diary police don't have to read my original research, and can instead read the 100-word descriptions that over half of Kossacks make about stuff that other people write, which are basically glorified back-links, but which are apparently okie dokie by the standards of the DK Diary police. Third, we write some guidelines for syndication. These guidelines would include a rule that you publish the original blog post verbatim rather than a teaser. I would like this to include an exception for blog posts that originally contain a lot of graphics, because those are time-consuming to reproduce. But honestly, I wouldn't even care if the rules required complete syndication, including graphics. I'm here to spread good ideas, and to promote myself. So long as those two goals are tightly interwoven, I'd appreciate it if no one calls me a spammer again. Instead, I'd like you to discuss with me what I've written. Otherwise, I'd like you to go away.


This post was originally published at Malark-O-Meter, which statistically analyzes fact checker rulings to judge and compare the factuality of politicians, and to measure our uncertainty about those judgments. I also provide commentary on election prediction algorithms and fact checking methodology.

Glenn Kessler, Fact Checker at The Washington Post, gave two out of four Pinocchios to Barney Frank, who claimed that GOP gerrymandering allowed Republicans to maintain their House majority. Kessler would have given Frank three Pinocchios, but Frank publicly recanted his statement in a live television interview. Here at Malark-O-Meter, we equate a score of three Pinocchios with a PolitiFact Truth-O-Meter score of "Mostly False". Kessler was right to knock off a Pinocchio for Barney's willingness to publicly recant his claim. I'll explain why Kessler's fact check was correct, and why he was right to be lenient on Frank.

Frank was wrong because, as a Brennan Center for Justice report suggests, the Democrats wouldn't have won the House majority even under the district lines in place before the 2010 redistricting. Although the Republicans clearly won the latest redistricting game, redistricting doesn't fully explain how they maintained their majority. The other factor is geography. Dan Hopkins at The Monkey Cage cited a study by Chen and Rodden showing that Democrats are clustered inefficiently in urban areas. Consequently, they get big Congressional wins in key urban districts, but at the cost of small margin losses in the majority of districts. (And no, fellow fans of the Princeton Election Consortium, it doesn't matter that the effect is even bigger than the one Sam Wang predicted; it's still not only because of redistricting.)

So why was Kessler right to knock off a Pinocchio for Barney's willingness to recant? At Malark-O-Meter, we see fact checker report cards as a means to measure the overall factuality of individuals and groups. If an individual recants a false statement, that individual's marginal factuality should go up in our eyes for two reasons. First, that person made a statement that adheres to the facts. Second, the act of recanting a falsehood is a testament to one's adherence to the facts.

Regardless of its causes and no matter what Barney's malarkey score ends up being because of his remarks about it, what do we make of the disparity between the popular vote and the House seat margin, which has occurred only three other times in the last century? Should we modify U.S. Code, Title 2, Chapter 1, Section 2c (2 USC § 2c), which became law in 1967 and requires states with more than one apportioned Representative to be divided into one-member districts? Should we instead go with a general ticket, which gives all House seats to the party that wins a state's popular vote? Is there some sensible middle ground? (Of course there is.)

The answer to these questions depends critically on the role we want the geographic distribution of the U.S. population to play in determining the composition of the House. The framers of the Constitution meant for the House of Representatives to be the most democratic body of the national government, which is why we apportion Representatives based on the Census, and why there are more Representatives than Senators. Clearly, it isn't democratic for our redistricting rules to be vague enough that a party can benefit simply by holding the House majority in a Census year. Is it also undemocratic to allow the regional geography of the United States to determine the House composition?

I don't think so. Instead, the geographic distribution of humans in the United States should determine the House composition. There are a bunch of redistricting algorithms out there that would help this happen. The underlying theme of the best algorithms is that Congressional districts should have comparable population size. Let's just pick an algorithm and do it already. And if we're not sure which of these algorithms is the best one, let's just do them all and take the average.


This post was originally published at Malark-O-Meter, where Brash Equilibrium (aka Benjamin Chabot-Hanowell, aka me) statistically analyzes fact checker data and attempts to influence election forecasting methodology. Help me do the latter by passing this essay around to your geek friends. And while you're at it, check out my fact checking analyses of the 2012 election. More in-depth fact checker analysis is forthcoming. Okay, on with the show.

In the aftermath of the 2012 election, campaign prognosticators Nate Silver, Simon Jackman, Drew Linzer, and Sam Wang have made preliminary quantitative assessments of how well their final predictions played out. Others have posted comparisons of these and other election prediction and poll aggregation outfits. Hopefully, we'll one day compare and combine the models based on their long term predictive power. To compare and combine models effectively, we need a good quantitative measure of their accuracy. The prognosticators have used a measure called the Brier score to measure the accuracy of their election eve predictions of the election outcome in each state. Despite its historical success in measuring forecast accuracy, the Brier score fails in several ways as a forecast score. I'll review its inadequacies and suggest a better method.

The Brier score measures the accuracy of binary probabilistic predictions. To calculate it, take the average squared difference between the forecast probability of a given outcome (e.g., Obama winning the popular vote in California) and the observed outcome (e.g., one if Obama won, zero if he didn't). The higher the Brier score, the worse the predictive accuracy. As Nils Barth suggested to Sam Wang, you can also calculate a normalized Brier score by subtracting four times the Brier score from one. The normalized score equals one for a model that perfectly predicted the outcomes and zero for a model that assigned a probability of one half to every outcome. The higher the normalized Brier score, the greater the predictive accuracy.
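As a minimal sketch of the calculation, here are both scores in Python. The forecast probabilities and outcomes below are made-up numbers for illustration, not anyone's actual 2012 forecasts, and the function names are my own.

```python
def brier_score(forecasts, outcomes):
    """Average squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def normalized_brier_score(forecasts, outcomes):
    """One minus four times the Brier score: 1 is perfect, 0 matches all-50% forecasts."""
    return 1 - 4 * brier_score(forecasts, outcomes)

# Hypothetical forecast probabilities that a candidate wins three states,
# and whether the candidate actually won each one (1) or not (0).
forecasts = [0.9, 0.2, 0.6]
outcomes = [1, 0, 1]

print(brier_score(forecasts, outcomes))             # about 0.07
print(normalized_brier_score(forecasts, outcomes))  # about 0.72
```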

Because the Brier score (and its normalized cousin) measure predictive accuracy, I've suggested that we can use them to construct certainty weights for prediction models, which we could then use when calculating an average model that combines the separate models into a meta-prediction. Recently, however, I've discovered research in the weather forecasting community that suggests a better and more intuitive way to score forecast accuracy. This score has the added benefit of being directly tied to a well-studied model averaging mechanism. Before describing the new scoring method, let's describe the problems with the Brier score.

Jewson (2004) notes that the Brier score doesn't deal adequately with very improbable or probable events. For example, suppose that the probability that a Black Democrat wins Texas is 1 in 1000. Suppose we have one forecast model that predicts Obama will surely lose in Texas, whereas another model predicts that Obama's probability of winning is 1 in 400. Well, Obama lost Texas. The Brier score would tell us to prefer the model that predicted a sure loss for Obama. Yet the model that gave him a small probability of winning is closer to the "truth" in the sense that it estimates he has a small probability of winning. That seems counter-intuitive. In addition to its poor performance scoring highly improbable and probable events, the Brier score doesn't perform well when scoring very poor forecasts (Benedetti 2010; sorry for the pay wall).
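The arithmetic behind the Texas example is short enough to write out. This sketch only scores the single Texas forecast from each of the two hypothetical models described above.

```python
def squared_error(forecast, outcome):
    """Contribution of one forecast to the Brier score."""
    return (forecast - outcome) ** 2

outcome = 0  # Obama lost Texas

sure_loss = squared_error(0.0, outcome)         # model certain of a loss
small_chance = squared_error(1 / 400, outcome)  # model giving a 1-in-400 chance

print(sure_loss, small_chance)
```

The sure-loss model scores exactly zero, while the 1-in-400 model scores roughly 0.00000625, so the Brier score ranks the sure-loss model as better on this state.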

These issues with the Brier score should give prognosticators pause for two reasons. First, they suggest that the Brier score will not perform well in the "safe" states of a given party. Second, they suggest that Brier scores will not perform well for models whose predictions were poor (here's lookin' at you, Bickers and Berry). So what should we do instead? It's all about the likelihood. Well, actually its logarithm.

Both Jewson and Benedetti convincingly argue that the proper score of forecast accuracy is something called the log likelihood. A likelihood is the probability of a set of observations given the model of reality that we assume produced those observations. As Jewson points out, the likelihood in our case is the probability of a set of observations (i.e., which states Obama won) given the forecasts associated with those observations (i.e., the forecast probability that Obama would win those states). A score based on the log likelihood penalizes models that are very certain and turn out to be wrong, giving the lowest possible score to a model that was perfectly certain of an outcome that did not occur.
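Here is a minimal sketch of a log likelihood score for binary forecasts, assuming the state outcomes are treated as independent given the forecasts. The forecast numbers are invented for illustration.

```python
import math

def log_likelihood(forecasts, outcomes):
    """Sum of the log probabilities that each forecast assigned to what happened."""
    total = 0.0
    for p, won in zip(forecasts, outcomes):
        # Probability the model gave to the outcome that actually occurred.
        prob_of_observed = p if won else 1 - p
        if prob_of_observed == 0:
            return -math.inf  # certain of something that did not happen
        total += math.log(prob_of_observed)
    return total

outcomes = [1, 0, 1]

hedged = [0.9, 0.2, 0.6]         # hypothetical model with some doubt
overconfident = [1.0, 0.0, 0.0]  # hypothetical model, certain and wrong on state 3

print(log_likelihood(hedged, outcomes))         # about -0.84
print(log_likelihood(overconfident, outcomes))  # -inf
```

A log likelihood closer to zero is better. One certain-but-wrong forecast is enough to drag a model's score to negative infinity, no matter how well it did elsewhere.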

To compare the accuracy of two models, simply take the difference in their log likelihoods. To calculate model weights, first subtract the highest log likelihood across all the models from the log likelihood of each model. Then exponentiate the difference you just calculated. Then divide the exponentiated difference of each model by the sum of those values across all the models. Voila. A model averaging weight.
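A minimal sketch of this weighting, assuming each model's score is its log likelihood (higher is better), so the difference is taken relative to the best-scoring model. The three log likelihood values are made up for illustration.

```python
import math

def model_weights(log_likelihoods):
    """Turn log likelihood scores into model averaging weights that sum to one."""
    best = max(log_likelihoods)
    # Differences are zero for the best model and negative for the rest.
    relative = [math.exp(ll - best) for ll in log_likelihoods]
    total = sum(relative)
    return [r / total for r in relative]

# Hypothetical log likelihoods for three forecasting models.
scores = [-0.8, -1.5, -3.0]
weights = model_weights(scores)
print(weights)
```

Subtracting the best score before exponentiating doesn't change the final weights. It just keeps the exponentials from underflowing when the log likelihoods are large negative numbers.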

Some problems remain. For starters, we haven't factored Occam's razor into our scoring of models. Occam's razor, of course, is the idea that simpler models are better than complex models all else equal. Some of you might notice that the model weight calculation in the previous paragraph is identical to the model weight calculation method based on the information criterion scores of models that have the same number of variables. I argue that, for now, this isn't really an issue. What we're doing is measuring a model's predictive accuracy, not its fit to previous observations. I leave it up to the first order election prognosticators to decide which parameters they include in their model. In making meta election forecasts, I'll let the models' actual predictive performance decide which ones should get more weight.


Should we penalize election forecast model weights by the number of parameters they estimate?

53% (8 votes)
33% (5 votes)
13% (2 votes)

15 votes total


Sun Nov 11, 2012 at 09:03 AM PST

Aggregatio ad absurdum

by Brash Equilibrium

A funny short story about the triumph and perils of endless recursions in meta-analysis. NOT a critique of meta-analysis itself. Originally published at my blog, Malark-O-Meter (and reproduced in its entirety here). At Malark-O-Meter, I do statistical analysis of fact checker report cards and try in vain to influence election prediction methodology.

Once upon a time, there was a land called the United States of America, which was ruled by a shapeshifter whose physiognomy and political party affiliation were recast every four years by an electoral vote, itself a reflection of the vote of the people. For centuries, the outcome of the election had been foretold by a cadre of magicians and wizards collectively known as the Pundets. Gazing into their crystal balls at the size of crowds at political rallies, they charted the course of the shapeshifting campaign. They were often wrong, but people listened to them anyway.

Then, from the labyrinthine caves beneath the Marginuvera Mountains emerged a troglodyte race known as the Pulstirs. Pasty of skin and snarfy in laughter, they challenged the hegemony of the Pundet elite by crafting their predictions from the collective utterances of the populace. Trouble soon followed. Some of the powerful new Pulstir craftsmen forged alliances with one party or another. And as more and more Pulstirs emerged from Marginuvera, they conducted more and more puls.

The greatest trouble came, unsurprisingly, from the old Pundet guard in their ill-fated attempts to merge their decrees with Pulstir findings. Unable to cope with the number of puls, unwilling to so much as state an individual pul's marginuvera, the Pundets' predictions confused the people more than they informed them.

Then, one day, unbeknownst to one another, rangers emerged from the Forests of Metta Analisis. Long had each of them observed the Pundets and Pulstirs from afar. Long had they anguished over the amount of time the Pundets spent bullshyting about what the ruler of America would look like after election day rather than discussing in earnest the policies that the shapeshifter would adopt. Long had the rangers shaken their fists at the sky every time Pundets with differing loyalties supported their misbegotten claims with a smattering of gooseberry-picked puls. Long had the rangers tasted vomit at the back of their throats whenever the Pundets at Sea-en-en jabbered about it being a close race when one possible shapeshifting outcome had been on average trailing the other by several points in the last several fortnights of puls.

Each ranger retreated to a secluded cave, where they used the newfangled signal torches of the Intyrnet to broadcast their shrewd aggregation of the Pulstirs' predictions. There, they persisted on a diet of espresso, Power Bars, and drops of Mountain Dew. Few hours they slept. In making their predictions, some relied only on the collective information of the puls. Others looked as well to fundamental trends of prosperity in each of America's states.

Pundets on all (by that, we mean both) sides questioned the rangers' methods, scoffed at the certainty with which the best of them predicted that the next ruler of America would look kind of like a skinny Nelson Mandela, and would support similar policies to the ones he supported back when he had a bigger chin and lighter skin, was lame of leg, and harbored great fondness for elegantly masculine cigarette holders.

On election day, it was the rangers who triumphed, and who collectively became known as the Quants, a moniker that was earlier bestowed upon another group of now disgraced, but equally pasty rangers who may have helped usher in the Great Recession of the early Third Millennium. The trouble is that the number of Quants had increased due to the popularity and controversy surrounding their predictions. While most of the rangers correctly predicted the physiognomy of the president, they had differing levels of uncertainty in the outcome, and their predictions fluctuated to different degrees over the course of the lengthy campaign.

Soon after the election, friends of the Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Quants had aggregated the puls to form better predictions about the outcome of the election, we could aggregate the aggregates to make our predictions yet more accurate.

Four years later, the Meta-Quants broadcast their predictions alongside those of the original Quants. Sure enough, the Meta-Quants predicted the outcome with greater accuracy and precision than the original Quants.

Soon after the election, friends of the Meta-Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Meta-Quants had aggregated the Quants to form better predictions about the outcome of the election, we could aggregate the aggregates of the aggregates to make even better predictions.

Four years later, the Meta-Meta-Quants broadcast their predictions alongside those of the Quants and the Meta-Quants. Sure enough, the Meta-Meta-Quants predicted the outcome with somewhat better accuracy and precision than the Meta-Quants, but not as much better as the Meta-Quants had over the Quants. Nobody really paid attention to that part of it.

Which is why, soon after the election, friends of the Meta-Meta-Quants, who had also trained in the Forests of Metta Analisis, made a bold suggestion. They argued that, just as the Meta-Meta-Quants had aggregated the Meta-Quants to form better predictions about the outcome of the election, we could aggregate the aggregates of the aggregates of the aggregates to make even better predictions.


One thousand years later, the (Meta x 253)-Quants broadcast their predictions alongside those of all the other types of Quants. By this time, 99.9999999% of Intyrnet communication was devoted to the prediction of the next election, and the rest was devoted to the prediction of the election after that. A Dyson Sphere was constructed around the sun to power the syrvers necessary to compute and communicate the prediction models of the (Meta x 253)-Quants, plus all the other types of Quants. Unfortunately, most of the brilliant people in the Solar System were employed making predictions about elections. Thus the second-rate constructors of the Dyson Sphere accidentally built its shell within the orbit of Earth, blocking out the sun and eventually causing the extinction of life on the planet.

The end.
