Charlie Cook, writing at the National Journal, thinks that analytics cost Clinton the election:
The reliance, or perhaps overreliance on analytics, may be one of the factors contributing to Clinton’s surprise defeat. The Clinton team was so confident in its analytical models that it opted not to conduct tracking polls in a number of states during the last month of the campaign. As a consequence, deteriorating support in states such as Michigan and Wisconsin fell below the radar screen, slippage that traditional tracking polls would have certainly caught.
This, apparently, is a refutation of the notion that data has more value than anecdote:
Experienced journalists might argue that the overreliance by reporters on both polls and analytics has led to a decrease in shoe-leather, on-the-ground reporting that might have picked up movements in the electorate that the polls missed. As the Michigan results came in on election night, I vividly recalled that two congressmen from Michigan—one a Democrat, the other a Republican—had been warning me for months that Michigan was more competitive than publicly thought. I wished I had listened.
The analytical models for both sides pointed to a Clinton victory, albeit not a runaway. The Clinton campaign and super PACs had several of the most highly regarded polling firms in the Democratic Party, yet in the places that ended up mattering, very little if any polling was done. So while 2016 wasn’t a victory for traditional polling, it certainly took a lot of the luster from analytics. In the end, big data mattered very little.
The above is both true and misleading. The problem described here is not an inherent flaw in using data to drive decisions. The flaw here is in thinking that your model is what is important, when in fact your model inputs are what is important. Your algorithm/model/insert-buzzword-here is not as important as the data you use to fill it. If the Clinton campaign did not catch the problems in the upper Midwest in time to correct them, then they missed those problems not because they relied on data but because they relied on untrustworthy or out of date data. That is a failure of their data practitioners, not the idea of data driven decision making. A model is only as good as the data it is fed.
Data driven decision making requires accurate data in a timely manner. If that data is incorrect in some fashion, or if that data is out of date, then the model will not be as accurate as is needed. If you are using a model to predict something, you must take pains to continually feed it data, and you must take pains to ensure that the data you feed it is as accurate as possible. (Accurate here meaning as close to representative of the current state of the electorate's mood as possible.) At some point, the Clinton campaign either decided to stop updating their model for the Midwest states or they relied on state level polls from outside organizations. The latter results in bad data since those organizations, mostly media companies, as the article points out, do not have the resources to do the most accurate kind of polling. As a result, the Clinton models were apparently wrong enough to cost her the election.
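To make the point concrete, here is a minimal sketch of how stale inputs mislead even a perfectly reasonable model. The "model" is just an average of recent polls, and every number, date, and field name is invented for illustration; the Clinton campaign's actual models were of course far more sophisticated. The mechanism, though, is the same: if the feed of data stops, the model keeps reporting the world as it was.

```python
from datetime import date

# Hypothetical toy "model": average the candidate's support across the
# polls it has been fed, ignoring anything older than max_age_days.
# All polls and numbers below are invented for illustration.
def predict_support(polls, as_of, max_age_days=60):
    """Average support from polls no older than max_age_days."""
    recent = [p["support"] for p in polls
              if (as_of - p["date"]).days <= max_age_days]
    if not recent:
        raise ValueError("no usable polls: data too stale to predict from")
    return sum(recent) / len(recent)

election_day = date(2016, 11, 8)

# Polling stops in early October; any late movement is invisible.
stale_polls = [
    {"date": date(2016, 10, 1), "support": 48.0},
    {"date": date(2016, 10, 5), "support": 47.5},
]

# A feed that kept running and captured late-October slippage.
fresh_polls = stale_polls + [
    {"date": date(2016, 10, 28), "support": 45.0},
    {"date": date(2016, 11, 4), "support": 44.0},
]

print(predict_support(stale_polls, as_of=election_day))  # 47.75, a rosy picture
print(predict_support(fresh_polls, as_of=election_day))  # 46.125, the shift shows
```

Nothing about the averaging is wrong in either case; the only difference is whether the model was still being fed. That is the whole argument in miniature.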
That is not a flaw in the notion of using data to drive decisions. It merely highlights something data practitioners should keep in mind: no model can protect you from the age-old problem of garbage in, garbage out. The Insight Fairy will not arrive at your doorstep the day you stuff your data into a Hadoop instance and hire a couple of folks who know how to use MapReduce or Spark. Making decisions based on data means that you must see to the care and feeding of the model you depend on, and that means making sure you have accurate, up to date data. The most important aspect of any data project is not the model or algorithm. The process by which you obtain and prepare data to make it as accurate as possible, and the continuous updating of the model with new data and new learnings, are what make data driven decisions worthwhile. Absent that, you are basically left with a slightly more complicated version of reading goat entrails. And you don't even get to enjoy the roasted goat.
Does this mean, as Cook implies, that "shoe leather reporting" would have been superior? Not really, or at least not always. First, anecdotes, which is what this kind of "talk to the voters" reporting really amounts to given its tiny sample size, are not likely to be more accurate than most kinds of large data sets. Remember, Romney thought he was going to win in 2012 in part because of the size and enthusiasm of his crowds. Anecdotal data can be used as a trigger, a means by which to investigate potential truths. But it is not really an approximation of the truth. However, there are times when data driven decision making may be the wrong approach.
It is important to remember that the Clinton data people were likely well aware of how to run a decent data project. I doubt very much that they did not know they needed accurate data, that they had to update their data regularly, or that their model had to take into account voter sentiment (of which polls would be one very important measure). These are points that anyone working with data systems for any meaningful amount of time would recognize. They are obvious to me and I am a nobody in terms of data systems; I am sure they were obvious to Clinton's data people. But, for some reason, the obvious was not acted upon. Why not?
I am obviously speculating as I have no connection to the Clinton campaign or anyone in it, but I suspect that they fed their model the state polls. Perhaps they didn't realize or take seriously the flaws in those polls. Perhaps they didn't fight hard enough for the resources needed to conduct high quality polling of their own, or perhaps those requests were swatted away by higher-ups who didn't really understand what data driven decisions actually required. And that situation, when you cannot get or maintain a flow of quality data, is exactly when you should avoid a data driven decision making process.
Again, data driven decision making processes require accurate data updated in a timely fashion. No matter how clever you think your algorithm, it is garbage if fed bad or stale data. If you want to make decisions based on data, you have to spend the time and effort to acquire good data, clean that data to ensure accuracy, and process that data into a format that your model can effectively use. If you want your model to predict the future in a dynamic domain space such as an election, then it needs to be constantly fed data that is accurate, well cleaned, and timely. If you do not have the skills or resources to properly manage your data and ensure a constant flow of it is sent to your model, then you should not be relying on your model to drive decisions. It is much better to avoid the garbage such a system will spit out.
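One cheap way to enforce that discipline in practice is to put a sanity gate between incoming data and the model, so that stale or dubious inputs never reach it. The sketch below is a hypothetical illustration of that idea; the field names, thresholds, and polls are all invented, and a real pipeline would check far more than this.

```python
from datetime import date

# Hypothetical sanity gate for incoming poll data. Field names and
# thresholds are invented for illustration only.
def validate_poll(poll, as_of, max_age_days=14, min_sample=400):
    """Return a list of reasons this poll should NOT be fed to the model."""
    problems = []
    if (as_of - poll["date"]).days > max_age_days:
        problems.append("stale: older than %d days" % max_age_days)
    if poll.get("sample_size", 0) < min_sample:
        problems.append("sample too small to be reliable")
    if not 0.0 <= poll.get("support", -1.0) <= 100.0:
        problems.append("support value out of range")
    return problems

polls = [
    {"date": date(2016, 11, 4), "support": 44.0, "sample_size": 900},
    {"date": date(2016, 9, 20), "support": 48.0, "sample_size": 900},  # stale
    {"date": date(2016, 11, 2), "support": 47.0, "sample_size": 150},  # tiny sample
]

as_of = date(2016, 11, 8)
usable = [p for p in polls if not validate_poll(p, as_of)]
print(len(usable))  # only the fresh, adequately sampled poll gets through
```

The key design choice is that the gate fails loudly: a poll that cannot be trusted is excluded, and if nothing gets through, the model goes silent rather than confidently emitting garbage.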
If the Clinton campaign was failed by data, it was not failed by the idea of data driven decisions. It was failed by their inability to feed the model what it needed: accurate, timely data. Whether due to a lack of understanding or a lack of resources, they put garbage into a model and, unsurprisingly, got garbage out.