I may overuse the parable about the blind men describing the elephant when talking about polling and pollsters, and why we aggregate … but that’s because it’s a great metaphor for what’s going on. Think of the electorate as the elephant, and the pollsters as the men trying to describe the elephant, one small sample at a time, except one pollster has his hand on its trunk, one on its tusk, one on its tail, and so on. Individually, their stories seem to contradict one another, and might sound bizarre taken on their own. (“The electorate is sharp and pointy! No, the electorate is long and bristly!”)
That’s never been clearer than this week, when The Upshot, on the New York Times website, commissioned a poll from Siena and then gave the raw data to four other pollsters to see what final results they would arrive at. Remarkably, the results ranged from Clinton +4 to Trump +1. Mind you, this wasn’t four different samples; this was four different interpretations of the same sample, using different judgments about how to determine likely voters and how to weight the various demographic subsamples! It really reaffirmed that there’s no one right poll, only a range of acceptable answers.
What the men describing the elephant really needed was someone off to the side, jotting down all the different descriptions and trying to assemble them into a complete picture of an elephant. If you get enough of the conflicting samples in one place, and try to put them in an order that makes sense, you might actually get a clearer sense of what everyone is fumbling around with. That’s where poll aggregation comes in! It gives you more data points to flesh out that “range of acceptable answers.”
(I’m not using Sam Wang’s much grosser metaphor: “Think of each poll as being a mystery body part; what PEC provides is the head cheese.”)
Unfortunately, even the job of being the guy on the sidelines, compiling all the reports of tusks and trunks, doesn’t come with an objective set of rules. Do we listen to every description equally, or do we give extra weight to the guys who already have a decent track record of describing animals? Do we add in some historical precedent, like whether there have been other elephant sightings in recent years? Do we try to make small corrections to the descriptions when we’re pretty sure it’s an elephant, but some of the describers keep insisting they feel fur? Do we outright discard the reports from the describer who seems clearly drunk (we’ll just call that guy “Emerson”)?
That’s why different aggregators can look at the same batch of polls and arrive at a variety of predicted outcomes, to the extent that there were days when Wang was projecting “mostly sunny” while Nate Silver was projecting “plague of locusts, with occasional blood showers.” In a circumspect moment on Wednesday, Silver tweeted out an extremely interesting graph attempting to analyze what specifically differed between the various major models.
Whether or not to use “fundamentals” (and, if so, which fundamentals to use) is a key distinction; that’s a question we get asked a lot when people don’t like what the model is doing, and it’s certainly a fair one. (Our model uses Alan Abramowitz’s “Time for a Change” model as its fundamentals piece, and even Abramowitz himself has questioned how well it applies to this odd year. One potential criticism is that there haven’t been enough post-22nd Amendment presidential elections to know just how big a penalty the party in power should pay for seeking a third term. Another potential problem is that the economic input is only the 2nd-quarter GDP figure … which might unnecessarily penalize Hillary Clinton’s chances this year, since the weak 1.1 percent GDP growth is counteracted in the real world by a number of more positive indicators the model never sees, like low unemployment, income growth, and rising consumer confidence.)
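For the curious, the bones of that fundamentals calculation fit in a few lines of Python. This is a minimal sketch only: the coefficients are placeholders standing in roughly for the published regression estimates, and the +4 net-approval input is just an illustrative figure, but the three inputs (the president’s late-June net approval, second-quarter GDP growth, and a first-term-incumbency dummy) are the ones Abramowitz’s model actually uses.

```python
# Sketch of the structure of Abramowitz's "Time for a Change" model.
# The coefficients below are illustrative placeholders, NOT the
# published values; the real model is an OLS regression fit on
# postwar presidential elections.

def time_for_a_change(net_approval, q2_gdp_growth, first_term_incumbent):
    """Predict the incumbent party's share of the two-party popular vote.

    net_approval:          president's net approval (approve - disapprove), late June
    q2_gdp_growth:         annualized real GDP growth in Q2 of the election year
    first_term_incumbent:  1 if the incumbent party is seeking only a second
                           term, else 0 (losing this bonus is the
                           "time for a change" penalty)
    """
    intercept, b_approval, b_gdp, b_incumbency = 47.3, 0.11, 0.54, 4.3  # placeholders
    return (intercept
            + b_approval * net_approval
            + b_gdp * q2_gdp_growth
            + b_incumbency * first_term_incumbent)

# 2016 from the incumbent (Democratic) party's perspective: Q2 GDP growth
# was 1.1 percent, the party is seeking a third term (so the incumbency
# bonus drops out), and we plug in an illustrative +4 net approval.
print(time_for_a_change(net_approval=4, q2_gdp_growth=1.1, first_term_incumbent=0))
```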
One other key distinction, which Nate didn’t include on his chart, is how big a net each aggregator casts for polls, and how much the polls get manipulated once they’re caught. The Upshot and PEC, as far as I know, just use the Huffington Post Pollster database as their source; this leads to some polls being included that we don’t include (like the Ipsos/Reuters 50-state “poll,” where some states had unacceptably low sample sizes), or excluded that we do include (like Emerson, who gets bounced by HuffPo for only calling landlines). FiveThirtyEight and Daily Kos Elections have their own curated databases, though FiveThirtyEight seems to have a more active system of adjusting results for pollster quality, while the only adjustment we make is a penalty for partisan polls.
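To make that concrete, here’s a toy sketch of what a poll-intake step like that might look like. Everything specific here is hypothetical, invented for illustration: the field names, the sample-size floor, and the flat 2-point partisan penalty are not the actual rules of our database or anyone else’s.

```python
# Toy poll-intake step: filter out polls that fail inclusion rules, then
# shade partisan polls back toward the sponsoring party's opponent.
# The 300-respondent floor and 2-point penalty are hypothetical.

MIN_SAMPLE_SIZE = 300  # hypothetical floor; tiny subsamples get tossed

def clean_polls(polls):
    usable = []
    for poll in polls:
        if poll["sample_size"] < MIN_SAMPLE_SIZE:
            continue  # e.g., the small state subsamples of a 50-state poll
        margin = poll["dem_margin"]
        if poll.get("partisan") == "D":    # penalize internal/partisan polls
            margin -= 2.0
        elif poll.get("partisan") == "R":
            margin += 2.0
        usable.append({**poll, "adj_margin": margin})
    return usable

polls = [
    {"pollster": "Generic U.", "sample_size": 650, "dem_margin": 4.0},
    {"pollster": "Tiny subsample", "sample_size": 80, "dem_margin": 9.0},
    {"pollster": "Campaign internal", "sample_size": 500, "dem_margin": 6.0, "partisan": "D"},
]
print(clean_polls(polls))  # keeps two polls; the internal gets shaded to +4
```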
The most important distinction, though, may be the mysterious-sounding one in the middle column: “confidence level.” This refers, essentially, to how much each individual state in the model takes cues from the other states around it; in other words, how much their behavior is correlated. FiveThirtyEight seems to be the most aggressive about treating the states as moving in tandem (which the chart expresses as “low” confidence: low confidence, that is, that a small lead would hold up, because if a candidate is faltering in some states, he or she is probably faltering in all of them).
The Upshot, in their own recent piece comparing the different aggregators, found a visual way to make that remarkably clear. It’s the two histograms at the top of this article, comparing the distribution of all our simulations vs. FiveThirtyEight’s simulations. Our distribution is a higher, narrower pile. What that tells us (probably) is that there’s less correlation between the states built into our model. In other words, if something goes wrong in Ohio, it’s only a little likelier that something starts going wrong in, say, Maine, as well.
With FiveThirtyEight’s model, though, the states look like they’re yoked together a little more. That’s why they have a lower, flatter distribution: you get a lot more scenarios where things go either really well or really poorly for Clinton, meaning a lot more scenarios where she ends up in either the 200-electoral-vote range (where she starts losing blue states) or the 400-electoral-vote range (where she starts winning red states), and fewer scenarios clustered in the usual mundane 290-320 range. In other words, their model has much longer tails on both sides, and if you look closely, the red tail on the left is a little fatter, which is an indication of how their model ends up with more simulations than ours where Donald Trump wins.
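If you want to see that knob in action, here’s a toy Monte Carlo sketch. The safe-state total and the state margins are made up, and the real models are far more elaborate; the only thing that changes between the two runs is how much of the error comes from a single national swing shared by every state versus independent state-by-state noise.

```python
import random

# Toy Monte Carlo sketch of why inter-state correlation fattens the tails.
# Safe-state totals and margins are invented for illustration. Each
# simulation draws one national swing shared by every state, plus
# independent state-level noise; the ratio of the two is the "knob."

SAFE_CLINTON_EV = 200  # made-up total from states not simulated here
STATES = {             # electoral votes, made-up Clinton margin in points
    "FL": (29, 2.0), "PA": (20, 4.0), "OH": (18, 1.0),
    "NC": (15, -1.0), "NV": (6, 2.0),
}

def simulate(n_sims, national_sd, state_sd):
    totals = []
    for _ in range(n_sims):
        swing = random.gauss(0, national_sd)  # one draw, shared by all states
        ev = SAFE_CLINTON_EV
        for evs, margin in STATES.values():
            if margin + swing + random.gauss(0, state_sd) > 0:
                ev += evs
        totals.append(ev)
    return totals

for label, nat_sd, st_sd in [("mostly independent", 0.5, 4.0),
                             ("mostly yoked", 4.0, 0.5)]:
    sims = simulate(10_000, nat_sd, st_sd)
    mean = sum(sims) / len(sims)
    sd = (sum((x - mean) ** 2 for x in sims) / len(sims)) ** 0.5
    print(f"{label}: mean {mean:.0f} EV, spread {sd:.1f}")
```

Run it and the “mostly yoked” version lands on roughly the same average electoral vote count, but with a much wider spread: a lower, flatter pile with longer tails, just like the histograms above.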
(If you really want your mind blown, check out this video of a Galton board simulation that uses different-sized balls. A Galton board is a board full of pegs: you drop balls down it and watch which way they bounce as they fall, and if you drop enough of them, it shows you what a normal distribution looks like. The Daily Kos Elections model distribution looks just like the “large ball” simulation, where the balls do a lot of crashing into one another and thus largely revert toward the mean, while the FiveThirtyEight model distribution looks like the “medium ball” simulation, where the balls slide through more fluidly and fan out more.)
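A bare-bones version of the idealized board, one ball at a time with no collisions, takes only a few lines; the row and ball counts here are arbitrary. It can’t reproduce the ball-on-ball crashing from the video, but it does show where the textbook bell curve comes from.

```python
import random
from collections import Counter

# Bare-bones Galton board: each ball hits ROWS pegs and bounces left or
# right with equal probability, so its final bin is just its number of
# rightward bounces. Drop enough balls and the bins trace out a binomial
# distribution, which is approximately normal.

ROWS, BALLS = 12, 5_000

bins = Counter(
    sum(random.random() < 0.5 for _ in range(ROWS))  # rightward bounces
    for _ in range(BALLS)
)

for b in range(ROWS + 1):  # crude text histogram, one row per bin
    print(f"{b:2d} | {'#' * (bins[b] // 25)}")
```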
To turn the discussion to where our model is today, that correlation between the states is part of why things have stayed fairly stagnant in the model despite two polls bearing good news from Florida (both with 5-point Clinton leads, one from Monmouth and one from St. Leo). Clinton’s overall odds are currently 64 percent, up from the previous day’s 63 but not much different from Monday’s 65. Part of the problem is that even if the model liked what it saw in Florida, it’s not seeing across-the-board momentum in other states (for instance, Wednesday saw two North Carolina polls with small Donald Trump leads, as well as a small Trump lead in yet another Rasmussen poll of Nevada) that would give a little boost to all the states, even ones where there hasn’t been any new data in a while.
That conflicts a bit with this week’s noticeable uptick in Clinton’s national-level polls (best characterized by Wednesday’s 6-point Clinton lead in an NBC/WSJ poll). For one thing, our model, as I often point out, doesn’t use national polls at all. But also keep in mind the old saw that state polls usually lag national polls (presumably because there are simply fewer of them, spread out across more places). It took a while … up until last week, really … for the state polls to fall in a manner consistent with Clinton’s convention bounce slowly ebbing away in late August. So you might need to sit tight a bit longer to see state-poll movement that follows the national polls (and, at that point, it may get attributed to a debate-related bounce instead, if Clinton does indeed get a boost out of the first debate next Monday).
Similarly, the Democratic odds of getting 50 seats or better in the Senate continue to bounce along in total tossup territory, hitting 49 percent odds today. While Democratic Rep. Tammy Duckworth’s odds in Illinois improved quite a bit after a Loras College poll, that was counteracted a bit by that same Rasmussen Nevada poll, which also had Catherine Cortez Masto trailing Joe Heck.