tl;dr
Okay, I wrote too much. The 538 model is overdesigned, and the current scenario has reduced its output to nonsense.
This diary is not about Nate Silver.
DKos and 538 have a long shared history, and for many of us here there is an emotional connection between the two. As 538 has become more and more commercial and the editorial content there has changed from deliberate Democratic boosterism to the flat apolitical tone which ESPN tends to prefer across all of its outlets, it’s easy for those of us who remember Poblano’s posts here to feel betrayed or at least left behind. While I’ve had those thoughts on occasion, I also don’t think that’s fair to him. Expecting someone’s professional output to mirror his personal preferences is unrealistic, particularly in the specific situation he is in.
As 538’s model has drifted out of alignment with other prediction sites there’s been a lot of chatter that Nate has his thumb on the scales — perhaps because of commercial considerations or because he’s a Republican mole or whatever the current conspiracy theory is. Don’t believe it. There is no evidence of it. It’s ugly ad hominem and a nasty mirror image of what Republicans have always said about the site. Since the model is not open it is a hard slur to disprove, but if Nate were caught at it, it would utterly destroy the reputation he’s built, so my read is that he would not do it.
My main contention in this diary is that the 538 model is failing in some specific ways which can be traced to its design. The results we see from the model are legitimate but useless output given the inputs from the last few weeks. I’d love to have an argument about model design with Nate — even in 2012 I felt 538 was massively overinterpreting the polls — but given that the model is an interpretive design and not simply about poll aggregation, the content of that argument boils down to debating what any given poll actually means. This gets somewhat metaphysical, as we’ll see here in a bit.
Models, models, and models.
Let’s take a few minutes and discuss what various prediction systems actually do.
Almost all of the systems out there which we talk about here are referred to as poll aggregators. The idea is that they collect a bunch of polls, then average out the results, and that average is probably closer to reality than any single poll can be. This general idea seems to mostly work correctly, but there are all sorts of pitfalls. Polls all operate in slightly (or not-so-slightly) different ways, so you run into questions about whether a straight average is the best way to do it.
The most widely-read operation running a pure poll-aggregation method is RCP. The famous RCP average is indeed an average — they select some polls, add up the results, and output a single clean number. Individual state averages get plotted on a map, and the results go straight to their EV prediction. Except of course that RCP doesn’t appear to have any strict rules about what polls they select, and thus there’s a large degree of (Republican) editorial bias evident when even a little analysis is applied. I don’t link people to RCP.
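For the concrete-minded, the whole RCP method fits in a few lines. Here is a minimal sketch, with the poll names and margins invented purely for illustration:

```python
# Minimal sketch of a straight RCP-style poll average.
# The pollsters and margins below are invented for illustration only.
polls = [
    {"pollster": "Poll A", "margin": +1.0},   # margin = Clinton minus Trump, in points
    {"pollster": "Poll B", "margin": -2.0},
    {"pollster": "Poll C", "margin": +3.0},
    {"pollster": "Poll D", "margin": +0.5},
]

average_margin = sum(p["margin"] for p in polls) / len(polls)
print(f"Straight average: {average_margin:+.1f}")   # +0.6 with these made-up numbers
```

Everything interesting about RCP happens before this step, in deciding which polls make the list.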
Sam Wang at PEC is a few steps further away from pure aggregation. All the polls are listed out, with a simple, universal rule for removing polls as they age or replacing them when the polling organization re-polls the same race. Rather than averaging, the median poll is used. This tends to reduce the distortions which extreme outliers cause, but also makes individual state results a little more jumpy. A formula is used to convert a lead into a percentage chance of winning a state, and all of the state results go into the most complicated bit of math which interprets this input into a range of modeled possible outcomes. The median and standard deviation of this histogram are plotted with functions that model a possible amount of change before the election, and from that he gets his top-line percentages.
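To make the first two steps concrete, here is a rough sketch of a PEC-style state estimate: take the median of the polls, then convert the lead into a win probability. The conversion below uses a normal CDF with an assumed effective sigma of 3 points; that constant is my stand-in, not Sam Wang's actual formula, and the margins are invented:

```python
import statistics
from math import erf, sqrt

# Sketch of a PEC-style state estimate: median poll margin, then a
# lead-to-probability conversion. The sigma below is an assumption,
# not the formula PEC actually uses.
margins = [+1.0, -2.0, +3.0, +0.5, +1.5]   # invented Clinton-minus-Trump margins

median_margin = statistics.median(margins)

def win_probability(margin, sigma=3.0):
    """Chance the leader holds the lead, assuming normally distributed error."""
    return 0.5 * (1 + erf(margin / (sigma * sqrt(2))))

print(median_margin, round(win_probability(median_margin), 2))
```

The per-state probabilities from this step are what feed the more complicated national histogram math.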
While there is definitely interpretation in this system, it largely is contained in assumptions on how closely polls represent reality, and how quickly the race can change. I tend to disapprove of the “Bayesian” model (reality is not so stable as that), but generally agree with the margin of error he diagnoses in aggregated polling data.
The next step further away is the Huffpost Pollster presidential model. In very general terms, what they do is run simulations, pick the ones that best match the data to date, and then average what the selected simulations project will happen on Election Day. I have not investigated all the math, but the two points of interest for the greater discussion here are that the polls are selected with some editorial rules which are transparent and used (theoretically) without bias, and that the polls are used as solid data points without extra layers of interpretation.
The better-known Pollster graphs (like this one) are basically pure poll aggregation, but using best-fit functions rather than a plain averaging algorithm to output the lines and end numbers highlighted on the right side of the graph. This has the advantage of showing what direction the race is headed in… with the caveat that it can and will change as soon as a new poll is added. It also emphasizes current polls and ages out old data quickly when new data comes in. You can choose alternative best-fit math in the custom mode, and get different results. Don’t turn on the high-sensitivity mode. Just don’t.
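As a crude stand-in for that kind of trend line, assume a recency-weighted straight-line fit (Pollster's real curves use fancier smoothing); the dates, margins, and seven-day decay constant here are all invented:

```python
import numpy as np

# Simplified stand-in for a Pollster-style trend line: fit a line through the
# polls, weighting recent ones more heavily, and read off today's value.
# This is not Pollster's actual smoother; it only shows why the end of the
# line jumps whenever a fresh poll lands.
days_ago = np.array([20, 15, 10, 6, 3, 1])                 # invented field dates
margins  = np.array([+2.0, +3.0, +1.0, 0.0, +1.0, +0.5])   # invented margins

weights = np.exp(-days_ago / 7.0)          # assumed 7-day decay constant on poll age
slope, intercept = np.polyfit(-days_ago, margins, 1, w=weights)
print(f"Trend estimate today: {intercept:+.1f}")
```

Add one new poll to the end of those arrays and the rightmost value moves immediately, which is exactly the behavior described above.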
And then there is 538.
What 538 does is best described as mathematical modeling — the model takes polling data as input, and the black box spits out thousands of scenarios of what might happen. How it goes from the input to the output is not open information, but the descriptions the site has given and the data which it shows us do let us know some of the general shape of what is going on.
The 538 model appears to start with the assumption that all polling data is wrong. This is absolutely true on some level. The entire concept of poll aggregation is based on the idea that individual polls are almost always not going to be exactly correct representations of the entire electorate, and the poll-of-polls will be closer. 538 departs from this starting point by claiming that it can, through analysis of historical (both recent and long-term) data, modify poll results to more accurately project the true shape of reality.
This poll adjustment is actually a multitude of adjustments. Nate goes on at extreme length about the kinds of adjustments involved in the process. From his notes it’s not always clear if specific adjustments are being applied before or after the aggregation step, but a lot of what I’ll be discussing today seems to happen before aggregation.
The adjustment that gets the most attention and that most of us at least notionally agree with comes from 538’s pollster ratings. All pollsters are evaluated for how accurate they are, and what their partisan lean is. Rasmussen (for instance) heavily leans Republican, so their poll numbers are adjusted towards the Democratic candidate. 538 doesn’t usually use the full bias number to adjust polls (there was a technical discussion of this posted a couple months ago but I haven’t tracked down the link) so this factor is usually just a point or two.
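In code terms, the adjustment described above probably looks something like the sketch below. The lean values and the half-strength discount are illustrative assumptions on my part, not numbers from 538's database:

```python
# Sketch of a house-effect adjustment. The lean values and the 0.5 discount
# factor are assumptions for illustration, not 538's published figures.
HOUSE_LEAN = {
    "Rasmussen": -2.0,   # assumed Republican lean, in points (negative = pro-Trump)
    "PPP": +1.0,         # assumed Democratic lean
}

def adjust_margin(raw_margin, pollster, discount=0.5):
    """Shift a poll's Clinton-minus-Trump margin back toward neutral by a
    fraction of the pollster's measured lean."""
    return raw_margin - discount * HOUSE_LEAN.get(pollster, 0.0)

print(adjust_margin(-3.0, "Rasmussen"))   # a Trump +3 from Rasmussen becomes Trump +2
```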
The poll reliability data (and sample size, and time since the poll was taken, and more) also goes to relative weighting of polls. While this is not strictly an adjustment to the poll, it does heavily affect how the poll impacts the aggregate. I’ve seen state poll weightings from .02 to 3.9 at 538, so polls listed on the site will vary in their impact by a couple orders of magnitude. This makes sense — a large, fresh poll from a reputable national pollster should affect the outcome more than an old, tiny poll run as an exercise by a college that does it once every four years. That said, there are some counterintuitive ways in which 538 applies weighting. In particular, aging does not reduce weight as rapidly as one might guess, and polls which have been superseded by polling of the same race by the same pollster keep non-zero value.
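A weighting function with those ingredients might look like the following; the functional forms and constants are my assumptions, chosen only to reproduce the rough spread of weights visible on the site, not 538's actual math:

```python
from math import sqrt

# Illustrative poll-weighting function combining pollster rating, sample size,
# and age. All constants are assumptions; only the general shape matters.
def poll_weight(rating, sample_size, days_old, half_life=14.0):
    recency = 0.5 ** (days_old / half_life)     # assumed ~14-day half-life on age
    size = sqrt(sample_size / 600.0)            # diminishing returns on sample size
    return rating * recency * size

print(poll_weight(rating=1.5, sample_size=2000, days_old=2))    # big, fresh, well-rated
print(poll_weight(rating=0.6, sample_size=300, days_old=21))    # small, stale, weak
```

With a slow half-life like the one assumed here, a two-week-old poll still carries half its original weight, which is the behavior I complain about below.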
There are multiple layers of adjustment towards previous election averages and demographic data (even in the supposed polls-only flavors). Presumably this informs early projections and is given lower weighting as we near election day, but the math to show us just how much this counts is not open.
Adjustments are made which attempt to correlate the status of the race between states. Results of a poll in, say, Missouri cause adjustments to polling of the states nearby it and the states demographically similar to it, and less adjustment to states less similar to it. I picked Missouri here because just this last week there was an update where a single poll of MO by a low-end pollster showing what I would characterize as expected results changed the top-line percentage calculation by more than 1%.
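The mechanics are presumably something like the sketch below: a surprise in one state gets passed through, scaled by similarity, to its neighbors. The correlations and the pass-through factor are invented; the point is only that one Missouri poll can nudge every correlated state and therefore the national top line:

```python
# Sketch of an inter-state correlation adjustment. The margins, correlations,
# and 0.2 pass-through factor are invented for illustration.
state_estimate = {"MO": -8.0, "IA": -3.0, "OH": -1.0, "NV": +1.5}   # Clinton-minus-Trump
correlation_with_MO = {"IA": 0.7, "OH": 0.6, "NV": 0.2}             # assumed similarity

new_mo_poll = -4.0                                 # better for Clinton than MO's average
surprise = new_mo_poll - state_estimate["MO"]      # +4.0 points of "news"

for state, corr in correlation_with_MO.items():
    state_estimate[state] += 0.2 * corr * surprise   # partial pass-through to similar states

print(state_estimate)
```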
And then there are adjustments that are made to reflect the “momentum” of the race. This is a little similar to using the trend line at Pollster to predict where the race is at a given instant, except instead of projecting forward, 538 projects backwards, making a change to the adjustment applied to each individual poll in its system. This is one of the main ways that national polling data is used to modify state polling data. The guide to the model says that momentum is measured over a long period early in the election, and a shorter period near election day. As we’ll see below, “shorter” is still probably more than ten days.
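My guess at the shape of that backwards projection, with the trend rate and pass-through factor invented for illustration:

```python
# Sketch of a "momentum" adjustment projected backwards onto older polls:
# measure a national trend over some trailing window, then shift each poll by
# part of the national movement since its field date. The per-day shift and
# the 0.5 pass-through are assumptions, not 538's parameters.
national_shift_per_day = -0.2     # assumed: race moving 0.2 points/day toward Trump

def momentum_adjusted(margin, days_old, pass_through=0.5):
    """Shift an older poll toward where the national trend says the race is now."""
    return margin + pass_through * national_shift_per_day * days_old

print(momentum_adjusted(+3.0, days_old=12))   # a 12-day-old Clinton +3 becomes about +1.8
```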
After all the adjustments, the state numbers are interpreted into percentage chance of winning, and then fed into a Monte Carlo method (or similar — I’m a layman at the actual math and Nate specifically uses the word simulation, but it’s about creating national race probabilities from the collected output from each run) that delivers the final headline probabilities. Compared to PEC and other systems which use a similar process, Nate uses a probability distribution which assumes a lower reliability for the aggregated polling data, making the output histogram wider, with edge cases more likely.
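A toy version of that final simulation step, assuming a shared national error term on top of independent state errors (which is one way to make edge cases more likely), looks like this; the margins, electoral-vote counts, and sigmas are all illustrative:

```python
import random

# Toy Monte Carlo over a handful of states. A shared national error term makes
# state misses correlated, which fattens the tails of the EV distribution.
# All numbers here are illustrative assumptions.
states = {"FL": (+1.2, 29), "NC": (+0.5, 15), "NV": (+1.5, 6), "OH": (-2.0, 18)}
SAFE_CLINTON_EV = 233                    # assumed electoral votes not simulated here
NATIONAL_SIGMA, STATE_SIGMA = 2.0, 3.0   # assumed error sizes, in points

def simulate_once():
    national_error = random.gauss(0, NATIONAL_SIGMA)   # same draw applied to every state
    ev = SAFE_CLINTON_EV
    for margin, votes in states.values():
        if margin + national_error + random.gauss(0, STATE_SIGMA) > 0:
            ev += votes
    return ev

runs = [simulate_once() for _ in range(20000)]
print("P(Clinton >= 270 EV):", sum(ev >= 270 for ev in runs) / len(runs))
```

Widen either sigma and the headline probability drifts toward 50-50, which is the practical effect of assuming the polls are less reliable.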
All of these adjustments, all of the many, many steps of the model, are put in to try to improve the result. This is all done in good faith, to make it the best model ever. The downside is that every step added to a process like this adds another way that things can go wrong. Each additional way that something can go wrong multiplies the chances of something actually going wrong. Given that the basic premise of the model is that it will adjust the poll data to match reality, “going wrong” means that it’s not modeling the real world any more.
This is ultimately my argument with Nate Silver: While there is plenty of reason to not trust polls (and that’s worse this year than in 2012), there is also a lot of evidence that basic polling aggregation is trustworthy. A poll is intended to be a snapshot of the period it was taken. Changing its results to make it representative of more than that means that you’re effectively losing the data in the poll. Aggregating this sort of created data multiplies the effect, especially if you allow trend data to become self-referential. Pretty soon your model is just lost in space, and everyone is looking at you weirdly.
In 2008, when this kind of analysis was fairly new and the election season had only a few swings, I felt that it did a fabulous job. In 2012, as I began looking deeper into what made the numbers act the way they did, I started to actively dislike the polls-plus flavor and favor the now-cast — and even that was occasionally iffy. This year, while there have been moments where I think the model has been basically “right”, the swings in the race appear to break 538 until it settles out again.
Some “real-world” examples:
I really don’t want to get into “unskewing” 538, but there are two important states where the model currently favors Trump that I want to explore in some detail.
(Not sure just how much I can copy from 538 within the bounds of fair use. One table seems to be well within reason, but using three tables from two states has got my fair-usage senses tingling a little. If it’s too much, I’ll chop it down.)
Florida
538.com — Polls Only — Florida — 1630 EST Nov. 6 — default view
The table shown here is what you see by default when you open up Florida. Your eye is attracted to the adjusted leader column by the bold type and colors, and going by the content of that column one might feel relatively justified in saying that Trump should be favored in the state. Indeed, on the nationwide map Florida is a very light pink and a straight average of those polls is Trump +1.1.
This table is not lying to you, but it is surely misrepresenting the state of the race. The dates of the polls are in light gray, so lean in and squint a little, and realize that they are not in chronological order. 538 sorts polling tables by the polls’ assigned weights. (If you look really close you can see that the little up arrow by weight is blacker than the other arrows.) The most heavily-weighted poll left the field on October 24. The three Trump-heavy results in the 4-6 positions all left the field no later than October 27. This is all data that is more than ten days old. Some of it is more than two weeks old. All of it is pre-Comey.
But Florida is an important state, and it gets polled far more often than that. Let’s look and see what happens when we sort by date.
538.com — Polls Only — Florida — 1630 EST Nov. 6 — sorted by date
Ah ha — all these polls have come in since November 1. The ones with very low weighting are typically those that have since been re-polled (though as you see the old data still has non-zero weighting). The projected leader column is more blue and the Trump results less daunting in size, but there are a lot of ties and Clinton +1’s, and one would expect from this dataset that the state might be a really light blue. Probably worth a percentage point or two on the top-line national numbers.
But how about that leader column, the one without the pretty colors? That’s the actual polling data, not the 538 parallel universe data. There we see just one poll where Trump leads, and even more Clinton leads than in the adjusted column. The actual straight unadjusted polling average from this data is Clinton +1.2, which probably puts her in about 65% chance-to-win territory. (I don’t have the conversion table handy.)
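For what it's worth, the back-of-the-envelope version of that conversion, reusing the same normal-CDF stand-in from the PEC sketch above with an assumed 3-point sigma on the aggregated lead, lands right around that ballpark:

```python
from math import erf, sqrt

# Rough conversion of an aggregated lead into a win probability, assuming
# normally distributed error with a 3-point sigma (my assumption, not a
# published 538 or PEC value).
def lead_to_win_prob(lead, sigma=3.0):
    return 0.5 * (1 + erf(lead / (sigma * sqrt(2))))

print(f"{lead_to_win_prob(1.2):.0%}")   # prints 66%, right around the ballpark quoted above
```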
You can argue about how 538 handles the numbers here on a few levels, but ultimately what I want to point out is that old data decays too slowly. Two-week-old polls could be useful if there were not dozens since then — indeed, PEC kicks out old polls but makes an exception when there are insufficient new polls to replace them. The old Florida polls at 538 are from pollsters who rate highly, so much so that the model treats them as more predictive (you can see just how much more in the weight column, 2.5ish versus 1.low).
North Carolina
538.com — Polls Only — North Carolina — 1630 EST Nov. 6 — default view
In North Carolina flipping between sorting modes doesn’t bring up as big a difference as in Florida, so I’ve gone with the default weight view here. Again, this is a state which shows up as light pink — 538 makes Trump a 51-49 favorite. If you look at the numbers here, that seems a little strange. The adjusted leader column has a flat average of C +.5, so probably there are some less-weighted polls which move the needle far enough to just get into Trump territory. Indeed, there is a large grouping of red in the .5 to 1.2 weight range, a lot of which are from early or middle October.
The problem is that the polls at the top of the weighting have all been adjusted pro-Trump (except for the Trafalgar Group, dunno what is with that one, and two left unchanged). In multiple cases this runs against the listed bias numbers in 538’s database. And it’s even worse with older polls in the midtable — I described all of them as red, but the majority of them started with Clinton leads and have been adjusted to Trump +3 to +5.
This is a phenomenon that you can see in all of 538’s current tables, including the Florida tables I show above. Almost nothing is adjusting pro-Clinton right now, and older polls have shifted even more Republican. Think about what that means — a measurement in mid-October which was already tweaked to fit the 538 view of the world gets retweaked later — continuously as every new poll comes in! — to say something different about how the race looked on that day. There’s something to be said for refining our view of the past in light of new data, but this is flat-out ongoing revisionism.
I suspect Nate would say that the adjusted poll isn’t truly representing how the race was on that date, but an abstract prior which informs the current averages. I can respect that, but it combines with the slow-decay issue in weighting to make the overall model oversensitive. Worse, it appears to be a phenomenon that applies to the national polls as well, creating the specter of recursion within the system.
I cannot see into 538’s black box far enough to give a certain answer, but I’ve seen this phenomenon any time the national polls start changing rapidly. It has swung in both directions — the adjusted polls at 538 just after the conventions were absurdly blue. My belief is that this is the momentum adjustment — polls from before the current trend are changed to reflect that trend, and polls during the trend also get tweaked in that direction, but not by as large a factor. The interval used to determine the trendline is supposed to be “shorter” as we approach election day, but consider the current situation: national polls started swinging against Clinton more than ten days ago (it was actually slightly pre-Comey), then in the last five or six days have leveled out (or even reversed slightly). But the momentum adjustment at 538 is still heavily pro-Trump, so the timescale for the trend measurement must be at least a week, and probably more than ten days.
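A tiny numeric illustration of that lag, with the daily national margins invented: a trailing-window trend estimate keeps pointing downward well after the series has flattened out.

```python
# Illustration of trend-window lag: a long trailing window still reports
# negative "momentum" after the race has leveled off. Margins are invented.
daily_margin = [6, 6, 6, 5, 5, 4, 3, 3, 2, 2, 2, 2, 2, 2]   # a drop, then a plateau

def trend(series, window):
    """Average per-day change over the trailing window."""
    recent = series[-window:]
    return (recent[-1] - recent[0]) / (len(recent) - 1)

print("10-day trend:", round(trend(daily_margin, 10), 2))   # still negative
print(" 5-day trend:", round(trend(daily_margin, 5), 2))    # flat
```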
What you get is a system which is hypersensitive, but cannot undo a trend quickly. I guess the model will settle sometime next weekend… which will be so, so helpful. Yes, that’s sarcasm.
Bottom line,
I think 538 is completely wrong about Nevada, Florida, and North Carolina for the reasons listed above. Ohio is closer than he has it, but I wouldn’t rate it any better than 50-50 before ground game considerations. Iowa is just gone, completely gone. Arizona looks unlikely but not impossible, though I’d have to go outside the bounds of pure polling analysis to explain why, and the same goes for Georgia.
Nate Silver’s main defense today is that pure poll aggregation is not all that accurate after all, and that the misses go in both directions. My response to that is that he’s actually simplifying too much, and conflating presidential and midterm data. 2004 aside, the modern trend has always been Dem presidential candidates overperforming polling estimates. Yes, there is some chance of systemic polling error, but the claim that there’s at least a 1 in 3 chance that there is one and it will favor Trump is badly thought out.
Please don’t say these things.
“538 is putting their hand on the scales.”
They made the system, and the system has problems. But if there had been no Comey and some nasty, nasty oppo dump had landed on Trump last weekend, those problems would be working in the other direction. In some scenarios it might track reality better than straight poll aggregation. This is just not one of them.
“538 has discounted early voting results.”
And well it should. It’s not in the model. Early voting and GOTV are not in any of the models I’ve discussed here, nor in any other model which I’ve seen on the internet. I suppose some of the non-quant political ratings might integrate that info, but probably not as a major factor. The few people discussing this really fall more into the journalist category so far. Oh, and by the way…
And one last bit of good news.
This one really made me smile this morning. Now I know we’re going to win.