[Chart: The Upshot's rankings of the modelers, using logarithmic scores (higher is better)]
Amidst the array of mostly bad results from Tuesday's election, we did find one group of Democrats who won ... us! A pair of our fellow modelers calculated something called Brier scores, and both found that the Daily Kos Election Outlook did the "best" of the lot, at least as far as the Senate models are concerned. The New York Times used a similar metric called logarithmic scores (shown in the chart above) and also found Daily Kos at the top of the heap. So yes, now that the election is over and there are no more polls to aggregate, we've moved on to aggregating each other's models! (The scores assume that the results in Alaska and Virginia hold.)
| Model | WaPo's calculations | Model | Sam Wang's calculations |
| --- | --- | --- | --- |
| Daily Kos | 0.024 | Daily Kos | 0.10 |
| WaPo | 0.027 | WaPo | 0.12 |
| FiveThirtyEight | 0.032 | FiveThirtyEight | 0.14 |
| PredictWise | 0.032 | Betfair | 0.14 |
| HuffPo | 0.034 | HuffPo | 0.14 |
| Upshot | 0.035 | Upshot | 0.15 |
| Princeton | 0.043 | Princeton | 0.18 |
If you're wondering what a Brier score is, it's a measure of how accurate probabilistic predictions are. The best possible Brier score is 0; the worst possible score is 1. For instance, if you predicted that it was going to rain tomorrow, with 100 percent certainty, and you were right, you'd get a score of 0. If you predict it'll rain with 100 percent certainty and you're wrong, you get a score of 1. If you predict there's a 50 percent chance of rain, you get a score of 0.25 either way. If you make a whole bunch of predictions—as all the modelers did, over the course of 36 different Senate races—then you have to average out all those scores for one cumulative score.
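To make the arithmetic concrete, here's a minimal sketch of how a cumulative Brier score can be computed; the race probabilities and outcomes in it are invented for illustration, not any modeler's actual final numbers.

```python
# Minimal sketch: cumulative Brier score over a set of binary race forecasts.
# The forecasts below are invented for illustration only.

def brier_score(prob_dem_win, dem_won):
    """Squared error between the forecast probability and the 0/1 outcome."""
    outcome = 1.0 if dem_won else 0.0
    return (prob_dem_win - outcome) ** 2

# (probability assigned to a Democratic win, whether the Democrat actually won)
forecasts = [
    (0.56, False),  # a Hagan-style miss: 56 percent confidence, candidate lost
    (0.07, False),  # a confident, correct call
    (0.98, True),   # another confident, correct call
]

# Average the per-race scores into one cumulative score (lower is better).
cumulative = sum(brier_score(p, won) for p, won in forecasts) / len(forecasts)
print(round(cumulative, 3))  # 0.106
```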
If you look at a handy chart of all the different models' predictions (the New York Times one is perhaps the easiest to navigate), though, you'll notice that all the models essentially predicted the same thing. Most models—including ours—correctly predicted all but one race, the North Carolina race, where everyone predicted Kay Hagan would win. Several models also gave a narrow edge to Greg Orman in the Kansas race, for two misses.
So where it matters, in terms of the hair-splitting over which model "wins," is the degree of confidence each aggregator had in each result. If you looked only at the overall topline—whether or not the Republicans would take over the Senate—the Washington Post's Election Lab model was the most accurate, by being the most bearish on the Democrats: it gave the Republicans a 98 percent chance in its final forecast. We were the second most bearish of the models, at "only" a 90 percent overall chance.
You have to delve into the confidence level of each individual race, though, to take a fuller measure of each model. What probably clinched the overall best Brier score for us was our lack of confidence in a Hagan victory. We gave her only a 56 percent chance of winning, while WaPo gave her a 76 percent chance. (It was very subtle, but the last few days of polls in North Carolina definitely showed some erosion, usually running at a tie or a 1-point lead, instead of the 2- or 3-point leads we'd been seeing.)
The other race where we broke from the pack a little, and it paid off, was Georgia. We gave Michelle Nunn only a 7 percent chance of winning, while the Post, FiveThirtyEight, and the others all gave her at least a 20 percent chance. Our approach to Georgia definitely drew some skepticism. We figured Nunn only had so-so chances of even forcing a January runoff (Georgia has a unique system of requiring general election runoffs when no candidate tops 50 percent), given her weak polling. On top of that, given the Democrats' terrible history in Georgia runoffs, we gave her no chance of winning a runoff if she didn't win outright in November. In essence, we forced her to jump through two different hoops; on Election Day, though, she didn't even make it through the first hoop, losing without a runoff.
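For the arithmetically inclined, here's a back-of-the-envelope version of that two-hoop logic; aside from the roughly 7 percent outright-win figure, the runoff numbers below are illustrative assumptions, not the model's actual internal values.

```python
# Back-of-the-envelope sketch of the Georgia "two hoops" calculation.
# Only the ~7 percent outright-win figure comes from the post; the rest
# is illustrative.

p_win_outright = 0.07  # chance of clearing 50 percent in November
p_force_runoff = 0.40  # illustrative "so-so" chance of forcing a January runoff
p_win_runoff = 0.00    # Democrats' dismal Georgia runoff history, treated as zero

# Nunn could win the seat either outright or via the runoff path.
p_win_seat = p_win_outright + p_force_runoff * p_win_runoff
print(p_win_seat)  # 0.07 -- the runoff path contributes nothing
```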
We'll talk more about all of these evaluations, as well as the gubernatorial races, over the fold.
Really, though, there's no way to make a full determination of which model was "best," unless you want some sort of sadistic Groundhog Day-style scenario where you actually run the 2014 election over and over again thousands of times and tally up the number of wins in each race amidst the hellish repetitions to see if they match people's predictions. One thing that we might take away from the Brier scores, though, is that a purely Bayesian approach like ours can work just as well as, if not better than, a more fundamentals-based model.
The FiveThirtyEight, Washington Post, and New York Times models all relied on fundamentals to a certain extent, using factors like generic ballot polling, candidate fundraising, the states' historical leans, and simply the historical pattern of what happens to the presidential party in midterms. As we were fond of saying, we, by contrast, had an all-meat, no-special-sauce model, mostly under the assumption that all those "fundamentals" would gradually get baked into the polls anyway. The fundamentals-based models, however, eventually dialed down the impact of the fundies as Election Day approached, so in the final days they too were driven by the same polls as we were (which explains the similarities in the results).
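To illustrate what "dialing down" the fundamentals might look like, here's a minimal sketch assuming a simple linear decay of the fundamentals weight as Election Day approaches; the weighting scheme and numbers are our own illustration, not any of these models' actual formulas.

```python
# Minimal sketch of a polls-plus-fundamentals blend with a fundamentals
# weight that decays linearly to zero by Election Day. Purely illustrative.

def blended_dem_share(poll_avg, fundamentals, days_until_election,
                      full_weight_days=120):
    """Blend a polling average with a fundamentals-based estimate.

    Far from the election the fundamentals dominate; by Election Day the
    forecast is driven entirely by the polls.
    """
    w = min(max(days_until_election / full_weight_days, 0.0), 1.0)
    return w * fundamentals + (1.0 - w) * poll_avg

print(blended_dem_share(48.0, 45.0, days_until_election=120))  # 45.0
print(blended_dem_share(48.0, 45.0, days_until_election=5))    # ~47.9
```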
One thing that the fundamentals-based approach does have to recommend it is stability; that was especially evident with FiveThirtyEight's model, which stayed very consistently in the range of 60 percent Republican odds almost the whole way through, until starting to break further in the Republicans' direction in the last few weeks. By contrast, our model started as one of the most optimistic models for the Democrats, giving them better than 50 percent odds of holding the Senate on occasion in August and early September; once the polls really started to go south in late September, though, we caught up to and then went past the fundamentals-based models. (Which isn't to say that our model was volatile and wandering all over the place; it stayed parked at a median 48 Democratic seats for nearly the entire month of October, until dropping to 47 at the last minute. The actual outcome was 46.)
That suggests that the Democrats were overperforming the fundamentals for much of the year (or maybe it was just Mark Begich who was overperforming the fundamentals, as his strong polling numbers were buoying overall Democratic odds at first). But by the very end, they may have been underperforming the fundamentals, as is often the case with wave-type elections, where the side that's already losing sees things go from bad to worse on Election Day as most of the close races tend to break against them.
Nobody calculated Brier scores for predictions on the gubernatorial races, in part because not all of the models considered gubernatorial races at all. We were the only model to measure gubernatorial races from the start, and the only one to assign an overall percentage to Democratic gubernatorial odds (though simply on the question of whether the Democrats would gain seats; unlike the Senate, nobody really cares about the issue of whether you have a majority of statehouses).
FiveThirtyEight and Sam Wang started looking at the governors later in the game; needless to say, neither they nor we saw much likelihood of the Republicans winning the Maryland race. (We gave Democratic candidate Anthony Brown a 93 percent chance of victory, and FiveThirtyEight gave him a 94 percent chance; obviously, they and we both saw the very few, and ultimately wrong, polls of this race.)
The other gubernatorial races that we got "wrong" were Illinois (where we gave the Dems a 61 percent chance), Kansas (60 percent), Maine (52 percent), and, on the positive side, Colorado (where we gave John Hickenlooper only a 45 percent chance of surviving). By contrast, FiveThirtyEight gave the Dems 66 percent odds in Illinois, 82 percent odds in Kansas, and 57 percent odds in Maine; they did correctly pick Colorado, though, giving Hickenlooper 57 percent odds.
The similarity of these numbers, and the similarity of the overall Brier scores for all the different modelers, should make it very clear that we're all seeing the same polls, and that all of the modelers' successes and failures are entirely predicated on the pollsters getting it right in the first place. When a race is only sporadically polled, and polled incorrectly (or maybe polled inadequately at the end ... it's possible Larry Hogan didn't pull into a lead until very late in the game), as we saw in Maryland, or in North Dakota's Senate race in 2012, even the best-designed model isn't going to get it right. Garbage in, garbage out. Which reminds me: a big thanks to all the pollsters out there, who are the ones really busting their butts, and who usually get things right.
In the coming weeks, we'll continue to evaluate Mary Landrieu's odds in the Louisiana runoff, and you may hear more from us as we reverse-engineer the model and look at what we might have done differently to get even better results. And when I say "we," I mean it's truly a team effort from all of us at Daily Kos Elections. Thanks to Drew Linzer, who did all of the programming work and came up with most of the key assumptions that the model is based on, and also to Steve Singiser, who did all the unglamorous work of keeping the massive Daily Kos Elections polling database (which the model draws on) up to date.
Tue Nov 11, 2014 at 3:12 PM PT: We started wondering whether our victory would also apply to our model for gubernatorial races, even though there's a limited data set there (only Huffington Post and FiveThirtyEight bothered to calculate individual odds in these races, which remain unsexy compared to the Senate despite the fact that the state houses, not the completely gridlocked Congress, are where any action is going to occur in the next couple of years).
It turns out that we were also the most correct of the three prognosticators who looked at gubernatorial races. The difference is paper-thin, 0.08 versus 0.09, but Election Outlook again finished a nose ahead of the pack.