EDIT: Some overly sensitive fellow Kossacks didn't like that I back-linked to my blog, where I posted this originally. In the interest of encouraging discussion rather than comments about unimportant things, I just posted the whole thing here, verbatim. May we please have a discussion now, for chrissake?
I've written a lot recently about the promise of combining the results from the different election prediction models that have cropped up over the last decade. (Here's a scroll of those articles in reverse chronological order.) One suggestion I've made is to average the results of the election prediction models. The marginoferror blog made the same suggestion, noting that the averaged aggregator performs better than any individual aggregator in their sample.
Today, I present suggestions for how to calculate averaging weights for a given prediction of the winner of the presidency in each state, and of the percent popular vote in that state. These suggestions were inspired by the reporting of Brier scores and other prediction accuracy statistics by Simon Jackman, Sam Wang, and Drew Linzer.
State-level outcomes (thus EV outcomes)
To calculate the model weight for a given model at a given point in time, start with Christopher A. T. Ferro's sample-size-adjusted Brier score (see equation 8, which depends on equation 3 and the first expression in section 2.a), comparing all observed state-level outcomes to the probabilities the aggregator has estimated, across all of the years it has made predictions, at the specified calendar distance from election day.
Ferro's adjusted Brier score is best because it accounts for the effects of sample size on the Brier score.
Next, subtract that Brier score from one, which is the highest possible value for a Brier score. The result is an absolute score that increases as the Brier score decreases. Recall that the Brier score is larger when there is greater distance between the predicted and observed values.
Next, repeat that process for every aggregator that has made predictions at that distance from election day.
Next, normalize each absolute score by the sum of all the absolute scores to give each model a relative weight.
Finally, weight each model by its relative weight when averaging.
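The steps above can be sketched in a few lines of Python. For brevity this sketch uses the ordinary Brier score rather than Ferro's sample-size adjustment, and the function names and toy numbers are mine, not from any actual aggregator:

```python
import numpy as np

def brier_score(p, y):
    """Mean squared difference between predicted win probabilities (p)
    and observed 0/1 outcomes (y); 0 is perfect, 1 is worst."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return float(np.mean((p - y) ** 2))

def model_weights(prob_forecasts, outcomes):
    """One minus each model's Brier score, normalized so the weights sum to one."""
    absolute = np.array([1.0 - brier_score(p, outcomes) for p in prob_forecasts])
    return absolute / absolute.sum()

# Toy example: two models' win probabilities for five states.
outcomes = [1, 0, 1, 1, 0]
sharp = [0.9, 0.1, 0.8, 0.9, 0.2]   # confident, well-calibrated model
vague = [0.6, 0.5, 0.5, 0.6, 0.5]   # hedging model
w = model_weights([sharp, vague], outcomes)
averaged = w[0] * np.asarray(sharp) + w[1] * np.asarray(vague)
```

In this toy example the more accurate model earns the larger weight, so the averaged forecast is pulled toward it.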
This method could easily be modified to give models weights corresponding to entire prediction histories, and/or to prediction within a given time interval at a given distance from election. It could also be extended to deal with one-off forecasts that are never updated. Because state-level outcomes largely determine the electoral vote, I propose that the same model weight calculated as above could be used when averaging electoral vote distributions.
State-level shares of popular vote
The method is identical to the one described above, except that we replace Ferro's adjusted Brier score with the sample normalized root mean squared error, which measures the average percentage-point difference between the observed and predicted popular vote outcomes. Simply calculate one minus the sample normalized root mean squared error of a given model, and divide the difference by the sum of the same quantity across all of the models. Then calculate a weighted average.
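A parallel sketch for the vote-share weights. The post doesn't pin down which normalization "sample normalized" refers to, so this sketch assumes normalization by the range of observed values, one common convention; the names and numbers are illustrative:

```python
import numpy as np

def nrmse(pred, obs):
    """Root mean squared error, normalized by the range of observed values
    so it falls roughly on a 0-to-1 scale (normalization choice assumed)."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    rmse = np.sqrt(np.mean((pred - obs) ** 2))
    return float(rmse / (obs.max() - obs.min()))

def vote_share_weights(share_forecasts, observed_shares):
    """One minus each model's normalized RMSE, normalized so the weights sum to one."""
    absolute = np.array([1.0 - nrmse(f, observed_shares) for f in share_forecasts])
    return absolute / absolute.sum()

# Toy example: two models' two-party vote-share forecasts for four states.
observed = [52.0, 48.0, 55.0, 50.0]
close = [51.0, 49.0, 54.0, 50.0]   # small errors
rough = [50.0, 50.0, 52.0, 52.0]   # larger errors
w = vote_share_weights([close, rough], observed)
```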
These methods have a lot of nice features. They yield weights that are easily interpreted. Because the weights are based on the Brier score and the root mean squared error, they can also be decomposed into meaningful components. For example, the Brier score can be decomposed to examine calibration and uncertainty effects, and the mean squared error can be decomposed into bias and variance components. The methods are also flexible enough to accommodate whatever scope of predictive performance interests researchers.
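As one concrete instance of such a decomposition, the classic Murphy decomposition splits the Brier score into reliability (calibration), resolution, and uncertainty terms by grouping observations that share the same forecast probability. This sketch is my own illustration, not code from the post:

```python
import numpy as np

def murphy_decomposition(p, y):
    """Decompose the Brier score as reliability - resolution + uncertainty,
    grouping observations by their shared forecast probability."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    n, ybar = len(y), y.mean()
    reliability = resolution = 0.0
    for v in np.unique(p):
        grp = y[p == v]                      # outcomes given forecast value v
        reliability += len(grp) / n * (v - grp.mean()) ** 2
        resolution += len(grp) / n * (grp.mean() - ybar) ** 2
    uncertainty = float(ybar * (1.0 - ybar))
    return float(reliability), float(resolution), uncertainty

# The three terms recombine exactly into the overall Brier score.
p = [0.8, 0.8, 0.2, 0.2]
y = [1, 0, 0, 0]
rel, res, unc = murphy_decomposition(p, y)
brier = float(np.mean((np.asarray(p, float) - np.asarray(y, float)) ** 2))
```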
Originally posted at Malark-O-Meter.org.