OK

This is only a Preview!

You must Publish this diary to make this visible to the public,
or click 'Edit Diary' to make further changes first.

Posting a Diary Entry

Daily Kos welcomes blog articles from readers, known as diaries. The Intro section to a diary should be about three paragraphs long, and is required. The body section is optional, as is the poll, which can have 1 to 15 choices. Descriptive tags are also required to help others find your diary by subject; please don't use "cute" tags.

When you're ready, scroll down below the tags and click Save & Preview. You can edit your diary after it's published by clicking Edit Diary. Polls cannot be edited once they are published.

If this is your first time creating a Diary since the Ajax upgrade, before you enter any text below, please press Ctrl-F5 and then hold down the Shift Key and press your browser's Reload button to refresh its cache with the new script files.

ATTENTION: READ THE RULES.

  1. One diary daily maximum.
  2. Substantive diaries only. If you don't have at least three solid, original paragraphs, you should probably post a comment in an Open Thread.
  3. No repetitive diaries. Take a moment to ensure your topic hasn't been blogged (you can search for Stories and Diaries that already cover this topic), though fresh original analysis is always welcome.
  4. Use the "Body" textbox if your diary entry is longer than three paragraphs.
  5. Any images in your posts must be hosted by an approved image hosting service (one of: imageshack.us, photobucket.com, flickr.com, smugmug.com, allyoucanupload.com, picturetrail.com, mac.com, webshots.com, editgrid.com).
  6. Copying and pasting entire copyrighted works is prohibited. If you do quote something, keep it brief, always provide a link to the original source, and use the <blockquote> tags to clearly identify the quoted material. Violating this rule is grounds for immediate banning.
  7. Be civil. Do not "call out" other users by name in diary titles. Do not use profanity in diary titles. Don't write diaries whose main purpose is to deliberately inflame.
For the complete list of DailyKos diary guidelines, please click here.

Please begin with an informative title:

This post was originally published at Malark-O-Meter, where Brash Equilibrium (aka Benjamin Chabot-Hanowell, aka me) statistically analyzes fact checker data and attempts to influence election forecasting methodology. Help me do the latter by passing this essay around to your geek friends. And while you're at it, check out my fact checking analyses of the 2012 election. More in-depth fact checker analysis is forthcoming. Okay, on with the show.

In the aftermath of the 2012 election, campaign prognosticators Nate Silver, Simon Jackman, Drew Linzer, and Sam Wang make preliminary quantitative assessments of how well their final predictions played out. Others have posted comparisons of these and other election prediction and poll aggregation outfits. Hopefully, we'll one day compare and combine the models based on their long term predictive power. To compare and combine models effectively, we need a good quantitative measure of their accuracy. The prognosticators have used a measure called the Brier score to measure the accuracy of their election eve predictions of the election outcome in each state. Despite its historical success in measuring forecast accuracy, the Brier score fails in several ways as a forecast score. I'll review its inadequacies and suggest a better method.

The Brier score measures the accuracy of binary probabilistic predictions. To calculate it, take the average, squared difference between the forecast probability of a given outcome (e.g., Obama winning the popular vote in California) and the observed probability that the event occurred (.e.g, one if the Obama won, zero if he didn't win). The higher the Brier score, the worse the predictive accuracy. As Nils Barth suggested to Sam Wang, you can also calculate a normalized Brier score by subtracting four times the Brier score from one. A normalized Brier score compares the predictive accuracy of a model to the predictive accuracy of a model that perfectly predicted the outcomes. The higher the normalized Brier score, the greater the predictive accuracy.

Because the Brier score (and its normalized cousin) measure predictive accuracy, I've suggested that we can use them to construct certainty weights for prediction models, which we could then use when calculating an average model that combines the separate models into a meta-prediction. Recently, however, I've discovered research in the weather forecasting community that suggests a better and more intuitive way to score forecast accuracy. This score has the added benefit of being directly tied to a well-studied model averaging mechanism. Before describing the new scoring method, let's describe the problems with the Brier score.

Jewson (2004) notes that the Brier score doesn't deal adequately with very improbable or probable events. For example, suppose that the probability that a Black Democrat wins Texas is 1 in 1000. Suppose we have one forecast model that predicts Obama will surely lose in Texas, whereas another model predicts that Obama's probability of winning is 1 in 400. Well, Obama lost Texas. The Brier score would tell us to prefer the model that predicted a sure loss for Obama. Yet the model that gave him a small probability of winning is closer to the "truth" in the sense that it estimates he has a small probably of winning. That seems counter-intuitive. In addition to its poor performance scoring highly improbable and probable events, the Brier score doesn't perform well when scoring very poor forecasts (Benedetti 2010; sorry for the pay wall).

These issues with the Brier score should give prognosticators pause for two reasons. First, they suggest that the Brier score will not perform well in the "safe" states of a given party. Second, they suggest that Brier scores will not perform well for models whose predictions were poor (here's lookin' at you, Bickers and Berry). So what should we do instead? It's all about the likelihood. Well, actually its logarithm.

Both Jewson and Benedetti convincingly argue that the proper score of forecast accuracy is something called the log likelihood. A likelihood is the probability of a set of observations given the model of reality that we assume produced those observations. As Jewson points out, the likelihood in our case is the probability of a set of observations (i.e., which states Obama won) given the forecasts associated with those observations (i.e., the forecast probability that Obama would win those states). A score based on the log likelihood penalizes measures that are very certain one way or the other, giving the lowest scores to models that are perfectly certain of the outcome.

To compare the accuracy of two models, simply take the difference in their log likelihood. To calculate model weights, first subtract the likelihood score of each model from the minimum likelihood score across all the models. Then exponentiate the difference you just calculated. Then divide the exponentiated difference of each model by the sum of those values across all the models. Voila. A model averaging weight.

Some problems remain. For starters, we haven't factored Occam's razor into our scoring of models. Occam's razor, of course, is the idea that simpler models are better than complex models all else equal. Some of you might notice that the model weight calculation in the previous paragraph is identical to the model weight calculation method based on the information criterion scores of models that have the same number of variables. I argue that, for now, this isn't really an issue. What we're doing is measuring a model's predictive accuracy, not its fit to previous observations. I leave it up to the first order election prognosticators to decide which parameters they include in their model. In making meta election forecasts, I'll let the models' actual predictive performance decide which ones should get more weight.

Intro

You must enter an Intro for your Diary Entry between 300 and 1150 characters long (that's approximately 50-175 words without any html or formatting markup).

Extended (Optional)

Poll

Should we penalize election forecast model weights by the number of parameters they estimate?

53%8 votes
33%5 votes
13%2 votes

| 15 votes | Vote | Results

EMAIL TO A FRIEND X
Your Email has been sent.