View Diary: What's an Obama State? With February predictions. (242 comments)

  •  Overfitting (11+ / 0-)

    Hi Poblano

    I think that there is a very real danger that your model might be overfitting the data.  I don't claim to be an expert statistician, but I do work with this type of stuff fairly frequently.

    There are two significant problems with your model:

    First, you are using a lot of explanatory variables.  If the ratio of explanatory variables / data points gets too large your model will have a fantastic R Squared for those individual data points, but it won't necessarily have much predictive power for new observations.
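
    One standard way to put a number on this concern is adjusted R Squared, which discounts R Squared for the number of explanatory variables. The sketch below is only an illustration: the figures (25 data points, 2 versus 15 predictors, a raw R Squared of 0.90) are made up and are not taken from the model under discussion.

```python
def adjusted_r_squared(r_squared: float, n_points: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    degrees_of_freedom = n_points - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more data points than predictors + 1")
    return 1.0 - (1.0 - r_squared) * (n_points - 1) / degrees_of_freedom


# Hypothetical numbers: the same raw R^2 with few versus many predictors.
few = adjusted_r_squared(0.90, n_points=25, n_predictors=2)
many = adjusted_r_squared(0.90, n_points=25, n_predictors=15)
print(f"2 predictors:  adjusted R^2 = {few:.3f}")
print(f"15 predictors: adjusted R^2 = {many:.3f}")
```

    The same raw fit looks noticeably weaker once the number of predictors is a large fraction of the number of data points.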

    Second, I think that there is a very real risk of multicollinearity.  This is a pretentious way of saying that some of your independent variables are strongly related to one another.  For example, I wouldn't be surprised to discover that John Kerry Vote share and Percentage of Southern Baptists are related to one another.  Multicollinearity makes the individual coefficient estimates unstable, so here, once again, an impressive R Squared can overstate how well the model will predict new observations.
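
    A quick check for this is the correlation between two explanatory variables, and the variance inflation factor (VIF) that follows from it. This is a minimal sketch: the six pairs of values are invented for illustration and are not real vote shares or census figures, and the VIF formula shown applies only to the two-predictor case.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def vif_two_predictors(r):
    """With exactly two predictors, each one's VIF is 1 / (1 - r^2)."""
    return 1.0 / (1.0 - r ** 2)


# Invented illustrative values, NOT real data.
kerry_share = [35, 40, 45, 50, 55, 60]
baptist_pct = [30, 26, 20, 14, 9, 3]

r = pearson_r(kerry_share, baptist_pct)
vif = vif_two_predictors(r)
print(f"correlation = {r:.3f}, VIF = {vif:.1f}")
```

    A common rule of thumb treats a VIF above roughly 5 to 10 as a warning sign that the coefficient estimates for those variables are unreliable.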

    I recognize that you've already done a significant amount of work here.  Even so, I'd recommend considering a somewhat more sophisticated analysis.  You might consider using a Cross Validation technique to explore whether you are overfitting the data.  (Cross Validation is a method in which you break the data into a training set and a validation set.  You build the model using the training set and then test its accuracy using the validation set).  If you are overfitting, a technique known as Principal Component Analysis can be used to compensate.
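
    The training/validation split described above can be sketched in a few lines. This assumes a single explanatory variable and synthetic data, purely to keep the example short; the function names (`fit_line`, `holdout_validate`) and the 30% validation fraction are illustrative choices, not part of the model being discussed.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared prediction error of the fitted line on (xs, ys)."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def holdout_validate(xs, ys, validation_fraction=0.3, seed=0):
    """Fit on a random training subset, then score on the held-out rest."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(len(xs) * validation_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train_x = [xs[i] for i in train_idx]
    train_y = [ys[i] for i in train_idx]
    val_x = [xs[i] for i in val_idx]
    val_y = [ys[i] for i in val_idx]
    slope, intercept = fit_line(train_x, train_y)
    return mse(train_x, train_y, slope, intercept), mse(val_x, val_y, slope, intercept)


# Synthetic data (an assumption for the demo): a noisy straight line.
rng = random.Random(42)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.5) for x in xs]

train_mse, val_mse = holdout_validate(xs, ys)
print(f"training MSE = {train_mse:.2f}, validation MSE = {val_mse:.2f}")
```

    A validation error much larger than the training error is the usual symptom of overfitting.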

    •  Too few data points (3+ / 0-)
      Recommended by:
      Abou Ben Adhem, dharmafarmer, kyril

      for cross-validation. I was thinking of overfitting too (AI grad degree, I did this stuff with computer learning, the learning/testing split is canonical), but there are just too few data points to have a meaningful result. A better bet is simply to see whether the February primaries match the predictions IMO.

      •  overfitting still might be a problem (4+ / 0-)
        Recommended by:
        Victor, Abou Ben Adhem, kyril, zackamac

        although I agree that cross-validation isn't likely to be the answer.

        Obviously, waiting until some results from Feb come in is a great test.

        Alternatively, it is possible that some of the variables could be collapsed: for example, the ratio of Southern Baptists to African-Americans might be more informative than including both in the model. That might capture the dynamic that poblano seems to be noticing.

        Another way to do this would be to try to get the information county by county (or CD by CD). That would give the model MANY more data points, allowing for a much more robust fit.

        Maybe a collective project -- it would be a lot of work to get this sort of info for each CD for one person, but if everybody chipped in a few CDs, it would be easy.

        •  bravo! (0+ / 0-)

          I'm certainly no newbie to DK, but over the years, I've found 90% of the diaries and comments to be of turd grade.  It's the diamonds in the turds that make it worthwhile.

          This diary and comment threads like yours are the rare exception.

          Liberals drive me crazy. Unfortunately, conservatives are even worse.

          by goblue72 on Sat Feb 09, 2008 at 08:07:29 PM PST

      •  Well, the result is in... (1+ / 0-)
        Recommended by:
        zackamac

        Poblano's model is way off. You could write WA off to the momentum effect, which is hard to quantify, but under-predicting the LA win suggests that it was indeed an overfitting problem.

    •  with these few data points (1+ / 0-)
      Recommended by:
      kyril

      jackknife (or 'leave one out') is likely to be a good method of validation.
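
      A minimal sketch of leave-one-out validation, assuming a single explanatory variable: each data point is held out in turn, the line is refit on the rest, and the held-out point is predicted. The six data points are invented for illustration, and `fit_line` and `leave_one_out_errors` are hypothetical helper names.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_errors(xs, ys):
    """Prediction error for each point when the model is fit without it."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        errors.append(ys[i] - (slope * xs[i] + intercept))
    return errors


# Invented illustrative data, NOT real primary results.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

loo_errors = leave_one_out_errors(xs, ys)
loo_rmse = sqrt(sum(e ** 2 for e in loo_errors) / len(loo_errors))
print(f"leave-one-out RMSE = {loo_rmse:.3f}")
```

      Because every point serves as a test case exactly once, this makes the most of a small data set, at the cost of refitting the model once per point.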

    •  One partial test (0+ / 0-)

      I'm not a stat expert, by any means, but it strikes me that one potential extra test of this would be to apply his formulae, which were developed, apparently, by looking at state contests, to the results in individual Congressional Districts (or cross-reference both CDs and Counties as a further test) within California. There is a very wide range of demographics in California, bunched oddly, and related in complex ways to the districts.

      So poblano - if you're reading this, and you've got an extra couple of dozen hours on your hands (heh) - you could try applying it to California piecemeal to see if it holds up against a different, though not unrelated dataset.

      As for tonight, I see you missed WA and LA by a ways, but are about right so far on NE.

      Cry, the beloved country, these things are not yet at an end. - Alan Paton

      by rcbowman on Sat Feb 09, 2008 at 10:21:55 PM PST

      •  use counties (0+ / 0-)

        That's a really good idea - counties would be a better geographical unit to use because they are more likely to be internally homogeneous and externally heterogeneous.  Also they are unrelated to party affiliation, which CDs are strongly related to.

    •  I kept reading down looking for OVERFITTING (0+ / 0-)

      A little knowledge of statistics is REALLY dangerous.

      This model sounds wonderful, but we're already seeing it fall apart out of sample.

      Still, I applaud the effort. And I think there is something useful in trying to go beyond the usual talking head bloviating using some real data analysis.
