This is a request for comments from any statisticians around here. Others please ignore, unless you want to gawk at geeks. However, since diaries scroll by so fast these days, it would be nice if you'd call it to the attention of anyone you know who might be interested.
I've been looking for a good Bayesian opinion poll aggregation model for a while, without success. If you know of one, let me know (and all the following discussion is then moot).
All the aggregations I've seen are simple moving averages, or simple least-squares fits of linear or polynomial trend lines.
My first pass at a model from looking at the data would be:
Variables:
X_t : variable to estimate (e.g. Bush's approval-disapproval spread), on day t
D_t : drift term, or trend
Dynamics:
X_{t+1} = X_t + D_t + V_t
D_{t+1} = \alpha D_t + W_t
where:
V_t and W_t are each i.i.d., with distributions to be estimated from the data
\alpha accounts for mean reversion (to zero) of the drift, also to be estimated
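To make the dynamics concrete, here's a quick simulation sketch in Python/NumPy (all parameter values below are placeholders for illustration, not estimates):

    import numpy as np

    rng = np.random.default_rng(0)
    T = 365                        # days to simulate
    alpha = 0.9                    # drift mean reversion (placeholder)
    sigma_v, sigma_w = 0.5, 0.05   # noise scales (placeholders; the real model
                                   # would estimate these, maybe fat-tailed)

    X = np.zeros(T)                # quantity of interest (e.g. approval spread)
    D = np.zeros(T)                # drift
    for t in range(T - 1):
        X[t + 1] = X[t] + D[t] + sigma_v * rng.standard_normal()
        D[t + 1] = alpha * D[t] + sigma_w * rng.standard_normal()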
Observation:
Y^j_k = (1/N^j_k) \sum_{i=0}^{N^j_k-1} X_{t^j_k - i} + C^j + E^j_k
where:
j is the polling outfit, doing its k-th poll; t^j_k is the last day that poll was in the field, and N^j_k the number of days it ran
C^j is the bias associated with pollster j's methodology and question phrasing
E^j_k is the poll's sampling error (the distribution of which could be made a function of that particular poll's sample size, although this is probably unnecessary).
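In code, the observation equation is just an average of X over the days the poll was in the field, plus the house effect and the sampling error. A sketch continuing the simulation above (the function and its parameter names are hypothetical, and I'm assuming Gaussian E for simplicity):

    import numpy as np

    def simulate_poll(X, last_day, n_days, house_bias, sigma_e, rng):
        # Y^j_k = average of X over the field period + C^j + E^j_k
        field = X[last_day - n_days + 1 : last_day + 1]
        return field.mean() + house_bias + sigma_e * rng.standard_normal()

    # e.g. a 3-day poll ending on day 100 from a house with a +1.5 bias:
    # y = simulate_poll(X, last_day=100, n_days=3, house_bias=1.5,
    #                   sigma_e=2.0, rng=rng)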
Given prior distributions on all the random variables above and the observations Y, we can then compute posterior distributions for X_t at present and future times (like this).
V_t probably needs to be modeled with a fat-tailed distribution to account for jumps. Or maybe we need to introduce an exogenous indicator variable I_t for major events (e.g. terrorist attack, starting a war...) The model would then be:
X_{t+1} = X_t + D_t + I_t V_t
This, of course, means the model has no predictive power regarding such events (as no model will), but allows us to better fit the model to historical data and therefore get better predictions when there are no such disruptions.
I'm planning to implement this model and validate it on historical data. I'll be coding this in Matlab, probably with Gibbs sampling if I can work out how to sample from all the conditionals (and if the number of variables doesn't bog it down -- though I do have access to pretty good number crunching hardware).
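For what it's worth, conditional on \alpha, the noise scales, and the house biases C^j, the model is linear-Gaussian, so the conditional posterior over the X_t's comes from a Kalman filter (and can be drawn exactly inside a Gibbs sweep via forward-filtering backward-sampling). A NumPy sketch of the filtering step, under the simplifying assumption that each poll is attributed to a single day rather than averaged over its field period (all names here are mine, for illustration):

    import numpy as np

    def kalman_filter(polls, T, alpha, sigma_v, sigma_w):
        # polls: list of (day, debiased value Y^j_k - C^j, sigma_e) triples
        F = np.array([[1.0, 1.0], [0.0, alpha]])   # transition for [X, D]
        Q = np.diag([sigma_v ** 2, sigma_w ** 2])
        H = np.array([[1.0, 0.0]])                 # polls observe X only
        m, P = np.zeros(2), 100.0 * np.eye(2)      # vague initial prior
        by_day = {}
        for day, y, sig in polls:
            by_day.setdefault(day, []).append((y, sig))
        means, variances = [], []
        for t in range(T):
            m, P = F @ m, F @ P @ F.T + Q          # predict one day ahead
            for y, sig in by_day.get(t, []):       # condition on today's polls
                S = (H @ P @ H.T)[0, 0] + sig ** 2 # innovation variance
                K = (P @ H.T)[:, 0] / S            # Kalman gain
                m = m + K * (y - (H @ m)[0])
                P = P - np.outer(K, (H @ P)[0])
            means.append(m[0])                     # E[X_t | polls so far]
            variances.append(P[0, 0])
        return np.array(means), np.array(variances)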
If anyone wants to volunteer to do the data prep (i.e. getting all the data from PollKatz, PollingReport, etc. and putting it in the right format) it would be much appreciated.
Another possible model variation would allow for X_t itself to be mean-reverting (to some to-be-estimated value). My guess is that this will not validate very well, but may be worth a shot.
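Concretely, with a long-run level \mu and a reversion rate \beta (both to be estimated), the first dynamics equation would become something like:
X_{t+1} = \mu + \beta (X_t - \mu) + D_t + V_t
with \beta = 1 recovering the random walk above.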
Why do this now?
- Because working with the current data will put a smile on your face, and make getting the work done that much easier.
- Because having a model developed, implemented and tested by the time the elections roll around could be very useful.
The next step will be to split the results on a state-by-state basis. The Bayesian model would then include estimates of state-to-state correlations. Here the benefits of having a good model would actually be substantial (as opposed to just enlightening the public debate about new poll numbers): when just one state or a subset of states is polled, you'd know exactly how to revise the estimates for nearby or sociologically similar states that were not polled.
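To illustrate the payoff: if the current belief over the vector of state spreads is (approximately) jointly Gaussian with an estimated covariance matrix, then conditioning on a single state's poll updates every correlated state in closed form. A minimal sketch, with hypothetical names:

    import numpy as np

    def update_on_state_poll(mu, Sigma, i, y, sigma_e):
        # Belief over state spreads: N(mu, Sigma). Observe poll y of
        # state i with sampling error sd sigma_e. Standard Gaussian
        # conditioning -- every state correlated with i moves too.
        s = Sigma[:, i]                           # covariances with state i
        gain = s / (Sigma[i, i] + sigma_e ** 2)   # Kalman gain vector
        mu_new = mu + gain * (y - mu[i])
        Sigma_new = Sigma - np.outer(gain, s)
        return mu_new, Sigma_new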
If you don't want to post your comments for everyone to see, please email me (see my profile).
CategoryPoll