Statistics 101 Part 9: Multivariate techniques

by plf515

Tuesday, Jun. 20, 2006 at 6:30:06am PDT

In today's diary, I will look very briefly at a variety of multivariate techniques. I'm not going to get into the math, because the math gets a little hard, and because typing math here is something I haven't learned yet (I've gotten some hints from various people, but I've been lazy about following up).

Multivariate techniques arise when you have a lot of variables on a bunch of subjects, and want to figure out how the variables or subjects relate to one another.

If you're interested, follow me below the fold.

Some notes before we get started:

1. Thanks to dmsilev for putting my earlier diaries on the dkospedia in an article called (of all things) Statistics.

2. Please send me suggestions, questions, ideas, or whatever. These diaries are for you, the kossacks. I get a lot out of reading this site, and I'd like to pay it back. So, tell me what you want to know. If you've got a good article or topic, and it's online, you can link here and we'll talk about it in a later diary. Or put your requests in one of my Open Thread requests (I have one request in SusanG's diary rescue of last night). Could be anything.

OK, on to the main topic.

Sometimes, you have a lot of variables on a bunch of subjects. If you're interested in explaining one of them by using others, then you might want some variation of the generalized linear model. But suppose that none of the variables is singled out as the one to explain, and there are too many to look at all at once. Then you may want a multivariate technique.

For instance, you might have data on how congresscritters got rated by various groups (some of this is published in the Almanac of American Politics; some may be available online, I don't know).

Or you might have the answers to all the questions on the SAT for a bunch of HS students.

Or you might have answers to an attitude survey for people you've interviewed.

So, what can you do? And when should you do what?

If you think that the variables can be partially explained by latent factors, you might want Factor Analysis. A latent variable is something you think exists, but you can't measure it exactly. In the congress example, some possible latent factors are: Conservatism vs. liberalism; environmentalism; friendliness to business vs. consumer interest; etc. Factor analysis uses the correlations among the variables to reduce a large number of variables to a much smaller number of factors.
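
For anyone who wants to see the idea in code, here is a rough sketch in plain Python. It is not full factor analysis: it uses a simpler principal-component style extraction, pulling the first factor out of the correlation matrix by power iteration. The ratings, the group names (lib_a, con_a, and so on) and the function names are all made up for illustration.

```python
from math import sqrt

# Made-up ratings (0-100) of six hypothetical legislators by four
# hypothetical interest groups: two liberal-leaning, two conservative-leaning.
group_names = ["lib_a", "lib_b", "con_a", "con_b"]
ratings = [
    [95, 90, 10, 5],
    [85, 80, 20, 15],
    [70, 75, 30, 35],
    [30, 25, 70, 80],
    [15, 20, 85, 90],
    [5, 10, 95, 85],
]

def correlation_matrix(rows):
    """Pearson correlations between the columns of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    return [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
            for a in range(p)]

def first_factor_loadings(corr, iterations=200):
    """Loadings on the first principal factor, found by power iteration."""
    p = len(corr)
    # Start vector; it must not be orthogonal to the leading eigenvector.
    v = [1.0] + [0.0] * (p - 1)
    for step in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient gives the eigenvalue that goes with v.
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p))
                     for i in range(p))
    # A loading is the correlation between a variable and the factor.
    return [x * sqrt(eigenvalue) for x in v], eigenvalue

corr = correlation_matrix(ratings)
loadings, eigenvalue = first_factor_loadings(corr)
for name, loading in zip(group_names, loadings):
    print(f"{name}: {loading:+.2f}")
print(f"share of variance on first factor: {eigenvalue / len(corr):.2f}")
```

With data like these, the two liberal-leaning groups load in one direction and the two conservative-leaning groups in the other, which is the kind of pattern you would read as a single liberal vs. conservative factor. Real factor analysis software also estimates how much of each variable is unique noise, which this sketch skips.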

If you want, instead, to see whether there are groups of subjects who are similar on the variables, you might want Cluster analysis. This method tries to put the subjects into groups that are similar on the variables. In the congress example, that might lead to clusters of people who share voting patterns.
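
There are many clustering methods. As one illustration, here is a minimal sketch of k-means, a common one, in plain Python. The six legislators and their two scores are invented, and the deterministic way of picking starting centers (first point, then the point farthest from the centers chosen so far) is just my choice to keep the example repeatable.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iterations=100):
    """Group points into k clusters; returns (labels, centers)."""
    # Deterministic start: the first point, then repeatedly the point
    # farthest from every center chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centers)))
    labels = None
    for step in range(max_iterations):
        # Assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers

# Hypothetical legislators scored (0-100) on two made-up sets of votes.
names = ["A", "B", "C", "D", "E", "F"]
scores = [(90, 85), (80, 90), (85, 75), (15, 20), (25, 10), (10, 15)]

labels, centers = k_means(scores, k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

Note that you have to tell k-means how many clusters to look for. Choosing that number is a judgment call, and other clustering methods handle it differently.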

If you want to see how people judge the similarity of things, there's a technique called Multidimensional scaling (MDS), which starts with judgements of similarity and maps the things being judged into a few dimensions. In the congress example, you might do a survey of people and ask them to judge pairs of senators as to similarity, then put the results into MDS and see whether there were sensible dimensions. Given how little people know about Senators, it might (or might not) be better to limit the survey to some knowledgeable group; one interesting thing would be to do the survey among congresscritters themselves.
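
Here is a small sketch of classical (Torgerson) MDS in plain Python, which is one flavor of MDS among several. It assumes the dissimilarities are symmetric and behave roughly like distances. The four senators and their judged dissimilarities are invented, and I picked numbers that fit exactly on a line so the result is easy to check by eye.

```python
from math import sqrt

def classical_mds(dissim, dims=1, iterations=200):
    """Coordinates whose distances approximate the given dissimilarities."""
    n = len(dissim)
    sq = [[d * d for d in row] for row in dissim]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-center the squared dissimilarities (matrix assumed symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[] for _ in range(n)]
    for dim in range(dims):
        # Power iteration for the leading eigenvector of b.
        v = [float(i + 1) for i in range(n)]
        norm = sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for step in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # nothing left to extract
            v = [x / norm for x in w]
        eig = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                  for i in range(n))
        scale = sqrt(max(eig, 0.0))
        for i in range(n):
            coords[i].append(v[i] * scale)
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - eig * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return coords

# Made-up judged dissimilarities among four hypothetical senators.
senators = ["S1", "S2", "S3", "S4"]
dissim = [
    [0, 1, 5, 6],
    [1, 0, 4, 5],
    [5, 4, 0, 1],
    [6, 5, 1, 0],
]

coords = classical_mds(dissim, dims=1)
for name, point in zip(senators, coords):
    print(name, [round(x, 2) for x in point])
```

In this toy case S1 and S2 land near each other at one end of the line and S3 and S4 at the other. The method only gives you positions; deciding what the dimension means (left vs. right, say) is up to you. Real survey judgements will not fit this neatly, and software built for the job often uses non-metric versions of MDS that rely only on the rank order of the judgements.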

If you want to see whether the variables allow you to separate the subjects into some predefined groups, you might want Discriminant analysis, although logistic regression does very similar stuff with more lenient assumptions. In the congress example, one obvious application would be to see how well the ratings separate the people into Repub and Dem. But you might want to see if they distinguish men from women, or among regions of the country.
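
To make that concrete, here is a sketch of Fisher's linear discriminant for two groups and two variables, in plain Python. The ratings are invented (first number from a hypothetical liberal group, second from a hypothetical conservative one), the function names are mine, and putting the cutoff halfway between the two group means is a simplifying assumption that treats the groups as equally likely.

```python
def mean_vector(rows):
    """Column means of a list of (x, y) pairs."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fisher_discriminant(group0, group1):
    """Two-variable Fisher linear discriminant; returns (weights, cutoff)."""
    m0, m1 = mean_vector(group0), mean_vector(group1)
    # Pooled within-group scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for rows, m in ((group0, m0), (group1, m1)):
        for x, y in rows:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    det = sxx * syy - sxy * sxy
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # weights = inverse(scatter) * (m1 - m0), using the 2x2 inverse formula.
    weights = [(syy * diff[0] - sxy * diff[1]) / det,
               (-sxy * diff[0] + sxx * diff[1]) / det]
    # Cutoff halfway between the group means (assumes equal group sizes
    # matter equally; a real analysis might weight by prior probabilities).
    midpoint = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    cutoff = weights[0] * midpoint[0] + weights[1] * midpoint[1]
    return weights, cutoff

def classify(point, weights, cutoff):
    """Return 1 if the point falls on group1's side of the cutoff, else 0."""
    score = weights[0] * point[0] + weights[1] * point[1]
    return 1 if score > cutoff else 0

# Made-up ratings for hypothetical members of each party.
dems = [(90, 15), (80, 25), (85, 10), (70, 30)]
repubs = [(20, 85), (10, 90), (30, 70), (15, 80)]

weights, cutoff = fisher_discriminant(dems, repubs)
predicted = [classify(p, weights, cutoff) for p in dems + repubs]
actual = [0] * len(dems) + [1] * len(repubs)
correct = sum(p == a for p, a in zip(predicted, actual))
print(f"{correct} of {len(actual)} classified correctly")
```

These toy groups are far apart, so the ratings separate them perfectly. With real data you would expect some misclassified people, and you would want to check the rule on cases that were not used to build it.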

OK. Enough. Now, it's your turn.