Statistics 101 Part 2

by plf515

Sunday, Jun. 11, 2006 at 3:22:02am PDT

Yesterday, in this diary
[http://www.dailykos.com/...] we looked at measures of central tendency. Today, we will look at measures of spread, and tomorrow, measures of shape. One way of looking at a measure of central tendency is as your best guess of what something will be. Measures of spread tell you how good that guess is, and measures of shape tell you how you are likely to be wrong.

There's more, after the fold.

Statistics can be divided into two big areas: descriptive statistics and inferential statistics. Descriptive statistics is about describing data, and inferential statistics is about making inferences from a sample to a population. Suppose, for instance, you were interested in the average income of adults in the USA. You can't get the information on the whole population, so you take a sample. (We'll get into ways to do this in a later diary.) When you try to say things about the whole population based on your sample, that's inferential statistics. When you are just talking about your sample, that's descriptive statistics.

Sometimes, though, you do have the whole population. If you wanted to find the average SAT score in a class of students, you could ask everyone. Then you don't need to infer anything.

(By the way, don't get used to these terms being sensible. Statisticians often use familiar words in unfamiliar ways; in particular, when statisticians use the words significance, power, random, and confidence, they don't mean exactly what those words mean in everyday discourse. Don't blame me; I didn't make up the terms.)

OK, enough background. Let's say you've collected the data on whatever it is you are interested in. There are often several things you are interested in. You are interested in what a typical person is like, and for this, the measures of central tendency are good. You can think of this as ways to formalize the idea of a best guess. But you are also interested in how good that guess is. For that, you need a measure of spread. There are several popular ones. By far the most common is the standard deviation. Others are the variance, range, and the interquartile range.

The standard deviation of a sample is found by
1) Finding the mean
2) Subtracting each value in your sample from the mean
3) Squaring each of these
4) Adding up the results of step 3
5) Dividing by n, the number of values (many textbooks and software packages divide by n - 1 instead when working with a sample)
6) Taking the square root of step 5
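The six steps above can be sketched in Python. This is only an illustration: the function name standard_deviation and the scores list are made up for the example, and the function divides by n as in step 5. Python's own statistics.pstdev divides by n as well, while statistics.stdev divides by n - 1, so the last line will come out a little larger.

```python
import math
import statistics


def standard_deviation(values):
    """Follow the six steps from the text, dividing by n in step 5."""
    n = len(values)
    mean = sum(values) / n                      # step 1: find the mean
    deviations = [mean - x for x in values]     # step 2: subtract each value from the mean
    squared = [d ** 2 for d in deviations]      # step 3: square each of these
    total = sum(squared)                        # step 4: add up the results of step 3
    variance = total / n                        # step 5: divide by n
    return math.sqrt(variance)                  # step 6: take the square root


# Made-up numbers, chosen only so the arithmetic is easy to follow by hand.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

sd = standard_deviation(scores)
print("step-by-step SD:", sd)
print("statistics.pstdev (divides by n):", statistics.pstdev(scores))
print("statistics.stdev (divides by n - 1):", statistics.stdev(scores))
```

Stopping at the line marked step 5 and returning variance instead gives the variance.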

(As an aside, is there a way to type formulas here?)

For the variance, just leave out step 6 (taking the square root).

The range is just the lowest value to the highest (it's usually given as both numbers). The interquartile range requires first dividing the data into quartiles, which essentially means putting the values in order and finding the three cut points that split them into four equal-sized groups: the first quartile, the second quartile (which is the same as the median), and the third quartile. The interquartile range is the range from the first quartile to the third (if you remember percentiles, then the first quartile is the same as the 25th percentile and the third quartile is the 75th percentile).
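Here is a small Python sketch of the range and the interquartile range. The numbers are made up, and the choice of statistics.quantiles with method="inclusive" (available in Python 3.8 and later) is an assumption on my part: there are several conventions for computing quartiles, and they can give slightly different answers on small data sets.

```python
import statistics

# Made-up values, deliberately out of order.
values = [7, 2, 15, 4, 10, 4, 12, 5, 9]

# Put the values in order first.
ordered = sorted(values)

# The range, reported as both numbers: lowest and highest.
data_range = (ordered[0], ordered[-1])

# The three quartile cut points; the middle one is the median.
q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")

# The interquartile range, given here as the distance from Q1 to Q3.
iqr = q3 - q1

print("range:", data_range)
print("quartiles:", q1, median, q3)
print("IQR:", iqr)
```

The IQR is printed here as a single number (Q3 minus Q1); reporting the two quartiles themselves, as with the range, works just as well.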

Enough math. Those who want more formal definitions and examples can, of course, see Wikipedia or some such.

When is each of these good? Or bad?

Well, the standard deviation is usually good for the cases where the mean is a good measure of central tendency (see yesterday's diary). The variance is not used much in everyday reporting; it's mostly used for further statistical work. The range is almost always useful and easy to interpret. The interquartile range ought to be used a lot more because, once you understand it, it's easy to interpret, and it gives a good sense of the spread.

OK, my kid is demanding to get on this computer, so I will stop here. After all, I can always edit later!

Update [2006-6-12 8:16:7 by plf515]: I've had requests for examples of when SD is better, and when the IQR or range is better. Briefly, if you think the mean is a good measure of central tendency, then usually the SD is a good measure of spread. If you use the median, then you often want the IQR and range in addition to (or even instead of) the SD. And, if there is no good measure of central tendency, there is likely to be no good measure of spread.

Some concrete examples: If you wanted to know the average IQ of Kossacks, then (presuming you could get a good sample, which I will talk about in another diary) the mean would be a good measure of central tendency, and the SD a good measure of spread. IQ is normally distributed (we'll get to that in another diary, too) (actually, there is evidence that IQ isn't *exactly* normally distributed, but it's close).

OTOH, if you wanted to know about the income of Kossacks, then the median would be a good measure of central tendency, and, while the SD wouldn't exactly be WRONG, I would want to look at IQR and range as well.

Finally, if you wanted to look at the heights and weights of professional athletes (as a whole group) then *no* measure of CT would be really good, nor would any measure of spread, because the group is composed of people who are too different from one another.
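
To see why I would want the IQR alongside the SD for something like income, here is a sketch with invented numbers (these are not real Kossack incomes, just nine hypothetical values in thousands of dollars, one of them very large). The quartile method is again an assumption; other conventions would give slightly different cut points.

```python
import statistics

# Hypothetical incomes in thousands of dollars; the last value is one very high earner.
incomes = [30, 35, 40, 45, 50, 55, 60, 65, 1000]

mean = statistics.mean(incomes)
median = statistics.median(incomes)

# SD computed by dividing by n, as in the steps given earlier.
sd = statistics.pstdev(incomes)

# Quartiles and interquartile range (Q3 minus Q1).
q1, _, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

print("mean:", mean, "median:", median)
print("SD:", sd)
print("IQR:", iqr, "range:", (min(incomes), max(incomes)))
```

The one large value drags the mean and the SD far away from where most of the numbers sit, while the median and IQR barely notice it. The range, for its part, shows at a glance that the large value is there.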