Yesterday, in this diary [http://www.dailykos.com/...], we looked at measures of central tendency.  Today, we will look at measures of spread, and tomorrow, measures of shape.  One way of looking at a measure of central tendency is as your best guess of what something will be.  Measures of spread tell you how good that guess is, and measures of shape tell you how you are likely to be wrong.

There's more, after the fold.

Statistics can be divided into two big areas: Descriptive statistics and inferential statistics.  Descriptive statistics is about describing data, and inferential statistics is about making inferences from a sample to a population.  Suppose, for instance, you were interested in the average income of adults in the USA.  You can't get the information on the whole population, so you take a sample.  (We'll get into ways to do this in a later diary).  When you try to say things about the whole population based on your sample, that's inferential statistics. When you are just talking about your sample, that's descriptive statistics.

Sometimes, though, you do have the whole population.  If you wanted to find the average SAT score in a class of students, you could ask everyone.  Then you don't need to infer anything.

(By the way, don't get used to these terms being sensible.  Statisticians often use familiar words in unfamiliar ways; in particular, when statisticians use the words significance, power, random, and confidence, they don't mean exactly what those words mean in everyday discourse.  Don't blame me, I didn't make up the terms.)

OK, enough background.  Let's say you've collected the data on whatever it is you are interested in.  There are often several things you want to know.  You are interested in what a typical person is like, and for this, the measures of central tendency are good.  You can think of these as ways to formalize the idea of a best guess.  But you are also interested in how good that guess is.  For that, you need a measure of spread.  There are several popular ones.  By far the most common is the standard deviation.  Others are the variance, the range, and the interquartile range.

The standard deviation of a sample  is gotten by
1) Finding the mean
2) Subtracting each value in your sample from the mean
3) Squaring each of these
4) Adding the result of step 3
5) Dividing by n (many textbooks and software packages divide by n - 1 instead when the data are a sample)
6) Taking the square root of step 5
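
For those who like to see it run, here is a minimal Python sketch of the six steps above. The scores are made up for illustration and the function name `sd_by_steps` is my own. Because the steps divide by n, the result should match Python's `statistics.pstdev`; `statistics.stdev` divides by n - 1 and gives a slightly larger number.

```python
import math
import statistics


def sd_by_steps(values):
    """Standard deviation following the six steps, dividing by n."""
    n = len(values)
    mean = sum(values) / n                    # step 1: find the mean
    deviations = [mean - x for x in values]   # step 2: subtract each value from the mean
    squares = [d ** 2 for d in deviations]    # step 3: square each difference
    total = sum(squares)                      # step 4: add up the squares
    variance = total / n                      # step 5: divide by n (this value is the variance)
    return math.sqrt(variance)                # step 6: take the square root


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean is 5
sd = sd_by_steps(scores)
print("SD by hand:       ", sd)
print("statistics.pstdev:", statistics.pstdev(scores))
print("statistics.stdev: ", statistics.stdev(scores))  # n - 1 version
```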

(As an aside, is there a way to type formulas here?)

For the variance, just leave out step 6 (the square root).

The range is just the lowest value to the highest (it's usually given as both numbers).  The interquartile range requires first dividing the data into quartiles, which essentially means putting the values in order and finding the three cut points that split them into four equal-sized groups: the first quartile, the second (which is the same as the median), and the third.  The interquartile range is the range from the first quartile to the third (if you remember percentiles, the first quartile is the 25th percentile and the third quartile is the 75th).
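
A short Python sketch of the range and the interquartile range, using invented income figures. The quartiles come from the standard library's `statistics.quantiles`; keep in mind that different programs use slightly different rules for picking quartiles, so other software may give slightly different cut points on the same data.

```python
import statistics

# Hypothetical incomes in thousands of dollars (made up for illustration).
incomes = [18, 22, 25, 29, 31, 35, 40, 46, 55, 70, 400]

# The range is usually reported as both numbers.
lowest, highest = min(incomes), max(incomes)

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1

print("Range:", lowest, "to", highest)
print("Quartiles:", q1, q2, q3)
print("Interquartile range:", q1, "to", q3, "(width", iqr, ")")
```

Notice that the one very large income stretches the range a great deal but does not touch the interquartile range.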

Enough math.  Those who want more formal definitions and examples can, of course, see Wikipedia or some such.

When is each of these good? Or bad?

Well, the standard deviation is usually good for the cases where the mean is a good measure of central tendency (see yesterday's diary).  The variance is not used much in everyday reporting; it's mostly used for further statistical work.  The range is almost always useful and easy to interpret.  The interquartile range ought to be used a lot more because, once you understand it, it's easy to interpret, and it gives a good sense of the spread.

OK, my kid is demanding to get on this computer, so I will stop here.  After all, I can always edit later!

Update [2006-6-12 8:16:7 by plf515]: I've had requests for examples of when the SD is better, and when the IQR or range is better. Briefly, if you think the mean is a good measure of central tendency, then usually the SD is a good measure of spread. If you use the median, then you often want the IQR and range in addition to (or even instead of) the SD. And if there is no good measure of central tendency, there is likely to be no good measure of spread.

Some concrete examples: If you wanted to know the average IQ of Kossacks, then (presuming you could get a good sample, which I will talk about in another diary) the mean would be a good measure of central tendency, and the SD a good measure of spread. IQ is normally distributed (we'll get to that in another diary, too; actually, there is evidence that IQ isn't *exactly* normally distributed, but it's close). OTOH, if you wanted to know about the income of Kossacks, then the median would be a good measure of central tendency, and, while the SD wouldn't exactly be WRONG, I would want to look at the IQR and range as well. Finally, if you wanted to look at the heights and weights of professional athletes (as a whole group), then *no* measure of central tendency would be really good, nor would any measure of spread, because the group is composed of people who are too different from one another.
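
To make the IQ versus income contrast concrete, here is a rough Python sketch with two invented data sets: one roughly symmetric (IQ-like) and one with a single very large value (income-like). All the numbers and the helper name `summarize` are my own, chosen only for illustration.

```python
import statistics


def summarize(values):
    """Return the mean, SD (dividing by n), median, and IQR width."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.pstdev(values),
        "median": median,
        "iqr": q3 - q1,
    }


# Hypothetical, roughly symmetric scores.
iq_like = [70, 85, 90, 95, 100, 100, 100, 105, 110, 115, 130]
# Hypothetical incomes in thousands, with one very high earner.
income_like = [18, 22, 25, 29, 31, 35, 40, 46, 55, 70, 400]

iq_summary = summarize(iq_like)
income_summary = summarize(income_like)

for name, summary in (("IQ-like", iq_summary), ("income-like", income_summary)):
    print(name, {key: round(value, 1) for key, value in summary.items()})
```

In the symmetric set the mean and median agree and the SD and IQR tell a similar story. In the skewed set the one big income drags the mean well above the median and inflates the SD, while the median and IQR still describe what most of the group looks like.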

• ##### Questions? Comments? Tips? Flames?(14+ / 0-)

looks of total confusion?

Let me know if all this makes sense.

Whenever we take away the liberties of those whom we hate we are opening the way to loss of liberty for those we love. -- Wendell Willkie

But I don't see much that is useful here.  Your important points seem to be that data reduction and inference are the two big uses of statistics.  You probably want to say more about these, give a few examples, and describe the solutions.

The formula for variance doesn't make sense without the context of probability spaces and measurable functions, which you definitely don't have here.

Fake Canadians are total hosers.

candidate? Do you see common errors that Democrats are making in this area (local polling)?  How would a local candidate running on a small budget get good polling data on a large neighborhood? How many people should they ask in a neighborhood of, say, 500 people to see if it's blue or red?

What about on economic data? The economy often gets reported as being strong, but people rate it as being poor (I know part of this is the fact that corporate profits are making up a larger share of the economy but..)

-1.63/ -1.49 "Speaking truth to power"

• ##### Good questions(0+ / 0-)

but they are for a later diary in this series.

The questions about polling relate to standard error, sampling, and so on.  I will cover those soon.  Briefly, the best thing is to try to get as random a sample as possible.  There are various methods for doing this, such as random digit dialing, but this is not an area I am expert in.  Perhaps someone else here can help out?  But getting a random sample is going to matter more than the sample size.  You can get pretty accurate numbers even with a small sample; but if it's not a random sample of voters, then it will be precisely wrong.  One famous statistician (John Tukey) is reputed to have said

An approximate answer to the right question is more valuable than an exact answer to the wrong question
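
To give a rough feel for "pretty accurate numbers even with a small sample," here is a hedged Python sketch of the usual margin-of-error approximation for a yes/no question, applied to a neighborhood of 500 people. It assumes a simple random sample, a 95% level (z = 1.96), and the worst case of an even split (p = 0.5); the function name and the sample sizes are my own choices for illustration.

```python
import math


def margin_of_error(sample_size, population_size, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    # Finite population correction: it matters when the sample is a
    # large share of a small population, as in a 500-person neighborhood.
    correction = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * correction


NEIGHBORHOOD = 500  # population size taken from the question above

for n in (25, 50, 100, 200):
    moe = margin_of_error(n, NEIGHBORHOOD)
    print(f"ask {n:3d} people: about +/- {100 * moe:.1f} percentage points")
```

This only measures the error that comes from random sampling. If the people asked are not a random sample, the real error can be much larger no matter what the formula says.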

The questions about the economy are not really statistical, but substantive. I am not an economics expert (not even close).  It's a question of what should be reported.

Whenever we take away the liberties of those whom we hate we are opening the way to loss of liberty for those we love. -- Wendell Willkie

It goes to the edit page, to which only you have access.

Here is the link you need:

http://www.dailykos.com/...

Men never do evil so completely and cheerfully as when they do it from a religious conviction. - Pascal

• ##### Fixed, I hope(0+ / 0-)

Thanks for pointing that out

Whenever we take away the liberties of those whom we hate we are opening the way to loss of liberty for those we love. -- Wendell Willkie

• ##### Still not fixed(0+ / 0-)

you need 'story' where you have 'diary_edit'.

Men never do evil so completely and cheerfully as when they do it from a religious conviction. - Pascal

• ##### Sheesh(0+ / 0-)

I am so web-challenged

Is it right now?  I tried preview and it seemed to work, but I can't tell.

Whenever we take away the liberties of those whom we hate we are opening the way to loss of liberty for those we love. -- Wendell Willkie

• ##### Good!(0+ / 0-)

I mean, the link is good, not that you are web-challenged - that's bad, but understandable and getting better . . .

BTW: the link yesterday -> day before is just fine . . .

Men never do evil so completely and cheerfully as when they do it from a religious conviction. - Pascal

Best way, as far as I know, is to render the equation as an image:

There's something called MathML which should let you just write formulas as you go, but support for it is spotty.

-dms

• ##### Thanks(0+ / 0-)

I wish I could post LaTeX code here somehow, and have it rendered as formulas.  I will look up MathML.

Whenever we take away the liberties of those whom we hate we are opening the way to loss of liberty for those we love. -- Wendell Willkie

that MathML is vaguely similar to LaTeX.

I used LaTeX to run off that formula; I've got a little applet that lets me type in a LaTeX formula, and it generates a bitmap of whatever size you need. If you're on a Mac, look at http://ktd.club.fr/...

There's probably something comparable for PCs.

-dms

• ##### Thanks(0+ / 0-)

Yes I am on a PC, but will look into it.

Whenever we take away the liberties of those whom we hate we are opening the way to loss of liberty for those we love. -- Wendell Willkie

• ##### Update(0+ / 0-)

It looks like the software that powers Wikipedia (and, by extension, dkosopedia) speaks LaTeX. Details on this wikipedia page.

That won't help much for diaries, but if we do get around to archiving this series, it will make that part of the job somewhat easier.

-dms

• ##### Of these,(2+ / 0-)
Recommended by:
sodalis, plf515

I would add, the standard deviation is the most important for interpreting what's going on with the measures in the distribution.

The standard deviation is the amount by which you can expect a measure to differ from the mean of the distribution.

So if I draw numbers randomly from a distribution, the typical distance of those numbers from the mean (in the root-mean-square sense) will be one standard deviation.

The standard error is just the standard deviation of a distribution of sample means, and thus is the amount by which one can expect a mean to differ from the mean of the means.  (This is the point at which my students' heads begin to pop.)
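
A small simulation may keep a few heads from popping. This Python sketch draws many samples from a made-up normal population, takes the mean of each, and compares the standard deviation of those means with the usual prediction, SD divided by the square root of the sample size. The population values, sample size, and number of samples are all assumptions chosen for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 100, 15  # hypothetical population
SAMPLE_SIZE = 25
N_SAMPLES = 2000

# Draw many samples and keep only the mean of each one.
sample_means = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.mean(sample))

# The standard error is the SD of this collection of means.
observed_se = statistics.stdev(sample_means)
predicted_se = POP_SD / math.sqrt(SAMPLE_SIZE)

print("SD of the sample means:   ", round(observed_se, 2))
print("SD / sqrt(n) (prediction):", round(predicted_se, 2))
```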

The variance is less easily interpreted, but useful for other computations.  The range and its cousins are useful measures of spread for distributions of ordinal data.

I love stats.

-9.25, -7.54

Catecholamines: Can't live with 'em, can't live without 'em.

• ##### I love stats too(3+ / 0-)
Recommended by:
Clem Yeobright, sodalis, bartman

You are, of course, absolutely right about SD and what it is.  Also about standard error.

One place where I disagree is that I actually find the range and the IQR useful for continuous data as well.
First, they have easy interpretations.  Interpreting the SD requires a bit of practice.  The IQR and range are more intuitive.  Second, if the data are not normally distributed, that shows up in the IQR and range, but not in the SD.  Of course, I haven't described the normal distribution yet (for those following along, for now, it's the bell shaped curve --- I will get into more detail in a later diary).

Whenever we take away the liberties of those whom we hate we are opening the way to loss of liberty for those we love. -- Wendell Willkie

• ##### I get stuck on inferential things,(0+ / 0-)

so I don't find the range very useful -- it's a good descriptor, especially if you have hinky distributions, but it's no good for doing inferential stuff.  But that's my bias.

I'm looking forward to the rest of these....

-9.25, -7.54

Catecholamines: Can't live with 'em, can't live without 'em.

• ##### This is a great series(4+ / 0-)
Recommended by:
indybend, Clem Yeobright, sodalis, plf515

I have to say that my single favorite diary genre is when someone who knows what the hell they're talking about takes a specific and often complex and/or misunderstood topic and breaks it down in a way that everyone can understand and learn from. It's what makes DarkSyde's posts so good as well.

Can you re-run during the week?  Then I will be sitting at my desk in learning mode.  Right now I am propped up in my bed, wearing my nightie, nursing a hangover with a big steaming hot Barney's coffee...hard to study in this condition.  =)

Time For A Cool Change: Gore 2008

Have you read this?

Dkospedia: Hotlist

Much better than trying to catch diaries that run off the list in an hour, doncha think?

Men never do evil so completely and cheerfully as when they do it from a religious conviction. - Pascal

• ##### I use dkos like my granny used her microwave...(0+ / 0-)

She pushed the number 1 and start...no matter what she was cooking/heating.
She didn't know nuthin about that newfangled popcorn button or the defrost setting.
Guess I need to get out the Dkos instruction booklet and start getting my 'monies' worth from the site.
Thanks for the nudge Clem.

Time For A Cool Change: Gore 2008

• ##### I am pretty web challenged(1+ / 0-)
Recommended by:
DemiGoddess

myself (see my repeated attempts to get the link in this diary right) but setting up a hotlist is so easy even I could do it, and so can you :-)

Whenever we take away the liberties of those whom we hate we are opening the way to loss of liberty for those we love. -- Wendell Willkie

...so I could find this next week.  Hope you don't mind. :-)

Robyn

Teacher's Lounge opens each Saturday, sometime between 10am and 11am EST

• ##### I don't mind at all(0+ / 0-)

Thanks!

Whenever we take away the liberties of those whom we hate we are opening the way to loss of liberty for those we love. -- Wendell Willkie

• ##### A couple other requests(3+ / 0-)
Recommended by:
bronte17, ama, plf515

I teach some of these stats, and one thing that really makes it come home is graphics.  Can you show some graphics that show curves with mean, median, mode, range, variance, standard deviation, etc.?

You may consider doing this on your web page, and posting this series when it is done on your web page.

A couple other thoughts in terms of stats.  One key thing is to understand how the data are collected.  Is this from a sample at a point in time, like most polling is done, or from a process over time?  (My stats almost always come from a process over time; polling is sampling at a point in time.)

You may also define standard deviation in terms of how much will fall within 1, 2, and 3 standard deviations of the mean.  I.e., 3 standard deviations usually covers about 99% of the data.

Thanks for the series.  This is great.

• ##### I gotta figure a lot(0+ / 0-)

of this out.  I can create the graphics, but am looking into how to get it here.  I have a web page, but someone else designed it and I am not sure how to add stuff like graphics to it.

I will definitely be covering how data is collected at some point, you are absolutely right that it is essential.

The bit about the SD in terms of how much falls within 1, 2, or 3 SDs isn't really a definition, and depends (to an extent) on the distribution of the data (there are some general rules, but they are different from the usual rules for normally distributed data).
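
As a hedged illustration of that difference: one general rule that holds for any distribution with a finite SD is Chebyshev's inequality (my choice of example, not something named above), which says at least 1 - 1/k^2 of the data fall within k SDs of the mean. The Python sketch below puts that guarantee next to the familiar figures for a normal distribution; the function names are my own.

```python
import math


def normal_coverage(k):
    """Share of a normal distribution within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


def chebyshev_lower_bound(k):
    """Minimum share within k SDs for any distribution with a finite SD."""
    return max(0.0, 1 - 1 / k ** 2)


for k in (1, 2, 3):
    print(
        f"within {k} SD: normal about {100 * normal_coverage(k):.1f}%, "
        f"any distribution at least {100 * chebyshev_lower_bound(k):.1f}%"
    )
```

So "3 SDs covers about 99%" is a fair statement for bell-shaped data, but for oddly shaped data the only guarantee is roughly 89%.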

The problem with a series like this (or with teaching stats in general) is what order to cover things in, since everything depends on everything else.

Feel free, though, to add stuff.  I do not want this to be a lecture series; the great thing about blogs is that everyone can add stuff.

So, if you know stuff, feel free to share.  If you don't know stuff, feel free to ask!

Whenever we take away the liberties of those whom we hate we are opening the way to loss of liberty for those we love. -- Wendell Willkie

• ##### More examples pls(0+ / 0-)

I have a hard time conceptualizing generalities until I see actual examples, and then the generalities make perfect sense.

So, in this diary, a case where SD is clearly better for measuring spread, and one where IQR or Range alone is far better.

Also, as a future request - tips on how to spot misuse of stats, maybe common tactics to selectively misquote stats, things to look out for etc.

This is a great series idea, as one who had some trouble with stats despite a heavily mathematical education (software engineering) - this helps, even where I do understand a concept, it's nice to reaffirm that I do.

"I will make a bargain with the Republicans. If they will stop telling lies about Democrats, we will stop telling the truth about them." -- Adlai Stevenson