Testing, testing, 1, 2, 3

by plf515

Tuesday, Apr. 11, 2006 at 7:22:24pm PDT

There's a lot of talk these days about testing, fairness, and education, and much of it relates to the field known as psychometrics. Since some of this seems relevant to people here, and since that's what I got my PhD in, I thought I'd share some thoughts: a gentle introduction to the field, if you will. So, if you will, join me on the flip, and if you won't, then I'll see you on other diaries.

Any measurement of any quality - weight, income, intelligence, scholastic aptitude, whatever - has certain properties that can be used to judge how good a measurement it is. Three that concern psychometricians are reliability, validity, and bias.

Reliability is whether a measure is consistent: does it give similar results when nothing real has changed?
The most common ways to assess reliability are test-retest reliability and split-half reliability. Test-retest means you give the same test (or measure) to the same people more than once, and see how similar their scores are. Split-half means you divide the items on a test into two parts, and see if the two parts have similar scores.

Validity is whether a measure measures what you think it measures. Although there are lots of types, they boil down to two sorts of things: face validity (does the measure seem right?) and all the other kinds, which involve seeing whether the measure is related to things you think it should be related to, and not related to things it shouldn't be.

Bias is systematic over- or under-estimation.

For something like weight, this is relatively straightforward. Step on a scale. Get off. Step back on. Do the numbers match? Are they close? That's a measure of reliability. (Split-half reliability isn't really meaningful here.) If you add a weight to the scale, do the numbers go up? Do things that seem bigger and heavier weigh more than things that seem smaller and lighter? Do you weigh more with your clothes on than naked? More after a big meal than before? All those are signs of validity. OTOH, if your weight changes when you touch your nose, that's not so good.

Bias would mean that a scale consistently overestimated or underestimated your weight. If three scales say you weigh 120 pounds, and another says you weigh 130, and there are similar differences for other people, then at least one is biased.
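
Here is a small sketch of that scale comparison in Python. The readings are invented, and `estimated_bias` is just a name I made up. It uses the median reading across scales as a stand-in for each person's weight, which is an assumption: without a known true weight, all you can measure is how far each scale sits from the others, not which one is actually right.

```python
from statistics import mean, median


def estimated_bias(readings):
    """readings maps a scale name to its readings, same people in the same order.

    Returns each scale's average deviation from the per-person median
    across all scales (a relative bias, not an absolute one).
    """
    names = list(readings)
    n_people = len(readings[names[0]])
    consensus = [median(readings[s][i] for s in names) for i in range(n_people)]
    return {
        s: mean(r - c for r, c in zip(readings[s], consensus))
        for s in names
    }


# Invented readings in pounds: three people weighed on four scales.
readings = {
    "scale_a": [120, 150, 180],
    "scale_b": [120, 151, 179],
    "scale_c": [121, 150, 180],
    "scale_d": [130, 160, 190],
}

bias = estimated_bias(readings)
for name, value in bias.items():
    print(name, round(value, 1))
```

A scale that is merely unreliable would bounce above and below the others; a biased one, like the made-up `scale_d` here, is off in the same direction for everybody.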

When it comes to things like intelligence or scholastic aptitude, though, things are murky. They are also controversial. This combination hardly leads to sensible debate! For intelligence, in fact, things are so murky that they might as well be statements by George Bush.

Why? Well, for intelligence, no one knows what it means. There's no agreed-upon definition. So, does IQ measure intelligence? Well, if you tell me what intelligence is, I will tell you if IQ measures it. I will say, though, that IQ seems to be related to something that most people think of as intelligence, at least in general. If you had a 10-minute conversation with 10 people who had IQs of 140, 10 with IQs of 100, and 10 with IQs of 60, you'd probably be able to guess which is which. And IQ tests are at least moderately reliable (not nearly as reliable as bathroom scales, but better than tea leaves or astrology), so they are measuring something.

For scholastic aptitude, things are a little better. We have at least some idea what we mean by that. It has something to do with getting good grades. But not exactly, because it is APTITUDE, and isn't intended to measure desire, or time availability, or a dozen other things.
But there's a bigger problem with assessing the validity of SATs: which college you go to is determined in part by your SAT score, and what grades you get are determined in part by which college you go to. There are ways around this; in particular, there's a statistical technique called hierarchical linear models (aka mixed models, random effects models, and several other terms) that seems promising. I have, however, seen no studies using this method for SATs and college grades.
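
To see why the college in the middle matters, here is a toy simulation in Python. Everything in it is invented: the three college tiers, the assumption that more selective colleges grade harder, and all the numbers. It is also not a hierarchical linear model. It uses a much cruder relative, subtracting each college's average before correlating, just to show how ignoring the grouping can hide a relationship that exists inside each college.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def center_within_groups(values, groups):
    """Subtract each group's mean from the values of its members."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


rng = random.Random(1)
sat, gpa, college = [], [], []
for _ in range(3000):  # hypothetical students
    aptitude = rng.gauss(0, 1)
    score = aptitude + rng.gauss(0, 0.5)  # test score = aptitude + noise
    # Assumption: the score alone decides the college tier (0, 1, or 2).
    tier = 0 if score < -0.5 else (1 if score < 0.5 else 2)
    # Assumption: more selective tiers grade harder.
    grade = 3.0 + 0.4 * aptitude - 0.5 * tier + rng.gauss(0, 0.3)
    sat.append(score)
    gpa.append(grade)
    college.append(tier)

pooled = pearson(sat, gpa)
within = pearson(
    center_within_groups(sat, college),
    center_within_groups(gpa, college),
)
print("ignoring college:", round(pooled, 2))
print("within college:  ", round(within, 2))
```

In this made-up setup the pooled correlation comes out much weaker than the within-college one, because higher scorers land in colleges that grade harder. A real analysis would need real data and a proper mixed model, which also handles things this shortcut ignores.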

There's even MORE to the problem, because the SATs might be more valid for some people than for others. In particular, people with various disabilities (blindness, deafness, and various learning disabilities) may be measured less validly than more typical people.

SATs, like IQs, are at least somewhat reliable.

The real controversy comes in with regard to bias. Some groups, in particular certain racial and ethnic groups, do worse on IQ tests and SATs than other groups. That's just a fact, and it's not controversial. What is blazingly controversial is why these differences exist. Part of it is due to factors that correlate with race: number of siblings, parental income, parental education, likelihood of one versus two parents in the home. But, AFAIK, some differences still persist, and no one is sure why. I certainly don't know.

But another part of the controversy is whether SATs should be used for college admission, given the above. Well, the real problem here is what to use instead. Typical choices are high school grades, college entrance essays, and interviews. The problem is, these other methods are even more biased than SATs. And interviews have MUCH lower reliabilities than SATs.

I don't have any answers, but I hope this at least illuminates the questions.