There's a lot of talk these days about testing, fairness, and education.  A lot of this relates to the field known as psychometrics.  Since it seems like some of this is relevant to some people here, and since that's what I got my PhD in, I thought I'd share some thoughts.  A gentle introduction to the field, if you will.  So, if you will, join me on the flip, and if you won't, then I'll see you on other diaries.

Any measurement of any quality - weight, income, intelligence, scholastic aptitude, whatever - has certain properties that can be used to judge how good a measurement it is.  Three that are of concern to psychometricians are reliability, validity, and bias.

Reliability is whether a measure is accurate and consistent.
The most common ways to assess reliability are test-retest reliability and split-half reliability.  Test-retest means you give the same test (or measure) to the same people more than once, and see how similar their scores are.  Split-half means you divide the items on a test into two parts, and see if the two parts have similar scores.
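
For readers who like to see the arithmetic, here is a minimal sketch of both checks.  The scores and item responses are invented for illustration, and the Spearman-Brown step is a standard correction that I'm adding here (it isn't discussed above): each half is only half as long as the full test, so the raw correlation between halves understates the reliability of the whole thing.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Test-retest: the same five (hypothetical) people take the same test twice.
first_sitting = [12, 15, 9, 20, 17]
second_sitting = [13, 14, 10, 19, 18]
test_retest = pearson(first_sitting, second_sitting)

# Split-half: one sitting, scored item by item (1 = right, 0 = wrong).
# Each row is one hypothetical person; each column is one item.
items = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
]
odd_half = [sum(row[0::2]) for row in items]   # items 1, 3, 5
even_half = [sum(row[1::2]) for row in items]  # items 2, 4, 6
half_r = pearson(odd_half, even_half)

# Spearman-Brown correction: estimate reliability of the full-length test.
split_half = 2 * half_r / (1 + half_r)

print(f"test-retest reliability: {test_retest:.2f}")
print(f"split-half reliability:  {split_half:.2f}")
```

Splitting by odd and even items is just one choice of halves; a different split will generally give a somewhat different number.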

Validity is whether a measure measures what you think it measures.  Although there are lots of types, they boil down to two sorts of things: face validity (does the measure seem right?) and all the other kinds, which all involve seeing whether the measure is related to things you think it should be, and not related to things it shouldn't be.

Bias is systematic over- or under-estimation.

For something like weight, this is relatively straightforward.  Step on a scale.  Get off.  Step back on.  Do the numbers match?  Are they close?  That's a measure of reliability.  (Split-half reliability isn't really meaningful here.)  If you add a weight to the scale, do the numbers go up?  Do things that seem bigger and heavier weigh more than things that seem smaller and lighter?  Do you weigh more with your clothes on than naked?  More after a big meal than before?  All those are signs of validity.  OTOH, if your weight changes when you touch your nose, that's not so good.

Bias would mean that a scale consistently over-estimated or under-estimated your weight.  If three scales say you weigh 120 pounds, and another says you weigh 130, and there are similar differences for other people, then at least one is biased.
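
Here is a rough sketch of that scale comparison.  All the readings are made up, and using the median across scales as the stand-in for "true" weight is my assumption for the sake of the example; without a known standard weight, the numbers can't tell you whether the odd scale is the biased one or the other three are.

```python
from statistics import mean, median

# Hypothetical readings (pounds) for the same three people on four scales.
readings = {
    "scale_a": [120, 150, 180],
    "scale_b": [121, 149, 180],
    "scale_c": [119, 151, 181],
    "scale_d": [130, 160, 191],
}
n_people = 3

# Assumption: treat the median reading for each person as the reference value.
consensus = [median(r[i] for r in readings.values()) for i in range(n_people)]

# Average signed difference from the reference: a consistent offset suggests bias,
# while small differences that bounce around zero are just ordinary unreliability.
offsets = {
    name: mean(r[i] - consensus[i] for i in range(n_people))
    for name, r in readings.items()
}

for name, offset in offsets.items():
    print(f"{name}: average offset {offset:+.1f} lb")
```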

When it comes to things like intelligence or scholastic aptitude, though, things are murky.  They are also controversial.  This combination is hardly one to lead to sensible debate!  For intelligence, in fact, things are so murky that they might as well be statements by George Bush.

Why?  Well, for intelligence no one knows what it means.  There's no agreed-upon definition.  So, does IQ measure intelligence?  Well, if you tell me what intelligence is, I will tell you if IQ measures it.  I will say, though, that IQ seems to be related to something that most people seem to think of as intelligence.  In general.  If you had a 10-minute conversation with 10 people who had IQs of 140, 10 with IQs of 100, and 10 with IQs of 60, you'd probably be able to guess which is which.  And IQ tests are at least moderately reliable (not nearly as reliable as bathroom scales, but better than tea leaves or astrology), so they are measuring something.

For scholastic aptitude, things are a little better.  We have at least some idea what we mean by that.  It has something to do with getting good grades.  But not exactly, because it is APTITUDE, and isn't intended to measure desire, or time availability, or a dozen other things.
But there's a bigger problem with assessing the validity of SATs - which college you go to is determined in part by your SAT score, and what grades you get is determined in part by which college you go to.  There are ways around this; in particular, there's a statistical technique called hierarchical linear models (aka mixed models, random effects models, and several other terms) that seems promising.  I have, however, seen no studies using this method for SATs and college grades.
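
To make the problem concrete, here is a toy sketch with invented numbers.  It is NOT a hierarchical linear model; it only compares the plain correlation across all students with the correlation after subtracting each college's own averages, which is a much cruder way of accounting for the grouping.  The three colleges and every score in it are assumptions chosen to make the effect easy to see.

```python
from collections import defaultdict
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# (college, SAT, first-year GPA) - invented data.  Higher scorers attend
# more selective colleges, but every college hands out the same range of grades.
students = [
    ("A", 900, 2.5), ("A", 1000, 3.0), ("A", 1100, 3.5),
    ("B", 1100, 2.5), ("B", 1200, 3.0), ("B", 1300, 3.5),
    ("C", 1300, 2.5), ("C", 1400, 3.0), ("C", 1500, 3.5),
]

# Naive approach: pretend all nine students are independent observations.
pooled_r = pearson([sat for _, sat, _ in students],
                   [gpa for _, _, gpa in students])

# Crude adjustment: center SAT and GPA on each college's own mean first.
by_college = defaultdict(list)
for college, sat, gpa in students:
    by_college[college].append((sat, gpa))

centered_sat, centered_gpa = [], []
for rows in by_college.values():
    mean_sat = sum(sat for sat, _ in rows) / len(rows)
    mean_gpa = sum(gpa for _, gpa in rows) / len(rows)
    for sat, gpa in rows:
        centered_sat.append(sat - mean_sat)
        centered_gpa.append(gpa - mean_gpa)

within_r = pearson(centered_sat, centered_gpa)

print(f"pooled correlation:         {pooled_r:.2f}")
print(f"within-college correlation: {within_r:.2f}")
```

In this rigged example the SAT ranks students perfectly inside each college, yet the pooled correlation looks weak, because a 3.0 at one college is not the same thing as a 3.0 at another.  Real data would be far messier, which is why a proper multilevel analysis would be needed.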

There's even MORE to the problem, because the SATs might be more valid for some people than for others.  In particular, people with various disabilities (blindness, deafness, and various learning disabilities) may be measured less validly than more typical people.

SATs, like IQs, are at least somewhat reliable.

The real controversy comes in with regard to bias.  Some groups, in particular certain racial and ethnic groups, do worse on IQs and SATs than other groups.  That's just a fact, and it's not controversial.  What is blazingly controversial is why these differences exist.  Part of it is due to factors that correlate with race - number of siblings, parental income, parental education, likelihood of one versus two parents in the home.  But, AFAIK, some differences still persist, and no one is sure why.  I certainly don't know.

But another part of the controversy is whether SATs should be used for college admission, given the above.  Well, the real problem here is what to use instead.  Typical choices are high school grades, college entrance essays, and interviews.  The problem is, these other methods are even more biased than SATs.  And interviews have MUCH lower reliabilities than SATs.

I don't have any answers, but I hope this at least illuminates the questions.

  •  Tips? Comments? Arguments? (4+ / 0-)
    This is the spot.

  •  good start (1+ / 0-)
    i was disappointed that you ended by talking only about SAT's (and IQ's).  in the age of NCLB, other acronyms and weeks of testing play a big role in making school boring... and graduation impossible... for a whole lot of kids.  see e.g. http://www.civilrightsproject.harvar...

    •  Thanks (0+ / 0-)

      I don't know that much about the NCLB tests; I studied this stuff long before that happened.

      I do agree that a lot of NCLB is just plain dumb, and I've said so in comments on other diaries (I think I made these comments on one of TeacherKen's diaries)

      •  The testing for NCLB (1+ / 0-)
        should be measuring achievement, which is completely different from IQ testing, and somewhat different from aptitude testing.

        Aptitude testing like the SAT seeks to predict future performance, and although some of the measurement may be of things you already know or are expected to know (and a lot of the SAT is), the goal of the test is still predictive.

        Achievement testing, which is what most classroom tests are, simply tries to measure if you've mastered certain learning objectives (and even higher-order skills, like analysis, criticism and synthesis can be learning objectives).

        At least that's my recollection from a 3 credit course in Test and Measurement - no PhD here.

        •  OK (0+ / 0-)

          That's about what I knew about the NCLB tests - what they are supposed to be doing.  But I haven't seen anything on their reliability or validity.

          The SAT isn't supposed to measure future performance, at least not exactly.  It's supposed to measure aptitude, which is a little different.  SAT scores ought to measure something like 'future performance if effort were equal'.

  •  I think... (1+ / 0-)
    ...that the SATs have some use.  For example, I enrolled in a community college and my SAT scores were high enough to allow me to skip the placement tests normally required.  So it personally benefitted me.

    However, to use it as a measure of intelligence is just stupid.  My SAT didn't get me into college, or help me get in.  An enrollment form and $45 did that.  Many smart people suck at tests.  So using them for college admissions seems pointless to me.

    •  What should we use instead? (0+ / 0-)

      I agree that there are people who suck at tests but would do well in school.  And people who do well on tests, but won't do well in college.  No one is proposing (least of all me) that they be the ONLY criterion.  But the other criteria are, if anything, worse.

      •  I don't know (0+ / 0-)

        I'd suggest using mainly high school scores and merits for admissions.  Though I didn't mean to imply anything about you.

        •  The problem with HS grades (0+ / 0-)

          The problem with HS grades is 1)  that they aren't comparable to each other - there are thousands and thousands of high schools, and an A from one is not necessarily an A from another.  This is helped a little by using class rank.  But some HS are now refusing to release class rank (saying it puts too much pressure on kids) and even when they do, being in the top 10% of different classes means different things in different HS as well.  2) They are also subject to bias - teachers are human, and may, consciously or not, grade kids unfairly.

          OTOH, HS grades *do* measure some things that SATs don't, such as effort in class.  So, I think we should use both.

          •  and yet correlate w/college grades (0+ / 0-)
            at least as well as SAT, which lacks any predictive value beyond the 1st year, for a variety of reasons, and which really does not account for that much of the variance in 1st year college grades.

            •  Data (0+ / 0-)

              I have seen others say this, as well, but never seen the actual data.....have you got a link?  Or a citation?  I tried googling (including Google Scholar) and found very little.

              But there is also the statistical problem - as I mentioned in my diary.  Straight correlation or regression would assume that the observations are independent, when they clearly are not.  First, GPA is not independent of school; second, if you try to figure out multiple years, then GPA in year 2 is not independent of GPA in year 1.

              Then there's the problem of figuring in for people who drop out...

              Very tricky.  I'd be very interested in any articles you (or anyone) can find on this

              •  not readily available (0+ / 0-)

                and sorry, today is the first of two days set aside for doing taxes, so I cannot search.  I know I have stuff in hardcopy someplace that cites this, but cannot look for it now.

                also, given my own experience of doing test prep  - for Princeton and another company, a total of 5 years  -  I would question the reliability of SAT because of how much I was able to raise scores.

                I am not a fan of SAT, even though I personally do quite well on such tests.

  •  Rehabilitate the SAT (0+ / 0-)

    The SAT actually began as a way to make college more accessible to less privileged students.  Before the SAT, admissions at top colleges were sometimes granted based upon nothing more than a letter of introduction from a well-placed family member or school dean.

    The SAT could be seen as a democratic challenge to admissions policies that are biased in favor of privilege.  Unfortunately, the same inequities in education that it was meant to address simply repeated themselves in the training for the test.  So privileged kids get $100/hour tutoring and poor, public-school students get nothing but a foreign-looking puzzle.

    What's worse, the test gets slammed (often deservedly) in a way that can only diminish the motivation of the students who would benefit from it the most.

    I'm working on it, though.

  •  One question. And you better get it right. (1+ / 0-)
    At least SAT and ACT tests are, in the main, voluntary and used by colleges who can pick and choose how much weight to give them.

    The same cannot be said about the state tests under NCLB, which are given unusually high credence by government officials in terms of validity and reliability.

    For the most part, they attempt to measure whether or not a student is scoring at "grade level".  What is grade level?  Well, that is the median score in a random sample of students in that grade:  half score above, half score below.  

    And so, under NCLB, the trick is to get everyone, 100% up to grade level.  But you see the problem, we already have determined that grade level is simply a statistical mid-point and does not really exist in reality.  In reality, kids are coming and going from schools, rising and falling in interest, attention, engagement--in point of fact, just about everything imaginable is happening to those kids.

    And yet, if just 1 of them does not test up to the median for their grade level, the school will be judged to have failed by 2013.  

    Psychometrician Man, tell me if this makes sense to you on a statistical level.

    •  Lake Wobegon already compliant (1+ / 0-)
      Every week the report says that 'all the children are above average'.

      One small school district in Minnesota down, rest of United States to go.

    •  Well, no, that makes no sense (0+ / 0-)

      I am not that familiar with NCLB legislation.  But it's hard to believe that even the Republicans would be dumb enough to say that everyone has to be above the median - that's an impossibility.  The median is the point where 50% are below and 50% above.

      Others here will know more, but I *think* what they say is that all the kids have to be at some proficiency level.

      Even then, though, the idea of putting that much emphasis on one test, with limited reliability and validity, is highly questionable.  

      •  depends where state sets proficiency (1+ / 0-)
        in theory a state can set proficiency levels at a high enough point that you are trying to achieve Lake Wobegon effects

        or because of the political implications of too many children "failing" the state could set its levels so low that almost everyone is shown as proficient.

        We have clear examples of states manipulating cut scores on their own tests to show "improvement" - in Virginia this happened with SOL tests for US history at both the high school and middle school levels.  Most of the "improvement" in pass rates between the testing in 2001 and that in 2003 was because the cut scores were significantly lower.

        With respect to NCLB, in theory NAEP is supposed to be used as a control, but that means it will lose its value as a research evaluator because states will begin to game it, as they already have in so-called state NAEP.  And that is still independent of what the proficiency levels of NAEP mean, a point on which Bracey for one fulminates with regularity.

  •  Not familiar with SAT... (1+ / 0-)
    but I watched my husband go through GMAT.

    I don't think it measures what it purports to measure. For one thing, there are all those books that "train" you to get a higher score. And they work. They work by teaching people how to get around the weird quirks of the GMAT questions. That's not measuring "aptitude" at English, or math, or anything but reading the test producers' minds.

    I have been reading since I was three. I was reading Dickens for pleasure when I was ten; I have probably read at least 40,000 books in my life. I edited a small newspaper for five years. I have written graduate-level papers, corporate vision statements, technical documentation, cost-benefit studies for new technology, letters to the editor, comments on usenet and blogs, a number of short stories, and two novels. I have made a particular study all my life of the effective use of English. I've been at it for sixty-plus years, and I'm still working at it.

    I say this not to brag, but to make a point. You see, if I had taken the test without preparation, I would have scored quite badly in the language section. Now, if I can't do reasonably well on a test that supposedly measures "aptitude" for English, then I'm confident it's not me that's at fault. It's the test.

    I would have done worse, in fact, than some of my husband's fellow students whose native language was not English, who had little grasp of idiom and could barely construct a grammatical sentence. They did well only because they had drilled on sample test questions over and over and over again.

    That drilling, by the way, did not help them to handle the level of English required by an MBA program. They struggled; if they were lucky, they'd have a fellow team member who could edit their contribution to the team's assignment to be readable.

    My conclusion? The GMAT doesn't actually measure anything except the learnable but essentially useless skill of acing the GMAT.

    •  A good point, clearly (0+ / 0-)

      Yes, the role of the tutoring firms and preparation books is problematic.  The evidence for their effect on SAT scores is not conclusive, but there is at least some evidence that they help.  They may help more for the GMAT, which is more specialized.

      Unfortunately, there is no way to ban these.....

      As for your particular case, there's a Yiddish saying: "For instance is not proof".  I admit, as do all responsible parties, that some people who are highly competent (obviously you are) do badly on these tests.

      •  The interesting thing is (0+ / 0-)
        the reason I would have done badly.

        For many of the sample questions I saw, the "right" answer depended, not on accurate comprehension or a large vocabulary, but on noticing deliberate trickery. In real life, the texts and academic papers a grad student is required to read may at times be obscurely or awkwardly phrased, and will often use very specialized terminology -- but they are never purposely deceptive.

        The "training" works because the people constructing the GMAT test have only a limited bag of tricks. Once you have seen them repeated often enough, you can learn to avoid the traps.

        On the composition questions, GMAT scoring rewards rote learning of a handful of pre-approved formulas that can be quickly filled out with details, rather than original thought or clear, effective expression. The training books are quite frank about this. They explicitly tell students that these formulas they're learning will be of no use to them in graduate studies.

        The problem is not the existence of preparation books, and the solution is not to ban them. They would not be able to improve scores -- nobody would want to buy them -- if the GMAT were really testing what it pretends to test: the ability to excel in graduate school.

        I think whatever correlation is found to exist between the test and success in grad school has a common third cause. The test selects, not so much for aptitude, as for a pragmatic willingness (and ability) to game the system. That is a success factor of sorts, I suppose, but it's not the one the GMAT claims to be measuring.

        Moreover, it penalizes some students who are naively idealistic and interested only in the subject matter they want to study. Unfortunately, it is this latter group, caring only about knowledge for knowledge's sake, too impatient to waste time on anything as irrelevant as gaming the system, that is most likely to give us a real advancement in human understanding of the world. What is needed is a better test, one that can select for the people who really should be in grad school.

  •  sorry to be so late to discussion (1+ / 0-)
    had not known of diary until receiving your email.  I would have recommended for the quality of clear explanations of key terminology.

    Keep posting.

    •  thanks! (0+ / 0-)


      It is so hard to keep track of all the diaries here!
      And I keep adding people to my hotlist!

      I'm gonna have to retire just to read daily kos!  

