Problems with the use of Student Test Scores to Evaluate Teachers

by teacherken

Saturday, Aug. 28, 2010 at 9:01:09pm PDT

If new laws or policies specifically require that teachers be fired if their students’ test scores do not rise by a certain amount, then more teachers might well be terminated than is now the case. But there is not strong evidence to indicate either that the departing teachers would actually be the weakest teachers, or that the departing teachers would be replaced by more effective ones. There is also little or no evidence for the claim that teachers will be more motivated to improve student learning if teachers are evaluated or monetarily rewarded for student test score gains.

That is a quote from the Executive Summary of one of the most important policy briefs about education in recent years. At a time when the Dept. of Education is pushing to tie teacher evaluation and compensation to student test scores, this Economic Policy Institute Briefing Paper (whose title is the same as this diary's, and which is a pdf) pulls together the extensive relevant research that demonstrates the dangers of pursuing such a path. Please continue reading as I explore this important document, released at 12:01 AM today, August 29.

First, let me clarify several things.

This is a very long diary. That is because I am trying to cover reasonably thoroughly the contents of an extremely important document. My purpose in doing so is to convince people of the document's importance. Thus I will be perfectly happy should you decide you do not need to read further what I have written below. You can follow the link for the brief (which I have provided you again), download the pdf, and begin reading. The executive summary is only four pages. The brief itself, without the critical apparatus of footnotes and sources, is another 17. So if you want, one more time follow this link.

This document has been in the works for several months, and was NOT hurriedly put together as a response to the recent series by the Los Angeles Times which used value-added assessment to label teachers in the Los Angeles Unified School District. Second, the ten scholars whose names are on the document are some of the most eminent in educational circles, including among them former Presidents of the American Educational Research Association and the National Council on Measurement in Education, two of the three professional organizations most involved with psychological measurement, of which school-related testing is a subset. One of the scholars, Robert Linn, has not only presided over both of those organizations, he has also served as chair of the National Research Council's Board on Testing and Assessment. The group also includes the immediate past president of the National Academy of Education, Lorrie Shepard, Dean of the School of Education at Colorado. A brief and applicable curriculum vitae of each of the ten authors can be found at the end of the document, and briefer descriptions at the beginning, where each author is listed, along with the following statement:

Authors, each of whom is responsible for this brief as a whole, are listed alphabetically.

An email address is provided for further contact.

The ten authors, alphabetically, are as follows:
Eva L. Baker
Paul E. Barton
Linda Darling-Hammond
Edward Haertel
Helen F. Ladd
Robert E. Linn
Diane Ravitch
Richard Rothstein
Richard J. Shavelson
Lorrie A. Shepard

Let me be blunt. I do not know how anyone who knows the work of these scholars and who reads this brief can accept the idea of placing any stakes as to firing or awarding of merit pay based on the current status of Value-Added Assessment methodologies. The document is thorough. It reviews all the relevant studies, including one not yet in print. Those include studies by Mathematica for the US Department of Education; by Rand; by the Educational Testing Service; done for the National Center for Education Statistics of the Institute of Education Sciences of the U. S. Dept. of Education; issued by the Board on Testing and Assessment of the Division of Behavioral and Social Sciences and Education of the National Academy of Sciences, and so on. There are citations from books and from peer reviewed journals.

I am not a scholar. I am a high school social studies teacher. During now abandoned doctoral studies in educational policy I got interested in value-added assessment and devoured what studies there were in the educational literature. I also talked extensively with the technical person for one organization that offered a value-added methodology, who cautioned me that the approach was not stable enough for it to be used as the basis for decisions with any kind of meaningful stakes. That was about a decade ago. What I have read since, and what I have absorbed from this study, convince me that the situation is not significantly better now.

But you do not have to take my word for it. Let me offer a few key examples from the study. Those who follow me on Daily Kos already have seen in the study by Mathematica the high rate of error in determining superior and inferior teachers beyond the broad middle. In this diary, written on August 27, I noted that the error rate with 2 years of data was 36%, with 3 years 26%, and even with 10 years of data still 12%.

But that is just the tip of the iceberg of the technical problems with using such an approach.

Without recapitulating the entire brief, let me offer a couple of other key points.

Results for individual teachers are not stable:

One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%. Another found that teachers’ effectiveness ratings in one year could only predict from 4% to 16% of the variation in such ratings in the following year.
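The 4% to 16% figure can be put in more familiar terms. If "predict X% of the variation" is read as the R-squared of a simple one-predictor linear fit of next year's rating on this year's (an assumption on my part; the passage quoted does not spell out the model), then the implied year-to-year correlation is simply the square root of that share. A minimal sketch, with a hypothetical function name:

```python
import math

def implied_correlation(variance_explained):
    """Correlation implied by an R-squared value in a one-predictor linear fit.

    Assumes the 'share of variation predicted' is an ordinary R-squared,
    which is an interpretation, not something stated in the quoted passage.
    """
    return math.sqrt(variance_explained)

for share in (0.04, 0.16):
    r = implied_correlation(share)
    print(f"{share:.0%} of variation explained -> correlation of about {r:.1f}")
```

Under that reading, ratings in consecutive years correlate at roughly 0.2 to 0.4, which is weak for a measure meant to capture something as lasting as a teacher's effectiveness.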

One key question is whether one is really accounting for teacher effects and excluding other influences in the results one gets from value-added assessment. Jesse Rothstein reported something interesting, about which I quote from the Executive Summary:

A study designed to test this question used VAM methods to assign effects to teachers after controlling for other factors, but applied the model backwards to see if credible results were obtained. Surprisingly, it found that students’ fifth grade teachers were good predictors of their fourth grade test scores. Inasmuch as a student’s later fifth grade teacher cannot possibly have influenced that student’s fourth grade performance, this curious result can only mean that VAM results are based on factors other than teachers’ actual effectiveness.
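A toy simulation may help show why non-random assignment alone can produce this kind of result. It is not Rothstein's model, and every number and name in it is an assumption chosen for illustration. No fifth grade teacher does anything in this simulation; students simply have fourth grade scores and are then placed into fifth grade classes, either at random or by strict tracking on those scores (an extreme case). The sketch measures how much of the variation in fourth grade scores lies between fifth grade classrooms.

```python
import random
import statistics

def share_between_classes(scores, class_of):
    """Fraction of score variance that lies between classes (eta-squared)."""
    grand_mean = statistics.fmean(scores)
    groups = {}
    for score, cls in zip(scores, class_of):
        groups.setdefault(cls, []).append(score)
    between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups.values()
    )
    total = sum((s - grand_mean) ** 2 for s in scores)
    return between / total

rng = random.Random(0)            # fixed seed so the run is repeatable
N_STUDENTS, N_CLASSES = 300, 10   # hypothetical school: ten classes of thirty

# Fourth grade scores in standard-deviation units (assumed normal).
fourth_grade = [rng.gauss(0, 1) for _ in range(N_STUDENTS)]

# Case 1: fifth grade classes assigned at random.
random_classes = [i % N_CLASSES for i in range(N_STUDENTS)]
rng.shuffle(random_classes)

# Case 2: strict tracking, classes filled in order of fourth grade score.
order = sorted(range(N_STUDENTS), key=lambda i: fourth_grade[i])
tracked_classes = [0] * N_STUDENTS
for rank, student in enumerate(order):
    tracked_classes[student] = rank * N_CLASSES // N_STUDENTS

random_share = share_between_classes(fourth_grade, random_classes)
tracked_share = share_between_classes(fourth_grade, tracked_classes)
print(f"random assignment: {random_share:.2f} of 4th grade variance between 5th grade classes")
print(f"tracked assignment: {tracked_share:.2f} of 4th grade variance between 5th grade classes")
```

With random assignment the fifth grade classroom tells you almost nothing about fourth grade scores. With tracking it tells you a great deal, even though the fifth grade teacher had no part in producing them. Real schools fall somewhere between the two cases, and how far from random they are is exactly what the backwards test probes.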

The brief notes arguments that the private sector evaluates professional employees using parallel quantitative measures. The authors of the brief point out that rarely are such quantitative measures the sole or even the primary factor, noting that management experts warn against using such measures for making salary or bonus decisions. They remind us that some of the distortion on Wall Street was the result of emphasizing short term gains that could be easily measured. They also touch on medicine:

In both the United States and Great Britain, governments have attempted to rank cardiac surgeons by their patients’ survival rates, only to find that they had created incentives for surgeons to turn away the sickest patients.

Students are not randomly assigned to teachers. While some control for school effects is possible, scholars are reluctant to place any weight on comparisons for teachers in different schools even within the same system. And even within a school, teachers may have varying numbers of students who are learning English or have learning disabilities or are homeless or who move multiple times, each of which is a factor that can affect learning.

Sample sizes are often too small. Even if the class makeup stays stable during the year, and all the students show up regularly, the N=30 of a large elementary class is too small a sample to provide a result that can allow strong inferences to be drawn. Often the makeup of the class changes during the year. If you exclude students who were not there all year, or whose absences exceed some designated level, the N decreases, providing a result of even less reliability.
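A back-of-the-envelope calculation shows how quickly precision falls as N shrinks. The sketch below treats a class as if it were a simple random sample of students and measures scores in student-level standard deviation units; both are simplifying assumptions of mine, not figures from the brief, and the function name is hypothetical.

```python
import math

def standard_error_of_mean(student_sd, class_size):
    """Standard error of a class average, treating students as a random sample."""
    return student_sd / math.sqrt(class_size)

STUDENT_SD = 1.0  # assumption: scores expressed in standard-deviation units

for class_size in (30, 25, 20):
    se = standard_error_of_mean(STUDENT_SD, class_size)
    margin = 1.96 * se  # rough 95% margin under a normal approximation
    print(f"N={class_size}: standard error {se:.2f}, rough 95% margin +/- {margin:.2f}")
```

Even with all 30 students counted, the class average is uncertain by roughly a third of a student-level standard deviation in either direction, and the margin widens as students are excluded for mobility or absences.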

Some argue that statewide data banks can address the question of student mobility. But if you derive results on a year or two years of data where the student has moved, how much of the improvement can properly be assigned to any one teacher? Even in elementary school, do we account for pull-out instruction, or possible tutoring (that could in some cases be counterproductive) as a possible influence on the test results upon which we base our analysis?

Even with value-added analysis, to date scholars have not been able to isolate the impact of outside learning experiences, home and school supports, and differences in student characteristics and starting points when trying to measure their growth.

A proper system of value-added assessment would have vertically scaled tests. Most states do not currently have such tests; neither New York nor California does, for example. That is, the tests in one grade are not necessarily congruent with those of the next along a continuum from year to year - we are not testing the same thing each year. As testing expert Dan Koretz of Harvard is quoted as noting,

"because of the need for vertically scaled tests, value-added systems may be even more incomplete than some status or cohort-to-cohort systems"

Here it is worth noting that cohort to cohort means comparing this year's fourth graders to last year's, which is how Adequate Yearly Progress under No Child Left Behind has been calculated.

If measuring end of year to end of year, even if there are vertically scaled tests, there is still the well-documented issue of summer learning loss, which falls disproportionately upon those of lesser economic means, which also means it falls disproportionately upon those of color, who are more heavily represented at the lower end of the economic scale. If we do not control for summer learning loss, our results are skewed. Allow me to quote a relevant portion of the study:

researchers have found that three-fourths of schools identified as being in the bottom 20% of all schools, based on the scores of students during the school year, would not be so identified if differences in learning outside of school were taken into account. Similar conclusions apply to the bottom 5% of all schools.

The authors also cite a study that shows "two-thirds of the difference between the ninth grade test scores of high and low socioeconomic status students can be traced to summer learning differences over the elementary years."

There is more, but this should give a real sense of how much there is in this paper, how thoroughly the authors examine relevant material to demonstrate that value-added assessment, the supposed magic bullet to allow us to tie student learning back to the effectiveness of teachers, cannot properly fulfill the task some wish to give to it.

The authors acknowledge that value-added approaches are superior to some of the alternative methods of using test scores to evaluate teachers. These are

status test-score comparisons - compare average scores of students of one teacher to those of another

change measures - compare the average test results of a single teacher from one year to the next - remember, these are different students

growth measures - a comparison of the scores of the students of the teacher this year to the scores of those same students the previous year when they had different teachers.

Each of these approaches has serious problems with it. One can read the detailed explanation on p. 9. Value-added assessments may be an improvement, but

the claim that they can "level the playing field" and provide reliable, valid, and fair comparisons of individual teachers is overstated. Even when student demographic characteristics are taken into account, the value-added measures are too unstable (i.e., vary widely) across time, across the classes that teachers teach, and across tests that are used to evaluate instruction, to be used for the high-stakes purposes of evaluating teachers.

Let me offer a few of the quotes about value-added assessment that the authors of the brief offer from scholars who have examined the approach over the years, and then I will offer a few observations of my own.

In 2003, a research team at Rand concluded

The research base is currently insufficient to support the use of VAM for high-stakes decisions about individual teachers or schools.

In 2004, Donald Rubin opined

We do not think that their analyses are estimating causal quantities, except under extreme and unrealistic assumptions.

Henry Braun, then at ETS, offered this in 2005:

VAM results should not serve as the sole or principal basis for making consequential decisions about teachers. There are many pitfalls to making causal attributions of teacher effectiveness on the basis of the kinds of data available from typical school districts. We still lack sufficient understanding of how seriously the different technical problems threaten the validity of such interpretations.

Last year the Board on Testing and Assessment of the National Research Council of the National Academy of Sciences wrote to the Department of Education saying

...VAM estimates of teacher effectiveness should not be used to make operational decisions because such estimates are far too unstable to be considered fair or reliable.

Finally, this year, a report of a workshop run jointly by The National Research Council and the National Academy of Education offered this:

Value-added methods involve complex statistical models applied to test data of varying quality. Accordingly, there are many technical challenges to ascertaining the degree to which the output of these models provides the desired estimates. Despite a substantial amount of research over the last decade and a half, overcoming these challenges has proven to be very difficult, and many questions remain unanswered...

Let me repeat that last sentence, written this year: Despite a substantial amount of research over the last decade and a half, overcoming these challenges has proven to be very difficult, and many questions remain unanswered...

And yet this administration wants to move ahead with using student test scores, perhaps analyzed through value-added assessment methodologies, as a significant component of teacher evaluation. It is including this as part of the criteria to win Race to the Top Funds. In fairness, the Department does not specify using value-added (although anything else is far worse) nor does it specify what percentage of the evaluation is to depend upon the test scores - both of these decisions are still left to the states, some of which have left themselves wiggle room in their applications, using terms like "significant" to indicate the proportion of the evaluation that will depend upon student test scores.

The original Bush proposal for No Child Left Behind, as it went up on the White House website shortly after the inauguration of the 43rd president, proposed giving a 1% bonus of Title I money to schools that would give parents the value-added scores of the teachers of their students. That, fortunately, did not make it into the final legislation. Now we have the Los Angeles Times action, about which the Secretary of Education has offered a somewhat mixed and confusing response, even as he seems to support the idea of using such evaluations in assessing teachers. Since the Times story broke, some who write or advocate about education have praised what the paper did, while others have condemned it. While mine might not be a major voice on education, I find myself very much in the latter camp.

One problem is that too many who write about education are close to ignorant about the limits of the information one can get from various kinds of assessment. We tend to want hard numbers as a society; we are obsessed with comparisons and rankings. In the process we often give far more credence to quantitative measures than they warrant.

I do not dispute that tests, including tests external to the school, have some utility. I also recognize that value-added assessment is beginning to offer some useful additional information. By itself that information is not sufficiently reliable that people's livelihoods should be either solely or heavily determined by it. Value-added results MAY indicate a teacher outside the norm - either well above or well below - but as the various studies you will encounter in this brief demonstrate, that is not necessarily the case: the results are not yet stable for individual teachers from year to year, we do not yet know how to properly control for non-instructional factors that can influence the scores upon which the analysis is based, nor can we properly distribute responsibility for student learning among the different adults who interact with a child at school.

I am a high school teacher. Let me offer a hypothetical - if I do more work in a social studies class on a particular kind of writing and that is what is assessed on the English exam, does the English teacher properly deserve the credit or blame for how students do on that part of the test? Those of us who teach in high school are aware that students often learn about our content either in other classes or from interactions outside of our classroom. Sometimes what they learn is correct and increases their performance in our class, sometimes it is incorrect and undercuts what we are instructing. To date, even value-added assessment is insufficient to control for such influences and allow proper inferences to be drawn about the actual impact of the teacher upon the learning of the students.

I have only explored a small portion of the material in the brief. You can download it without paying. If you are worried about whether you will be able to understand the contents, don't. You can start with the executive summary, in which you will find most of the key takeaways, written in language and presented in a style that is easily accessible. It is a bit less than four pages. The brief itself runs from pages 5-21, followed by three columns (over a page and a half) of footnotes, and 5 columns (over three and a half pages) of sources. You can read through the brief without having to check the footnotes, or you can, if you want, glance at the back to see who is being cited if that is not clear in the text.

Let me be clear. The authors are not opposed to value-added assessment. They are not even opposed to it being included in the process of teacher evaluation, although they offer some serious cautions that policy makers would be well advised to consider.

The title is accurate - there are still serious problems with using test scores to evaluate teachers. These problems are not solved by resorting to a value-added methodology.

We need to be careful not to denigrate or discourage our teaching corps. We will not improve education if the end result of our efforts is to drive away the very teachers who most connect with students, who are able to inspire those students to persist when they are struggling, who are willing to take on the harder to teach. We have other methods of ascertaining whether teachers are in fact effective. We should not be abandoning them in favor of quantitative measures that cannot, as yet, fully carry the load.

The authors of this study have enough prestige that one can hope our media will give some attention to it. Those responsible for educational policy at local, state and national levels are not doing their jobs if they are unwilling to read and be sure they understand the implications of this brief.

That said, and adding that I will try to bring it to the attention of as many policy makers as I can, I do not have high hopes that our wrongheaded headlong pursuit of quantitative measures of teacher effectiveness can even be slowed. I will add what voice I have to the efforts of these scholars. Perhaps after you read the brief, you will add yours?

Thanks.