The default assumption in the value-added literature is that teacher effects are a fixed construct that is independent of the context of teaching (e.g., types of courses, student demographic compositions in a class, and so on) and stable across time. Our empirical exploration of teacher effectiveness rankings across different courses and years suggested that this assumption is not consistent with reality. In particular, the fact that an individual student’s learning gain is heavily dependent upon who else is in his or her class, apart from the teacher, raises questions about our ability to isolate a teacher’s effect on an individual student’s learning, no matter how sophisticated the statistical model might be.
These are key words from a new study, which I explore below the fold.
Those words are from a new study on the stability of teachers' scores under Value-Added methodologies, which attempt to ascertain teachers' effects upon students' scores on tests. Along with the recent policy brief by the Economic Policy Institute, about which I wrote here, we are now presented with an understanding of the very real limitations of attempting to use student scores as a key element in evaluating teacher effectiveness.
The study in question, titled Value-Added Modeling of Teacher Effectiveness: An Exploration of Stability across Models and Contexts, was released late last month at Education Policy Analysis Archives, a peer-reviewed online journal of education policy previously edited by our own SDorn, Sherman Dorn of the University of South Florida (disclosure: like me, an alumnus of Haverford College), who is still on the editorial board. Two of the four authors of this study, Edward Haertel and Linda Darling-Hammond, are also among the authors of the EPI policy brief.
The link to the article takes you to a webpage with an abstract, and also a link from which you can download the entire study as a PDF. The abstract reads as follows:
Recent policy interest in tying student learning to teacher evaluation has led to growing use of value-added methods for assessing student learning gains linked to individual teachers. VAM analyses rely on complex assumptions about the roles of schools, multiple teachers, student aptitudes and efforts, homes and families in producing measured student learning gains. This article reports on analyses that examine the stability of high school teacher effectiveness rankings across differing conditions. We find that judgments of teacher effectiveness for a given teacher can vary substantially across statistical models, classes taught, and years. Furthermore, student characteristics can impact teacher rankings, sometimes dramatically, even when such characteristics have been previously controlled statistically in the value-added model. A teacher who teaches less advantaged students in a given course or year typically receives lower effectiveness ratings than the same teacher teaching more advantaged students in a different course or year. Models that fail to take student demographics into account further disadvantage teachers serving large numbers of low-income, limited English proficient, or lower-tracked students. We examine a number of potential reasons for these findings, and we conclude that caution should be exercised in using student achievement gains and value-added methods to assess teachers’ effectiveness, especially when the stakes are high.
Since I realize that is dense, let me parse it a bit:
We find that judgments of teacher effectiveness for a given teacher can vary substantially across statistical models, classes taught, and years. The key words are vary substantially - if we are measuring something that is basic, it should not vary that much because of the particular value-added methodology being used. For a point of comparison: whether I measure you with a yardstick or a tape measure, using inches at one time and centimeters at another, if I am really measuring the same thing (height) I should get consistent results (with perhaps some slight variation due to measurement error).
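To make the yardstick analogy concrete, here is a minimal Python sketch - entirely hypothetical numbers of my own invention, not data from the study - showing that two noisy instruments measuring the same underlying quantity still produce nearly identical rankings, the kind of agreement we would expect from competing value-added models if they were truly capturing the same stable teacher trait:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true heights (in inches) for 100 people.
true_height_in = rng.normal(66, 4, size=100)

# Two instruments measuring the same construct in different units,
# each with a little independent measurement error.
yardstick_in = true_height_in + rng.normal(0, 0.2, size=100)       # inches
tape_cm = true_height_in * 2.54 + rng.normal(0, 0.5, size=100)     # centimeters

# If both tools really measure the same underlying thing, the
# rankings they produce should agree almost perfectly.
rank_a = yardstick_in.argsort().argsort()
rank_b = tape_cm.argsort().argsort()
print(f"rank correlation: {np.corrcoef(rank_a, rank_b)[0, 1]:.3f}")
# Prints a value near 1.0 -- unlike the teacher rankings in the study,
# which shifted substantially from one VAM model to the next.
```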
But the method of measuring is only one problem. If there are different results because of classes taught, that MAY be because of different levels of effectiveness in different curricula, or it could be something else. And if there is substantial variance from year to year, either the teacher is very inconsistent (which does not seem all that likely) or that variance is due to something not under the control of the teacher.
Furthermore, student characteristics can impact teacher rankings, sometimes dramatically, even when such characteristics have been previously controlled statistically in the value-added model. - this is CRITICAL. People have justifiably argued that using the single score at the end of a year tells us little about what the teacher has done, and may primarily reflect the knowledge with which students arrived. Value-added analysis is supposed to provide a method that allows us to control for different characteristics among students, and thus enable us to focus in on what impact the teachers actually had. But if, despite our attempts to control for variance among students, that variance still seriously impacts the derived value-added score, then relying upon value-added approaches is dangerous, because we will be making decisions that cannot be justified by the data we are using.
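One way to see how a "controlled" characteristic can still leak into teacher rankings is a toy simulation. The sketch below - all numbers, coefficients, and variable names are my own assumptions, not the study's - fits an ordinary value-added regression that controls for prior scores and an individual demographic indicator, yet the estimated teacher effects still track classroom composition, because the peer effect operating at the classroom level was never in the model:

```python
import numpy as np

rng = np.random.default_rng(1)
n_students, n_teachers = 2000, 50

# Hypothetical student data: prior test score, an individual
# demographic indicator (say, free-lunch status), teacher assignment.
prior = rng.normal(0, 1, n_students)
demo = rng.binomial(1, 0.4, n_students).astype(float)
teacher = rng.integers(0, n_teachers, n_students)

# Simulate gains in which classroom composition matters, not just
# the teacher: each class's share of "demo" students drags on gains.
true_teacher = rng.normal(0, 0.2, n_teachers)
peer_share = np.array([demo[teacher == t].mean() for t in range(n_teachers)])
gain = (0.5 * prior - 0.3 * demo - 1.5 * peer_share[teacher]
        + true_teacher[teacher] + rng.normal(0, 1, n_students))

# A standard value-added regression: gain on prior score, the
# demographic indicator, and one dummy per teacher. The dummy
# coefficients are the estimated "teacher effects."
dummies = (teacher[:, None] == np.arange(n_teachers)).astype(float)
X = np.column_stack([prior, demo, dummies])
coef, *_ = np.linalg.lstsq(X, gain, rcond=None)
teacher_effects = coef[2:]

# The individual characteristic was "controlled," yet the estimated
# effects come out clearly negatively correlated with class
# composition: the dummies absorbed the unmodeled classroom effect.
print(np.corrcoef(teacher_effects, peer_share)[0, 1])
```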
The study is highly technical. I suspect that those who have read the abstract might already begin to grasp that. Let me try to summarize some of the key findings, and offer some additional words of my own. If you feel comfortable, you can of course skip all of what I am offering and simply read the study on your own.
The authors, during their review of the previous literature on value-added, note several key limitations already identified. Among these are the failure to control for school effects, the failure to control for student effects, and the instability of teacher effects over time. In other words, the limitations of a value-added approach are well-documented in the peer-reviewed professional literature.
Some of the factors that value-added approaches do not control for include differences in class sizes, curriculum materials, availability of instructional supports, and the competence of principals and peers.
Also not controlled for are the effects that other students in the classroom have upon an individual student's learning. As the authors note,
In particular, the fact that an individual student’s learning gain is heavily dependent upon who else is in his or her class, apart from the teacher, raises questions about our ability to isolate a teacher’s effect on an individual student’s learning, no matter how sophisticated the statistical model might be.
And without going through all the results obtained, let me quote the following:
Even in models that controlled for student demographics as well as students’ prior test scores, teachers’ rankings were nonetheless significantly and negatively correlated with the proportions of students they had who were English learners, free lunch recipients, or Hispanic, and were positively correlated with the proportions of students they had who were Asian or whose parents were more highly educated. In addition, English teachers were more highly ranked when they had a greater proportion of girls in their classes, and math teachers were more highly ranked when they had more students in their classes who were on a "fast track" in mathematics (that is, taking a given course at an earlier grade than it is generally offered in the school’s usual math sequence).
Absorb that - these are factors impacting the scores that are NOT under the control of the teacher.
Then note this, which requires no technical background to understand:
Furthermore, the magnitude of the differences among teachers’ rankings across models was also significantly related to student demographics, especially in English language arts: Teachers with more students who were African American, Hispanic, low-income, limited English proficient students, and whose parents had lower levels of education were more likely to be ranked significantly lower when student demographics were omitted from the VAM model.
The study also carefully examines the technical limitations of the analysis by Buddin upon which the now infamous LA Times articles were based. Those limitations are many, but unfortunately little of the media coverage explained to readers how severe they were.
Let me focus on what I think is the most important finding of this study, which is the affirmation of what had already been known: the instability of value-added scores for teachers. In my examination of the EPI policy brief, and even before, I had focused on the results of a study done for the US Department of Education by Mathematica, which found that if you attempted to classify teachers as superior or inferior to the vast majority, with 2 years' data (the most common model) you had a 36% rate of misclassification, with 3 years 26%, and even with 10 years (rarely obtainable) 12%.
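Why does adding years of data help so slowly? A small simulation makes the mechanism visible. This is a hedged illustration - the noise level and the classification rule are my own assumptions, chosen only to show the shape of the problem, not to reproduce Mathematica's figures:

```python
import numpy as np

rng = np.random.default_rng(2)
n_teachers = 1000

# Assumption: each teacher has a stable "true" effect, and each
# year's estimate adds independent noise. The noise level is an
# illustrative choice, not a figure from the Mathematica study.
true_effect = rng.normal(0, 1, n_teachers)
noise_sd = 1.5
truly_above = true_effect > 0   # truly better than the average teacher

for years in (2, 3, 10):
    yearly = true_effect[:, None] + rng.normal(0, noise_sd, (n_teachers, years))
    judged_above = yearly.mean(axis=1) > 0
    error_rate = (judged_above != truly_above).mean()
    print(f"{years:>2} years of data: {error_rate:.0%} misclassified")
# Averaging more years shrinks the error, but only slowly -- the same
# pattern as the 36% / 26% / 12% figures cited above.
```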
The study examines this more closely. It divides performance into deciles (tenths), and examines the percentage of teachers who change deciles across various dimensions: by the value-added model being used, by the course being taught, and by the year of the student data. Let me state only the percentage of teachers who changed by 2 or more deciles (a sketch of the computation follows the list):
methodology: 12-33% (depending upon which two models were compared)
course: 54-92% (depending upon the model used)
year: 45-63% (depending upon the model used)
If one looks only at year, 19-41% vary by three or more deciles - in some cases by as many as 8!!!
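For readers who want to see exactly what such a decile-stability check involves, here is a minimal sketch using synthetic data - the equal split between stable signal and condition-specific noise is a made-up assumption of mine, and none of these numbers come from the study - of how one would compute the share of teachers moving two or more deciles between two rankings:

```python
import numpy as np

def decile(scores):
    """Assign each teacher a performance decile (0 = bottom, 9 = top)."""
    ranks = scores.argsort().argsort()
    return ranks * 10 // len(scores)

rng = np.random.default_rng(3)
n_teachers = 500

# Assumption: each teacher's estimate is a shared stable component
# plus noise specific to the year (or model, or course) -- an equal
# split chosen only for illustration.
stable = rng.normal(0, 1, n_teachers)
rating_a = stable + rng.normal(0, 1, n_teachers)
rating_b = stable + rng.normal(0, 1, n_teachers)

shift = np.abs(decile(rating_a) - decile(rating_b))
print(f"moved 2+ deciles: {(shift >= 2).mean():.0%}")
print(f"moved 3+ deciles: {(shift >= 3).mean():.0%}")
print(f"largest move: {shift.max()} deciles")
```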
The key takeaway from the final paragraph is this:
Our conclusion is NOT that teachers do not matter. Rather, our findings suggest that we simply cannot measure precisely how much individual teachers contribute to student learning, given the other factors involved in the learning process, the current limitations of tests and methods, and the current state of our educational system.
Let me be blunt. We currently lack accurate statistical methods of isolating teacher effects. The tools available to us cannot justify making decisions about teacher effectiveness solely or predominantly based on student test scores, even when statistically adjusted using a value-added approach. We already knew that we were failing to control for other factors, and the lack of stability of scores makes that clear.
I would argue that it is improper to use the data in any way for a summative judgment, not with the high degree of variance in the results. It is even somewhat dangerous to base further non-quantitative examinations of teacher performance on the value-added results that are currently available.
We as Americans want to rank and compare. We want to have hard numbers. Value-added gives numbers, but they are not "hard" - simply put, they are not stable enough to allow us to rely upon them with any confidence.
All of this is independent of the issue of whether or not the underlying scores of student performance on such tests are an accurate measure of what the student knows or can do. But that is an entirely different can of worms. For now, even assuming that such results are reliable (consistent) and thus allow valid inferences to be drawn (questionable), using value-added methodologies still does not allow us to consistently and accurately determine teacher effects.
So might it be possible to back off our obsession with numbers and consider other, more effective ways of evaluating teacher performance? Or has that possibility already been abandoned? If so, the real impact of our evaluation policies upon our students is something we might very well fear.
I can't answer that.
That I can't answer it scares the hell out of me.