MailBox Monday #3: Tests, Time, Teachers, and Inertia

A Pretty German MailBox

It is a beautiful but cold Monday morning here in Princeton, New Jersey. Well, just barely morning, as I am writing this a shade before noon. But my ebullience arises in part from finally shedding my obsession (as noted in y’day’s post) with reading critiques of the SAT. Earlier today, I gleefully informed the Google bots of my uninterest in several such articles they had lobbed into my feed. Now that attention-polluting tributary can stick to carrying bits about medieval Ireland, Ukraine worries, and whatever Pope Francis has to say, along with always the best things to watch on Netflix, Hulu, Apple TV, etc., which BTW I do not always judge to be the best things when I finally watch them. This is not the place to complain further about the adaptation of the marvelous novel Station Eleven, but we will NOT be inviting Patrick Somerville to the St. Patrick’s Day party this year.

Another reason for my good humor is that it is Mailbox Monday for this 24th of January Jolt. And we had mail. One of my friends, who is the holder of a prestigious chair at a Northeastern university, wrote about his encounter with the blog:

As a work-based learning character, I can’t comment with any intelligence on these matters, but can ask a question.  I have had a longstanding concern with standardized tests like the MCAS (Massachusetts Comprehensive Assessment System) in this state (Massachusetts) not necessarily because the test is substantively bad, but because of its use and interpretation.  Specifically, it takes too much time (relative to time devoted to other learning – and I count the time spent teaching to(for) the test) and is used for decisions aside its remit.  And more specifically, it seems to contribute to the one element of learning the lack of which continues to doom the reflective judgment necessary to run any kind of reasonable society (nb:  ‘it’s my right to refuse a vaccine because it denies me of my liberty!’).  Needless to say, it may take time away from critical reflection!  With that preamble(s) advocated, my question is:  Am I correct that standardized tests, unwittingly or not, deprive students from exposure to the study of critical reflection?*

*I recognize that in some classrooms (in some states), there would be no coverage of critical reflection in any case.  I mean, heck, it’s immoral and anti-religious!

My good friend evokes my gratitude not just because there have been fewer reactions to the blog than desired in order to create the dialogue that was sought, but also because he raises one of the most important issues about testing, especially in the K-12 arena: are teachers and other education officials spending too much time teaching to the test? The specific citation of MCAS is helpful because, as mentioned in a previous post, it’s considered the best K-12 assessment in the country by many observers; California might have a legitimate gripe there. Therefore, if they aren’t doing it in the best way possible, then we might expect that other states are also failing to get the right proportion of preparation for the test. My correspondent also raises the question of what the focus of instruction, and subsequently of examinations, should be in the schools; I’ll come to that in a moment.

MCAS, like every other assessment on the K-12 level, has its critics, who usually seek to dismantle at least some part of the overall framework. Last September, the Massachusetts legislature considered eliminating the requirement to pass all sections of the test in order to graduate from high school. For those readers unfamiliar with this kind of test, each student seeking to graduate sits for eight hours of examination time spread over three days.

MCAS enjoys significant advocacy, despite these criticisms, from important figures such as the current Massachusetts Governor, Charlie Baker, who said of the test: “People can say they don’t like MCAS one way or another, but the simple truth is MCAS plus the financing law that was put in place in the original education reform bill was an enormous success. It gave Massachusetts what most people consider to be the best schools in the country overall and also had a very significant and positive impact on kids and underperforming school districts.”

A very spirited, if not aggressive, back-and-forth about the actual practices and effects of these kinds of tests has been going on for many years, but teaching to the test and the amount of time involved in testing represent perhaps the most significant areas of disagreement. This article, Nine myths about testing, from the time of the Bush administration and No Child Left Behind (NCLB), gives a fair impression of one side, although the defenders of testing in subsequent Democratic administrations were far less forceful. A better source for answering my correspondent’s question can be found in this short article quoting several former ETS colleagues like Randy Bennett and Joanna Gorin. Randy, in my opinion, gets to the heart of the problem with two statements derived from an earlier, larger paper of his that is linked within the article. First, he points out that the problem became worse in the Obama administration when it continued the earlier Bush administration idea of using test results to evaluate teachers. If you tell people that their job performance depends upon how the kids do on the tests, and not upon something that they can do themselves, then naturally in almost every case they’re going to try to figure out some way to improve specifically how the kids do on the test rather than focus more broadly on the overall curriculum. As Randy points out, “The unintended but understandable consequence (of using test scores to evaluate teachers and possibly fire them) was that teachers spent time teaching kids the formats, content, design and layout of the tests to an excessive degree.”

But that’s not the only problem with using test scores to evaluate teachers. While I was at ETS, one of the great advantages for a generalist like myself trying to learn more about how assessment really works was that there were at least a dozen or so excellent presentations on various aspects of the field right there in person every year. One of the most impressive dealt with this specific issue. Ed Haertel

Haertel had also written about this as part of an all-star cast of authors in the appropriately titled paper, Problems with the Use of Student Test Scores to Evaluate Teachers. Here is what they had to say on the subject:

“A review of the technical evidence leads us to conclude that, although standardized test scores of students are one piece of information for school leaders to use to make judgments about teacher effectiveness, such scores should be only a part of an overall comprehensive evaluation. Some states are now considering plans that would give as much as 50% of the weight in teacher evaluation and compensation decisions to scores on existing tests of basic skills in math and reading. Based on the evidence, we consider this unwise. Any sound evaluation will necessarily involve a balancing of many factors that provide a more accurate view of what teachers in fact do in the classroom and how that contributes to student learning.” Emphasis added

The problem of teaching to the test may really be a problem of freaking teachers out as to what will happen if the scores of their students are lower than average. The second comment of Randy’s from the article is pertinent because it relates to whether teaching to the test is good teaching practice: “Teaching to the particular sample of questions included on a test may increase test performance but not increase performance in the larger domain. Teaching to particular test content, the test items themselves, would consequently be poor instructional practice.” What makes this a little more dire as a practice is that some research has shown that teaching to the test in many instances didn’t even improve test performance. The whole area has been so politicized by these two camps of adherents that getting good unbiased studies is very difficult. And relying upon the self-reports of either side plays into their self-interest too much, which leads me to a story of one of the odder incidents of my ETS career.

One of the things that a Chief Learning Officer does is facilitate large-group decision-making. This was good news for me when I got hired because that had been a significant part of my consulting practice before coming to ETS. Therefore, even in situations where the group that needed to make a decision involved outside dignitaries, I would often be called upon to design and then run a session where people with different points of view could try to emerge with some sort of agreement. That’s how I came to be in the Washington DC office of ETS during the latter days of the Bush administration. ETS researchers had failed to find connections between the evaluation of teachers and the test scores of their students. That’s what the results showed, as published in 2010 in an ETS-promoted paper on VAM, or value-added modeling (also known as value-added measurement, value-added analysis and value-added assessment), a method of teacher evaluation that measures the teacher’s contribution in a given year by comparing the current test scores of their students to the scores of those same students in previous school years.

“A review of VAM research from the Educational Testing Service’s Policy Information Center concluded,

VAM results should not serve as the sole or principal basis for making consequential decisions about teachers. There are many pitfalls to making causal attributions of teacher effectiveness on the basis of the kinds of data available from typical school districts. We still lack sufficient understanding of how seriously the different technical problems threaten the validity of such interpretations.” Emphasis added

And ETS wasn’t alone. As reported later in another paper, “…the research community has cautioned against the heavy reliance on test scores, even when sophisticated VAM methods are used, for high stakes decisions such as pay, evaluation, or tenure.” But this meeting, held in a darkened conference room with researchers on one side and Bush administration education department officials both past and present on the other, happened before this research was public. In fact, that was the purpose of the meeting, because the Bush administration was ticked off that ETS, the recipient of some important money through the contract to develop NAEP, was about to publish research showing that the Bush administration’s arguments about how to measure teacher performance were wrong. At a break during the daylong meeting, one of the researchers, visibly perturbed, approached me to ask how this could possibly be resolved. When I went back up into the room I was no longer a neutral facilitator. It wasn’t just because the researchers were colleagues at ETS; it was also a matter of one side having simply a political agenda and the other side being scientists. Sound familiar? We did manage a rapprochement that day that involved tweaks in language and disclaimers about the obvious ongoing nature of research. (That’s always a good hedge when some political body is uncomfortable with your results, since almost all research is always ongoing.) But I had been present at a moment where, unmistakably but somewhat silently, government officials past and present were attempting to get researchers to say something different than the results indicated. Or to say nothing at all. I think this happens just on one side of the political divide, but the event proved a very interesting session in my own education as to the way things really work.

So what about my friend’s question about time? To answer whether eight hours of testing time, plus whatever time is spent going over the material, is a good or bad idea requires knowing whether the test is capturing the broader curriculum or just a subset. In the case of MCAS, the results seem to point to the latter condition. Good test construction takes into account the amount of time the test-taker must spend, because physical factors such as fatigue and mental factors such as concentration could undermine the validity of the test through construct irrelevance. If you plan eight hours of tests over three days, a legitimate grievance might be that part of what you are testing is the endurance capacity of the test-taker, which is “phenomena not included in the definition of the construct,” whether that construct is English or Math or whatever. There’s a trade-off in setting the number of questions and, therefore, the time allotted to a test-taker. The less time, the more difficult it is to get the degree of validity that you want, especially if you are tackling the wider area of the subject being examined. The more time, the greater the possibility that you are advantaging those who possess in greater quantities elements such as physical endurance, attention span, and anxiety control.

The good news is that there is a stream of research that looks at how to glean the right kind of data about someone’s performance and knowledge in a shorter amount of time. But that gets us into questions of whether states and school districts are willing to experiment with these new methods. Another piece of good news is that the federal government does a tremendous job in this realm with the NAEP tests. Their explorations could be adopted by the states. But that would require change, and contemplating the difficulties of change gets me to the point in the letter from my university professor friend about what we teach and test not including critical reflection. If you take time to read the article above about the Massachusetts test and the legislation that was proposed (but I believe not adopted) to scrap passing that exam as a requirement for high school graduation, you can see very clearly the battle lines over what is to be taught in schools. Teachers and to a certain degree parents are lined up on one side, and the business community and to a certain degree politicians are lined up on the other. It is not as neat as that, but that’s a good shorthand for what happens. Business owners understand that they are faced with talent pools that lack the skills they need for their jobs. That may also be a feature of business owners being unwilling to pay wages that would attract people who have the skills, but that’s a whole other topic. Those business leaders, especially in the case of the proposed Massachusetts test reform, don’t want anybody monkeying around with basics like literacy and numeracy. Educational research, however, tells us that there are other elements that we should explore, especially within the new economy, which is somewhat ironic given that the business leaders are so present-oriented.

One of those elements has been referred to as critical thinking for some time, and critical reflection is contiguous to that component. We’ve already talked about conscientiousness, and we could go on about communication skills as also being important. I would like to see a revamping of education, but education is a powerful force of inertia within our society. That inertia runs from the self-interest of everyone as well as a natural resistance to change. I will use two quotes to describe that situation and close out today’s adventure in blogging. The great expert on change, Bas Verplanken, once told me, “People tend to do what they tend to do.” Priceless in understanding life, isn’t it? But what makes the situation in education even more intractable is well explained by a famous quote from Upton Sinclair: “It is difficult to get a man to understand something when his salary depends upon his not understanding it.” You can make that gender-neutral and consider that when it comes to education and its subset of testing, there are many people out there whose salary depends on their not understanding and not acknowledging the need for certain changes. My nonexistent salary no longer imposes such requirements, and that’s why I get to write this blog.


2 thoughts on “MailBox Monday #3: Tests, Time, Teachers, and Inertia”

  1. Marianne Talbot

    Well, what can I say, apart from: ‘twas ever thus. People have been misusing assessment outcomes since the dawn of time, sometimes unwittingly, of course, but not always. The imperfect match between the policy-makers’ stated intentions and the test developers’ principled design is often a yawning gap, although perhaps sometimes it is a little narrower (vocational qualifications, maybe, although they have other assessment ‘issues’?). I would imagine that most test developers are faced with designing assessments that meet multiple, sometimes conflicting, purposes, so all those purposes are highly unlikely to be met in full – perhaps not any of them. It’s just the way it is.

    1. testingapersonalhistory Post author

      So well said. Part of the purpose of this whole blog is to get people to think more about how the hidden world of testing affects them, especially when it’s evident in that misuse of assessments.

