I hope that every reader knows the movie Some Like It Hot. But just in case you missed that 1959 classic, here’s the relevant plot twist. Jack Lemmon’s character, Jerry, not only has to dress up (for a variety of reasons) as a female member of an ‘all-girl band’, but in that mistaken identity also receives a marriage proposal from the multimillionaire Osgood, played by Joe E. Brown. Miranda Corcoran supplies a wonderful summary of the exchange between this ‘couple’ in the final moments of the film:
“as Jerry/Daphne and Osgood discuss their relationship in the front seats of the boat. Osgood tells Daphne, who is still in full drag, that his mother called and that she was so happy to hear about Osgood’s upcoming nuptials that she cried. Osgood goes on to tell his fiancée that his mother wants Daphne to wear her wedding gown for the ceremony, to which Jerry/Daphne responds, “I can’t get married in your mother’s dress. She and I … we’re not built the same way”. Jerry/Daphne then lists all of the reasons they can’t marry. Jerry/Daphne confesses to Osgood that she has deceived him, that she is not a natural blonde, that she’s a heavy smoker, and that she can never have children. Osgood loves the person he knows as Daphne so much that he tells her that none of her confessions will change his feelings for her, and that they can always adopt some children. He even forgives her declaration that she is a loose woman who, for the last few years, has been living with a saxophone player. Exasperated, Jerry/Daphne finally pulls off the blonde wig and announces, “You don’t understand, Osgood; I’m a man!” Osgood simply smiles and responds, “Well, nobody’s perfect”. As the screen fades to black, Osgood grins happily while Daphne/Jerry simply stares ahead in disbelief.”
‘Nobody’s perfect’ is one of the best ‘curtain lines’ in all of cinema. It also can serve as perhaps the foundational truth of testing: there is no perfect test. There is no exam or analysis that can allow a claim of absolute certainty to be made about someone’s knowledge, skill, or ability. No such thing.
Think about that for a moment. When your high school algebra teacher handed you back a paper with a 90 or 80 or 70, did they offer this caveat? How about that teaching assistant in the World Literature class in college? Is such a hedge in large print on the cover of any standardized test you ever took? No. And yet if we are to recover testing for the purposes that best serve all of us, then we have to start with its inherent imperfection. Yesterday’s post talked about true stories, realities, and illusions in testing, but only scratched the surface. There will be much more of that in later posts this month. But this truth — that measurement by a test can never be perfect — deserves our attention right away.
One writer who offers a great deal of insight about this aspect of measurement is Derek C. Briggs, who recently wrote a book called Historical and Conceptual Foundations of Measurement in the Human Sciences: Credos and Controversies.
In educational measurement (the fancy name for the ‘science of testing’), there are milestones and mirages, and they can be difficult to tell apart.
Part of the confusion is that measurement and quantification were robustly established in the physical sciences long before anything comparable existed in education. Educational measurement has never caught up in precision, yet its introduction in this country led many to believe that test scores enjoyed the same certainty as the numbers of the physical sciences. They don’t. If the altimeter on a plane or the speedometer on a car had as much margin of error as your typical test score, then crashes would be routine.
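To make that margin of error concrete, here is a minimal sketch of the standard error of measurement from classical test theory (SEM = SD × √(1 − reliability)). The score scale and reliability below are hypothetical figures chosen only for illustration, not the statistics of any actual test.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: scores reported on a scale with SD = 100, reliability = 0.90
sem = standard_error_of_measurement(sd=100.0, reliability=0.90)

# A rough 95% band around an observed score of 600 (about +/- 1.96 * SEM)
observed = 600
low, high = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.1f}; a 600 is consistent with roughly {low:.0f} to {high:.0f}")
```

Even with a quite respectable reliability of 0.90, the band spans well over a hundred points. An instrument that vague on an airplane’s control panel would never be certified.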
In his first chapter, Briggs gets at the nub of the problem by quoting Edward L. Thorndike, one of the pioneers of educational testing:
“Whatever exists at all exists in some amount. To know it thoroughly involves knowing its quantity as well as its quality. Education is concerned with changes in human beings; a change is a difference between two conditions; each of these conditions is known to us only by the products produced by it—things made, words spoken, acts performed, and the like. To measure any of these products means to define its amount in some way so that competent persons will know how large it is, better than they would know without measurement. To measure a product well means so to define its amount that competent persons will know how large it is, with some precision, and that this knowledge may be conveniently recorded and used. This is the general Credo of those who, in the last decade, have been busy trying to extend and improve measurements of educational products.” (Thorndike, 1918, 16)
“With some precision.” Fair enough, but Briggs continues the thought by wondering if “there are things that might not be measurable.”
He proceeds to offer four definitions of measurement; the last one is central to our discussion:
- “Measurement is the process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity.”
Reasonably, not perfectly, not absolutely! If what you are measuring is a quantity (how many did you get right on the quiz?), then you might get closer to perfect in that scheme. Whether those were the questions that best tested someone’s knowledge of the chosen construct is another matter. And when it comes to psychological attributes being measured, quantity alone will not suffice. We need to look at quality: does your essay answer suggest you understand “the goals of U.S. policymakers in major international conflicts, such as the Spanish–American War, World Wars I and II, and the Cold War, and how U.S. involvement in these conflicts has altered the U.S. role in world affairs”? (A real AP History question from 2012.) How could any test measure such understanding perfectly?
Another point Briggs makes that concerns all of our personal histories of testing is that measurement exists only within the context in which its parameters and tools were created. He offers the example of telling time: “Our understanding that time is additive—that five minutes is the sum of 300 seconds and always means the same thing whether it passes in the morning, afternoon or evening—is inextricable from our ability to measure it. The point here is that even for the most canonical examples of extensive attributes, measurement cannot be separated from human culture and conventions. Nonetheless, when measurement involves extensive attributes, it can be recognized as a matter of direct comparison between two instances of the same quantity (e.g., through use of a ruler, a stopwatch, or a balance beam).”
But educational measurement at its best is about quality, not just quantity. Measurement is imperfect but not arbitrary, yet it is affected by “human culture and conventions.” The Spanish–American War can be understood in many different ways that depend upon the culture and conventions of the testing organization, the textbook publisher, the AP teacher, and, of course, the student test-taker.
Why is this important in stitching together and viewing our histories of testing? Because such an understanding invites us to question the measures rather than automatically accepting them.
Briggs notes that the fundamental purposes of measurement are “(1) to reduce our uncertainty about the quantity value of a targeted attribute, and (2) to report a quantity value that can be generalized beyond the specific and local implementation of the measurement procedure.” Conversely, pretensions to certainty are foolhardy. We are never fully certain of a measurement; for example, there may be undiscovered variables interfering with our readings. Measurement is never perfect.
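“Reduce,” not “eliminate.” Under idealized statistical assumptions (independent errors with a common spread, themselves a human convention), repeating a measurement and averaging shrinks the uncertainty by the square root of the number of repetitions, but it never reaches zero. A toy sketch, with made-up numbers:

```python
import math

ERROR_SD = 8.0  # hypothetical spread (standard deviation) of a single measurement

for n in (1, 4, 16, 64):
    # Standard error of the mean of n independent measurements: sd / sqrt(n)
    se = ERROR_SD / math.sqrt(n)
    print(f"{n:2d} measurements -> standard error {se:.2f}")
# 8.00, 4.00, 2.00, 1.00: uncertainty shrinks but never hits zero,
# and any undiscovered interfering variable isn't captured at all.
```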
But measurement is useful. Bob Mislevy, in his paper on tests as structured arguments, notes that Harold Gulliksen, another pioneer of educational measurement (or psychometrics, to use its even fancier name), described “the central problem of test theory” as “the relation between the ability of the individual and his [or her] observed score on the test.” They are NOT the same thing: your ability is NOT your score and vice versa. As Bob notes in that paper, test theory sought to make better and better sense of “the study of the relationship between responses to a set of test items and a hypothesized trait (or traits) of an individual as a problem of statistical inference.” Okay, that’s a bit of a dense quote by Charles Lewis, but what it means is that the best in testing, of which Bob is certainly one, will tell you again and again that what someone answers on a test is not a perfect representation of what they know or can do in whatever the subject of that test is, whether it’s English Lit or Environmental Science, Fashion Marketing or Nuclear Fusion.
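To see the distinction in miniature: classical test theory models an observed score as a true score plus random error, X = T + E. The simulation below is a hedged sketch with invented numbers (the ability, the error spread, and the 0–100 scale are all assumptions for illustration), not a model of any real exam.

```python
import random

random.seed(7)  # make the illustration repeatable

TRUE_SCORE = 75.0  # hypothetical fixed "ability" on a 0-100 scale
ERROR_SD = 5.0     # hypothetical standard error of measurement

def observed_score(true_score: float, error_sd: float) -> float:
    """Classical test theory: X = T + E, with E drawn from Normal(0, error_sd)."""
    x = true_score + random.gauss(0.0, error_sd)
    return max(0.0, min(100.0, x))  # clamp to the reporting scale

# One fixed ability, five sittings, five different observed scores
sittings = [round(observed_score(TRUE_SCORE, ERROR_SD)) for _ in range(5)]
print(sittings)  # five scores scattered around 75, none of them "the" ability
```

Every number printed is an estimate of the same unchanging ability; the score is evidence about the trait, not the trait itself.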
The designers of a test know this, and they do their best to make sure the test is valid.
Validity is what Sam Messick (another in today’s parade of testing pioneers) described as “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (Messick, 1989). Two words jump out from that definition because they suggest a certain amount of wiggle room, a caution against expecting certainty: evaluative and judgment. Commentators on sporting programs use the phrase ‘judgment call’ frequently to indicate a call that could have gone either way depending upon how the referee saw things. In other words, it’s a call that cannot be perfect.
The word evaluative evokes the notion of estimating, appraising. In such activities, we strive to be as exact as we can, but no one believes that an appraisal is perfect every time.
Assessment validity is the extent to which a test measures what it is supposed to measure. The Standards for Educational and Psychological Testing (2014) defines validity as the “degree to which evidence and theory support the interpretations of test scores for proposed uses of tests.” Again two words jump out: extent and degree. Extent is the limit to which anything reaches. No test reaches the 100% validity mark, especially if it relies upon behaviors that are supposed to be markers of some underlying element of a human, such as intelligence, that is impossible to measure directly. Degree is a point on a scale, and both that mark and the scale are constructed by human beings, which suggests again the impossibility of perfection.

Which leads me back to my earlier question: why don’t we ask those who give us tests to provide more information on their formation? How was the construct determined? Who decided the complex of knowledge, skills, or other attributes to be assessed? How was it concluded that the behaviors or performances that this test is supposed to elicit had a connection to those constructs? Who figured out that the tasks or situations on the test would elicit the complementary actions that would allow for a valid judgment of the designated knowledge, skill, or ability? The critics of testing who wish for the end of its dominion or call for its outright abolition might better focus on these questions, to improve the validity of tests, especially those that have significant consequences.

Tomorrow, we’ll talk more about validity, but I warn you that in all of these posts it would be folly to expect that what I am expressing is in any way perfect. Nobody’s perfect, particularly me.
I agree that no test is perfect: every test will always be a compromise between competing priorities (different priorities for the various parties involved, as previously alluded to), with the end result chosen from a range of options. I do, however, think that test developers should be able to justify their choices, explaining why they are the ‘best’ or perhaps ‘fairest’ manifestation for that assessment: why that type or mode of assessment, why those specific questions and formats, why that order or sequence, why those answers are most appropriate (valid)? For example, an A-level subject might have three assessments: a two-hour written paper combining a section of multiple-choice questions with a section of short-answer questions, a 90-minute practical test, and another two-hour written exam in which candidates must write two essays in response to two of six possible questions. I hasten to add that I have made this example up, but it is not untypical. Why use this combination of assessment formats, question types, and response frameworks? It’s probably not perfect, but by combining a range of formats, does it make the overall assessment fairer, maybe more balanced? Or is it just the least-worst option?! That’s potentially quite a long way from perfect!