If Only the Associated Press Could Call All Our Races: Claims Part II

An Associated Press staff member reading copy from the election tabulator in November 1936. Credit: Associated Press

Did you realize that the Associated Press, which most of us know as the AP, called 7000 races on election night in 2020? And that they have a remarkable, even extraordinary, record of getting them right since the 1930s? NPR, which uses the AP for its own election calls, says this is because that organization has a record of “precision and calculation.” (BTW “The AP never did call the 2000 [US Presidential] election; the Supreme Court did.”) The AP isn’t counting the votes; they look at the evidence, the data, analyze it, and make a… claim.

As noted in the last post on the Testing: A Personal History blog, the purpose of any test is to make a claim, but the life of claims, their utility and value, is much broader than tests. Indeed, one of the people who taught me the most about claims, Bob Mislevy, pointed out that a great deal of the foundation for applying them in the world of educational measurement came from such disparate domains as logic, criminal courts, and probability.

What connects all of them and ties testing in as well? Evidence. John Henry Wigmore pioneered the analysis of legal evidence in trials in order to make claims of guilt or innocence. Stephen Toulmin “sought to develop practical arguments which can be used effectively in evaluating the ethics behind moral issues.” David Schum studied the “properties, uses, discovery and marshaling of evidence in probabilistic reasoning” for professionals such as intelligence analysts. I wish I’d known all of them, but Schum seems particularly interesting since one of his lines of evidence was “the use of love letters as crucial evidence in a murder trial.”

But I was talking about testing, wasn’t I? And Bob Mislevy wrote that when it comes to that field, “Our design framework is based on the principles of evidentiary reasoning and the exigencies of assessment production and delivery. Designing assessment products in such a framework ensures that the way in which evidence is gathered and interpreted bears on the underlying knowledge and purposes the assessment is intended to address.”

Take out the word ‘assessment’ and Bob could be talking about what the AP does in calling elections: they pay a great deal of attention to the way in which evidence is gathered and interpreted before making a claim. They care very much that each one of their claims is valid, just as the test makers do at places like College Board, ETS, National Board of Medical Examiners, ACT, and so many other testing organizations. But none of them can claim the same level of validity as the Associated Press: almost no blown calls in 7000 claims.

One of the reasons for that disparity is that counting votes is a lot less complicated a claim than measuring a person’s ability in critical thinking, to use one example. Despite what some people say, a vote is a vote. We can’t even agree as a country on what constitutes critical thinking or, as Bob once put it in a paper, “the nature of (such a) proficiency and ways it is evidenced.” And that’s only the start of the problem, because then we would have to agree on a way of asserting with validity the level of critical thinking someone had when we wanted to make such a claim. Anybody can make a claim; making a claim that is valid is something entirely different and worth wondering about when considering our own personal history of testing.

Validity is the “cardinal virtue in assessment”, noted Bob and two other former colleagues, Linda Steinberg and Russell Almond, in 2003. Extending validity arguments into tests that can show what we know remains the main concern of test makers like Caroline Wylie and Elizabeth Stone. But how many test-takers (or teachers or superintendents or parents or admissions counselors) would actually be able to explain what a validity argument is? I owe a debt to David Slomp, who after reading our previous post about claims sent me a link to one of his publications, which offers a superb characterization of a validity argument. (My former colleague in ETS’s R&D division, Michael Kane, who just happens to be one of the greatest validity theorists in the world, describes a validity argument as the way in which we confirm an interpretation or use of scores that result from a test.) A validity argument will “evaluate the plausibility of the claims based on the scores.” A validity argument will emerge from a systematic “approach” focused on determining whether a sufficient body of evidence exists to justify the use of a test for a particular purpose. Such arguments are critical to the construction of tests, and they involve the claims within the claim: the ways in which a test maker ensures that a test is appropriate for its stated purposes.

Just because someone created a test (to paraphrase David’s chart) doesn’t mean that

it’s based on a “robust, research-informed construct”,
captures a balanced and representative sample of that construct,
will produce scores that accurately reflect test takers’ performance,
will represent results that a test taker “would be expected to obtain over multiple similar tasks completed in multiple assessment sessions”,
or shows us how a test taker would be expected to perform in the relevant real-life contexts associated with the construct.

Sounds pretty complicated, doesn’t it? How many tests that you took in high school or college do you think were created using a validity argument that supplied the points in the list immediately above? We might shrug our shoulders and say very few, but that it doesn’t matter because those tests were for relatively low stakes. But didn’t those tests determine your GPA and its relative ranking among the GPAs of your classmates, a source of many claims about you? And haven’t admissions officers in recent years signaled their intention to pay more attention to GPA and less to standardized admission tests when making their claims about who gets admitted to elite colleges? Maybe the lack of a validity argument for many of those tests mattered more than any of us realized. (Freddie deBoer makes a good argument that we pay too much attention to the admissions process for elite colleges in our higher education conversations, but that’s a topic for another day.)

My intention is not to inflame further enmity toward testing. Instead, looking at claims and getting others to do the same is a way of improving the way in which tests are used. David Slomp, in the paper linked above, also cites a finding by Ydesen & Bomholt in a 2020 paper that nicely summarizes one of the schisms involving testing: “educators frame assessment as a tool for supporting teaching and learning, whereas the general public, politicians, and measurement specialists see large-scale assessment as a tool for surveillance and accountability.” Those two authors get at a philosophical and emotional dimension of our reactions to testing when they reference Onora O’Neill, who in the Reith Lecture in 2002 identified one of the problems with the way in which our society has moved towards accountability, and towards accountability that is based partly on the results of tests. “Our revolution in accountability has not reduced attitudes of mistrust, but rather reinforced a culture of suspicion. Instead of working towards intelligent accountability (emphasis added) based on good governance, independent inspection and careful reporting, we are galloping towards central planning by performance indicators, reinforced by obsessions with blame and compensation. This is pretty miserable both for those who feel suspicious and for those who are suspected of untrustworthy action – sometimes with little evidence.”

Pretty miserable would seem to sum up the situation of testing in education and employment right now. Many parents and students feel that the claims made by tests are unfair and even oppressive. Many educators and psychometricians feel misunderstood and misrepresented regarding their work.

Perhaps naively, my experience at ETS amongst many of the greatest experts in educational measurement in the world persuaded me that if more people actually thought of each test as wanting to make a particular claim about the test taker, they might then ask more questions as to whether that claim was being made in a way that was valid, reliable, and fair. They might even question more effectively whether a test was the best way to make that claim. They might even question the claim. Some of this does occur, but often without either the knowledge that would make such protests meaningful or a substantial suggestion as to what we need to do in situations where claims are important. I think most of us want people who are driving to have passed a driving test at some point. We want our air traffic controllers to be certified by means of a test. Same goes for doctors, psychologists, civil engineers, even accountants. But we should also want to know more about how and why those claims are made. In that respect, part of the anti-testing agitators’ complaints makes a great deal of sense to me.

Discomfort with the structures deciding what claims to make extends across our lurid political spectrum. Blake Smith notes in a recent essay that both Christopher Lasch (right-wing deity) and Michel Foucault (left-wing deity) criticized the way that “elites use networks of expertise and injunctions to moral ‘liberation’ to strengthen their domination.” In other words, if elites are the ones deciding what claims to make and how to make them, then are they rigging a game in which the rest of us are subject to their decisions? BTW, pretending that elites do not set the most important claims we test, or worse, that there are no elites, is foolish.

In the next installment of TAPH, I’m going to continue looking at claims but from a very personal perspective: claims that are made as to whether someone has a learning difference.