Validity is an Imperfect Measurement – Testing: A Personal History

Some things bear repeating, like this line from a November post on this blog:

“Validity is the ‘cardinal virtue in assessment,’ noted Bob and two other former colleagues, Linda Steinberg and Russell Almond, in 2003.” There are some disagreements about validity among psychometricians; some think that the test itself should be tested for validity, not just its score. Harvey Goldstein articulates the protest well at this link:

“After all, if we make claims to be able to measure some very subtle mental processes, why do we seem to have given up on measuring the validity of those same instruments? In fact, it has not always been so. Historically, there have been attempts to produce ‘validity coefficients’ for tests, based upon association measures such as correlations with other tests or judgements made concurrently or in the future of an individual’s life. I will refer to these as ‘associational validity’ and they include ‘concurrent validity’, ‘predictive validity’, etc. The striking thing about the current consensus definition is that psychometrics appears to have given up on attempts to provide quantitative measures of validity.”
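For readers who have never seen one computed, the kind of "validity coefficient" Goldstein describes can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the Pearson correlation as the association measure (one common choice, not necessarily the one Goldstein has in mind), and the scores and later grades below are invented for the example, not taken from any real test.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: six students' test scores and a criterion measured later
# (for example, a first-year grade average). Both lists are made up.
test_scores = [52, 61, 70, 74, 83, 90]
later_grades = [2.1, 2.6, 2.9, 3.0, 3.4, 3.8]

# A "predictive validity" coefficient in the associational sense:
# how strongly the test score is associated with the later outcome.
predictive_validity = pearson_r(test_scores, later_grades)
print(f"predictive validity coefficient: {predictive_validity:.2f}")
```

A coefficient near 1 would mean the test ranks students much as the later criterion does; a coefficient near 0 would mean it tells us little about that criterion. It says nothing, by itself, about whether a particular use of the scores is justified.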

That argument about whether validity is a measure of a score or a test or both is for others much more expert than I am, but the idea of tests that appear in their design to lack validity is of great interest.

If you Google “the worst educational tests”, this link to a 2018 USA Today article about the disparity in test performance by state pops up. Massachusetts is first, and that interested me because the state spends a great deal of time and money (redundant, I know: time is $$$) on the design of its own statewide tests. The article notes appropriately that “Parent education levels, for example, which are among the best predictors of student success, are among the highest in Massachusetts.” However, Massachusetts beats out three other states with greater parent education levels. I wonder if the superiority of its assessment design plays a role. California ranking 35th might argue against that idea, as they obsess about test design in Sacramento. But Massachusetts engages independent assessments of its Comprehensive Assessment System, or MCAS. Those reports at this link suggest such attention might make a difference.

Such studies contrast with Jake Jacobs’ impassioned argument here that many of the standardized tests used in K-12 lack validity. He writes that “created by corporations like Pearson, Questar, or American Institutes for Research (AIR), the tests and the scoring are built upon incomprehensible formulas, broad presumptions, and subjective, developmentally inappropriate benchmarks which prominent statistical organizations repeatedly urged should not be used in high-stakes decisions.” Is Jacobs right? Do those test scores lack validity? My experience tells me four things:

1. Such tests and their scores have a high measure of validity, but they are not perfect by any means.
2. They are likely much better designed on average than the tests given by classroom teachers.
3. They ARE misused flagrantly when states and cities employ the scores to rate teachers. (Here Jacobs and I agree.)
4. The benchmarks came from other educators and not out of some mysterious source.

But validity is a measurement and, like all measurements, is subject to error; like all educational measurements, it is subject to greater error than physical measurements. My bathroom scale may be off by a few ounces, but its validity is stronger than that of most essay sections of high school exams.

And that’s not a specific knock on those exams. Michael Kane, as summarized in this article by Stuart Shaw and Victoria Crisp, provided a dandy dissection of all that must go into considering the validity of a test and its scores: “Kane (2006) perceives the validation process as the assembly of an extensive argument (or justification) for the claims that are made about an assessment. According to Kane, “to validate a proposed interpretation or use of test scores is to evaluate the rationale for this interpretation or use. The evidence needed for validation necessarily depends on the claims being made. Therefore, validation requires a clear statement of the proposed interpretations and uses” … Kane proposed that any validation activity should necessarily entail both an interpretive argument (in which a network of inferences and assumptions which lead from scores to decisions is explicated) and a validity argument (in which adequate support for each of the inferences and assumptions in the interpretive argument is provided and plausible alternative interpretations are considered).”

My bathroom scale (which regularly disappoints me with its arrogant accuracy) is not anywhere near that complicated. As Stuart Shaw and Victoria Crisp conclude in the above article: “Messick (1989) argued that “validity is an evolving property and validation is a continuing process” (p.13). The contemporary conceptualisation of validity cannot be considered definitive, but as the current most accepted notion. This, and particularly the role of consequences as part of validity, is likely to continue to evolve over time.”

My bathroom scale is not evolving.

Validity is a measurement. Measurement in education is imprecise. But knowing the relative validity of a test score is critical because that information allows us to try to make a better measurement and to avoid incorrect inferences. Robert Axelrod once wrote, citing Howard Raiffa, that “A way to measure the value of any piece of information is to calculate how much better you could do with the information than without it.” We should want more information about the validity of our tests at every level of our education system.

3 thoughts on “Validity is an Imperfect Measurement”

Marianne Talbot January 10, 2022 at 6:20 pm

This brings me again to the questions “validity for whom?” and “validity at what level of granularity?”. The ‘for whom’ we have rehearsed already to some extent, but it is interesting to consider the granularity of validity. Assessment might be valid across the entirety of a qualification (such as across the A level with three assessment components I mentioned in an earlier comment – although the three components disguise even more types or modes of assessment within each of them), or within a component of that qualification (for example, MCQs combined with short-answer items in one exam paper), or within an item (the stem or scenario or stimulus material, the actual question, the key, the distractors, the space given for candidates to respond, any images used, the font, the size of the type…). I believe good assessment needs to consider validity at all levels of granularity – what might appear to be valid at one level really might not be at another level. See Kane’s work on item design and test design/review of the field – I would add qualification design as an additional layer to be considered. In “Current Concerns in Validity Theory” (2001; available at https://doi.org/10.1111/j.1745-3984.2001.tb01130.x ), Kane articulates validity as follows:

“Validity is not a property of the test or of the test scores, but rather an evaluation of the overall plausibility of a proposed interpretation or use of test scores […that…] reflects the adequacy and appropriateness of the interpretation and the degree to which the interpretation is adequately supported by appropriate evidence”.

Furthermore, he says:

“Validation includes the evaluation of the consequences of test uses […and…] inferences and any necessary assumptions are to be supported by evidence; and plausible alternative interpretations are to be examined”

But then he also says: “Validation is difficult at best, but it is essentially impossible if the proposed interpretation is left unspecified”. His concluding remarks are an excellent synopsis of how we have got to where we are…
Pingback: Mailbox Monday: Our Faithful Correspondents Communicate – Testing: A Personal History
Pingback: What Do the SATs Measure? – Testing: A Personal History
