Some things bear repeating, like this line from a November post on this blog:
“Validity is the ‘cardinal virtue in assessment,’ noted Bob and two other former colleagues, Linda Steinberg and Russell Almond, in 2003.” There are some disagreements about validity among psychometricians; some think that the test itself should be tested for validity, not just its score. Harvey Goldstein articulates the protest well at this link:
“After all, if we make claims to be able to measure some very subtle mental processes, why do we seem to have given up on measuring the validity of those same instruments? In fact, it has not always been so. Historically, there have been attempts to produce ‘validity coefficients’ for tests, based upon association measures such as correlations with other tests or judgements made concurrently or in the future of an individual’s life. I will refer to these as ‘associational validity’ and they include ‘concurrent validity’, ‘predictive validity’, etc. The striking thing about the current consensus definition is that psychometrics appears to have given up on attempts to provide quantitative measures of validity.”
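For readers who want to see what one of those "validity coefficients" might look like, here is a minimal sketch in Python. It assumes the coefficient is a plain Pearson correlation between test scores and some criterion measure. Goldstein mentions correlations as one kind of association measure, so this is an illustration and not his method. The function name and every number below are invented for the example.

```python
from math import sqrt

def validity_coefficient(scores, criterion):
    """Pearson correlation between test scores and a criterion measure."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_c = sum(criterion) / n
    covariation = sum((s - mean_s) * (c - mean_c) for s, c in zip(scores, criterion))
    spread_s = sum((s - mean_s) ** 2 for s in scores)
    spread_c = sum((c - mean_c) ** 2 for c in criterion)
    return covariation / sqrt(spread_s * spread_c)

# Hypothetical data, made up for illustration only.
test_scores = [52, 61, 67, 70, 74, 80, 85, 91]
# A judgement made at about the same time (say, a teacher's rating).
concurrent_criterion = [55, 58, 70, 66, 78, 77, 88, 90]
# A measure taken later in the same students' lives (say, a later grade average).
later_criterion = [2.1, 2.4, 2.9, 2.6, 3.0, 3.4, 3.3, 3.8]

concurrent_validity = validity_coefficient(test_scores, concurrent_criterion)
predictive_validity = validity_coefficient(test_scores, later_criterion)

print(f"concurrent validity coefficient: {concurrent_validity:.2f}")
print(f"predictive validity coefficient: {predictive_validity:.2f}")
```

A coefficient near 1 would mean the test ranks students much as the criterion does; a coefficient near 0 would mean the two have little to do with each other. Whether a single number like this captures validity is exactly the point in dispute.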
That argument about whether validity is a measure of a score or a test or both is for others much more expert than I am, but the idea of tests that appear in their design to lack validity is of great interest.
If you Google “the worst educational tests”, this link to a 2018 USA Today article about the disparity in test performance by state pops up. Massachusetts is first, and that interested me because they spend a great deal of time and money (redundant, I know: time is $$$) on the design of their own statewide tests. The article notes appropriately that “Parent education levels, for example, which are among the best predictors of student success, are among the highest in Massachusetts.” However, Massachusetts beats out three other states with greater parent education levels. I wonder if the superiority of their assessment design plays a role. California ranking 35th might argue against that idea, as they obsess about test design in Sacramento. But Massachusetts commissions independent assessments of its Comprehensive Assessment System, or MCAS. The reports at this link suggest such attention might make a difference.
Such studies contrast with Jake Jacobs’ impassioned argument here that many of the standardized tests used in K-12 lack validity. He writes that “created by corporations like Pearson, Questar, or American Institutes for Research (AIR), the tests and the scoring are built upon incomprehensible formulas, broad presumptions, and subjective, developmentally inappropriate benchmarks which prominent statistical organizations repeatedly urged should not be used in high-stakes decisions.” Is Jacobs right? Do those test scores lack validity? My experience tells me four things:
- Such tests and their scores have a high measure of validity, but they are not perfect by any means
- They are likely much better designed on average than the tests given by the classroom teachers
- They ARE misused flagrantly when states and cities employ the scores to rate teachers (Here Jacobs and I agree.)
- The benchmarks came from other educators and not out of some mysterious source
But validity is a measurement and, like all measurements, it is subject to error; like all educational measurements, it is subject to greater error than physical measurements. My bathroom scale may be off by a few ounces, but its validity is stronger than that of most essay sections of high school exams.
And that’s not a specific knock on those exams. Michael Kane, as summarized in this article by Stuart Shaw and Victoria Crisp, provided a dandy dissection of all that must go into considering the validity of a test and its scores: “Kane (2006) perceives the validation process as the assembly of an extensive argument (or justification) for the claims that are made about an assessment. According to Kane, “to validate a proposed interpretation or use of test scores is to evaluate the rationale for this interpretation or use. The evidence needed for validation necessarily depends on the claims being made. Therefore, validation requires a clear statement of the proposed interpretations and uses” …
Kane proposed that any validation activity should necessarily entail both an interpretive argument (in which a network of inferences and assumptions which lead from scores to decisions is explicated) and a validity argument (in which adequate support for each of the inferences and assumptions in the interpretive argument is provided and plausible alternative interpretations are considered).”
My bathroom scale (which regularly disappoints me with its arrogant accuracy) is not anywhere near that complicated. As Stuart Shaw and Victoria Crisp conclude in the above article: “Messick (1989) argued that “validity is an evolving property and validation is a continuing process” (p.13). The contemporary conceptualisation of validity cannot be considered definitive, but as the current most accepted notion. This, and particularly the role of consequences as part of validity, is likely to continue to evolve over time.”
My bathroom scale is not evolving.
Validity is a measurement. Measurement in education is imprecise. But knowing the relative validity of a test score is critical because that information allows us to try to make a better measurement and to avoid incorrect inferences. Robert Axelrod once wrote, citing Howard Raiffa, that “A way to measure the value of any piece of information is to calculate how much better you could do with the information than without it.” We should want more information about the validity of our tests at every level of our education system.
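The Axelrod and Raiffa idea can be put into a few lines of arithmetic. What follows is only a toy sketch: a school must decide whether to give a student extra support, and a test gives a signal that matches the student's true situation with some probability. That probability is a crude stand-in for how trustworthy the score is, not a real validity measure. The payoffs, the prior, and all names are hypothetical, and the sketch assumes exactly two possible states.

```python
def best_expected_payoff(weights, payoffs):
    """Expected payoff of the best action, given a weight for each state."""
    return max(
        sum(weights[state] * payoffs[action][state] for state in weights)
        for action in payoffs
    )

def value_of_information(prior, payoffs, accuracy):
    """How much better we do deciding with the test signal than without it.

    Assumes two states; the signal names the true state with probability
    `accuracy` and the other state otherwise.
    """
    without_info = best_expected_payoff(prior, payoffs)
    with_info = 0.0
    for signal in prior:
        # Joint probability of each state together with this signal.
        joint = {
            state: prior[state] * (accuracy if state == signal else 1 - accuracy)
            for state in prior
        }
        # Pick the best action for this signal, then add its contribution.
        with_info += best_expected_payoff(joint, payoffs)
    return with_info - without_info

# Hypothetical numbers, chosen only to make the arithmetic visible.
prior = {"needs_support": 0.3, "on_track": 0.7}
payoffs = {
    "intervene": {"needs_support": 10, "on_track": -4},
    "do_nothing": {"needs_support": -10, "on_track": 0},
}

for accuracy in (0.5, 0.8, 1.0):
    gain = value_of_information(prior, payoffs, accuracy)
    print(f"signal accuracy {accuracy:.1f}: value of information {gain:.2f}")
```

In this toy setup a coin-flip signal is worth nothing, and the gain grows as the signal becomes more trustworthy, which is the sense in which knowing more about a score's validity lets us do better with it.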