Is it overkill to devote the daily January jolt of this blog to validity yet again? No! Why do we take a test in the first place? So that we can make a claim about ourselves, or so that others can feel comfortable making a claim about us, whether it is that we have learned something, can perform a particular job, or have a significant chance of succeeding in a course of study. There is a lot riding on such claims even at the earliest levels of education, and, therefore, of assessment.
This paper provides a vivid example: in this country, the English test for second grade had poor validity (50%) and the third grade test had very poor validity (25%). Being at ETS for almost two decades gave me an understanding of the difference that learning English (or, more importantly, being recognized by some authority as fluent) makes in someone’s life. Colleagues from Asia, Africa, South America, and even Europe told me again and again about the economic value of being officially recognized as a competent speaker of English. But in the example above, the tests do a poor or very poor job of recognizing someone’s skill. Those kids are hamstrung in their learning of English from the very start because the tests that should be part of their overall learning process lack validity.
What makes for such poor validity in a test? Design is critical but hubris plays a big role too: too many test givers fail to acknowledge the fallibility of their work. For those of us NOT experts in testing, the question of how to determine if a test is valid and reliable isn’t straightforward. This paper gives the story in shorthand.
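For readers who want a feel for what "determining" these properties can look like, here is a minimal sketch, in Python, of the kind of arithmetic psychometricians use: test-retest reliability as the correlation between two administrations of the same test, and criterion validity as the correlation between test scores and a later outcome such as course grades. All of the data and names below are invented for illustration; real validation studies involve far more than one coefficient.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: six students sit the same test twice
# (test-retest reliability), and later earn course grades
# (an external criterion for validity).
test_1 = [52, 61, 70, 75, 83, 90]
test_2 = [55, 60, 68, 77, 80, 92]          # retest: similar rank order
grades = [2.1, 2.4, 3.4, 2.9, 3.6, 3.8]    # external criterion

reliability = pearson(test_1, test_2)  # consistency of the scores
validity = pearson(test_1, grades)     # how well scores track the criterion
print(f"reliability ≈ {reliability:.2f}, validity ≈ {validity:.2f}")
```

The point of the sketch is only that reliability and validity are separate questions answered with separate evidence: a test can correlate nearly perfectly with itself and still correlate weakly with anything we care about.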
But for others, it’s not about overconfidence but rather a failure to see the role that properly designed assessment plays in learning. This piece cogently expresses the argument for better-designed tests in the classroom:
“Assessment needs to be an inherent part of the teaching/learning process. For those who follow ‘the learners way’ it is focused on the needs of the learner. It is a tool for an effective teacher to use on a regular basis to check their learner is headed in the right direction. A feedback loop that guides their thinking and keeps track of their progress. It never becomes a thing that is done at the end of a unit. It is instead the sum total of every evaluation that the teacher and more importantly the student makes as they engage with their learning.”
As Florida’s Department of Education put it, validity is the key part of assessment, and they talked about my bathroom scale in explaining the concept:
“Validity refers to the accuracy of an assessment — whether or not it measures what it is supposed to measure. Even if a test is reliable, it may not provide a valid measure. Let’s imagine a bathroom scale that consistently tells you that you weigh 130 pounds. The reliability (consistency) of this scale is very good, but it is not accurate (valid) because you actually weigh 145 pounds (perhaps you re-set the scale in a weak moment)! Since teachers, parents, and school districts make decisions about students based on assessments (such as grades, promotions, and graduation), the validity inferred from the assessments is essential — even more crucial than the reliability.”
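The scale analogy lends itself to a tiny illustration. The sketch below (all numbers and thresholds invented for the example) treats reliability as the spread of repeated readings and validity as the bias between their average and the true value:

```python
import statistics

def assess_scale(readings, true_weight, spread_tol=1.0, bias_tol=2.0):
    """Judge a scale's reliability (consistency) and validity (accuracy).

    reliable: readings cluster tightly (small standard deviation)
    valid:    readings center on the true weight (small average bias)
    Tolerances are arbitrary, chosen just for illustration.
    """
    mean = statistics.mean(readings)
    spread = statistics.stdev(readings)
    bias = mean - true_weight
    return {
        "reliable": spread <= spread_tol,
        "valid": abs(bias) <= bias_tol,
        "spread": spread,
        "bias": bias,
    }

# The bathroom scale from the quote: it consistently reads about
# 130 pounds, but the person actually weighs 145.
readings = [130.1, 129.9, 130.0, 130.2, 129.8]
result = assess_scale(readings, true_weight=145)
print(result)  # reliable (tiny spread) but not valid (bias of -15)
```

Nothing about the consistency of the readings rescues the 15-pound bias, which is exactly the Department of Education’s point.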
What would we do if we wanted tests with greater validity in their scores? Answering that question completely is beyond my expertise, but there are a few things that you can pick up by turning to the experts. Many states, like Kansas in this handout, offer pointers and even prescriptions for greater validity in classroom tests. Part of the problem starts before the test has even been considered. If the teacher hasn’t begun the design of their overall instruction, let alone the design of any assessment, with clear learning objectives, then the assessment is likely to have problems with validity. Renowned instructional pioneer Roger Schank put it concisely: “A well-written objective provides extremely strong clues about how to assess it.” Similarly, if the teacher is simply handed objectives and his or her supervisor never checks to see whether they fully understand what is to be taught, then the assessments they craft may also suffer from poor or no validity.
Another problem is using the wrong methodology for the particular claim that you want to make as noted in this guideline for faculty at Kansas State University on pitfalls of assessment.
Consider this research finding about certain types of multiple choice items, which is only a single finding and, to my knowledge, not yet replicated:
Avoid ‘none of the above’ if you want students to learn
When the correct answer to a test item is “A and B, but not C,” students have to jump through a variety of hoops in their brains to reach the right answer. Students are more likely to get stumped by the options and can come away with more misunderstandings as a result. This makes the complex question format an unreliable tool to measure learning. “The variability in responding among test-takers reduces reliability, which is critical for assessment,” Butler writes.
But there are test givers who still use complex question formats.
And then there is the validity problem of using a test that wasn’t intended for the particular purpose. Think of the times that a teacher pulls some test off the Internet (or, in earlier days, out of the exercise book) and employs it when it isn’t a good match for what has been taught so far or for all that needed to be learned. Greg Cizek, who actually wrote a book with the title Validity, expressed the difficulty this way while making the case that students’ test scores should not be used to evaluate teachers:
“The set of best practices for the field of testing embodied by the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) caution against unintended and unvalidated uses when a test was not designed for such use or when sufficient information supporting the unintended or secondary use has not been gathered and analyzed. According to the Standards: ‘If validity for some common or likely interpretation for a given use has not been evaluated…potential users should be strongly cautioned about making unsupported interpretations’ (p. 23) and ‘the improper use of tests…can cause considerable harm to test takers and other parties affected by test-based decisions.’”
Finally, in looking at validity we should not ignore the possibility that the test is aiming at the wrong target.
Bad Assessments Are Bad for Learning is the title of this article, which makes the point that part of the validity problem with tests may stem from the earliest stage of design; they may be measuring something that isn’t so critical to the person’s, or even the community’s, development, such as “…recall of rote learnt knowledge”. In the age of Google, rote learning, while not worthless, is certainly less relevant, especially for jobs that will provide greater security and rewards. “Assessments (that) are measuring higher-order skills is only one aspect of measuring real learning outcomes, but an important one.” The results provided by poorly designed assessments, even if they proved valid for this lesser construct, neither allow the student to make a claim about something such as higher-order skills nor allow the teacher to alter instruction so as to aid learning.
Higher-order skills are ones that require thinking and not just recall. Newman Burdett gives a good example of the difference in this paper:
“The emphasis largely seems to be on recall of rote learnt knowledge (in lower order skills)… Students may well be leaving examination halls having scored high marks, but not having learnt (sic) anything of use outside that examination hall. They could potentially be leaving school with no useful skills and poor literacy and mathematical skills beyond the very limited repertoire needed to pass the examination. However, if the examination requires them to not only recall knowledge, but understand, apply, and be able to use that knowledge in novel situations, then it is likely that what they learn in school will be useful beyond the examination.”
Closing out this particular look at validity, I’m reminded again of the unease I feel when people suggest that GPA, or grade point average, would be a very useful substitute for standardized tests. We will return to the whole issue of standardized admission tests later this month, but for now this quote from a chapter by Emily Shaw in the book Measuring Success, edited by Jack Buckley, Lynn Letukas, and Ben Wildavsky, makes my point: “decades of research has shown that the SAT and ACT are predictive of important college outcomes, including grade point average (GPA) [in college], retention, and completion, and they provide unique and additional information in the prediction of college outcomes over high school grades.” (Emphasis added.) I’m not happy that’s the case, but until we improve the validity of classroom tests, it’s likely to continue whether people like it or not.
Another sharp analysis. I particularly like the formula ‘design + hubris/fallibility = poor validity’, which elegantly shows that even good design (and it’s debatable how much of that there really is!) can be undermined by overconfidence. And I love that quote from Roger Schank – but, again, how many well-written objectives are there, really?
Poor communication, combined with a lack of a shared understanding of what an assessment is trying to assess, and how that matches what is being taught and learnt, seems to me to be at the heart of much poor assessment, along with overcomplex question formats and/or unnecessarily elaborate language employed by test-developers anxious (hopefully subconsciously) to show off their impressive grasp of said language – which just obfuscates meaning and creates barriers for some candidates. I think that previous sentence is an example of a complex piece of text, that would have no place on most assessments.
Perhaps unsurprisingly, given that we lead Chartered Educational Assessor courses together, I am highly sympathetic to the quote from Newman Burdett’s paper too – what’s the point of knowledge if one cannot understand and apply it to life?!