Disaggregation
What's Good for the Goose Is Good for the Gander
Assessment for NCLB vs Assessment for Learning Disability Identification
Miriam Cherkes-Julkowski, Ph.D.; July 2004
The No Child Left Behind Act (NCLB) requires that schools disaggregate performance data so that the progress of individual subgroups of children can be viewed without being lost in a larger group analysis. The logic seems impeccable, and Dr. Eugene Hickok, Under Secretary of Education, expressed it well (http://www.education-world.com/a_issues/issues371.shtml):
If you look at NCLB, its primary thrust in regards to accountability is to make sure that student performance is not hidden in the averages. I'm sure you can go to school districts in Louisiana where the average test scores are very high. But when you disaggregate by groups, you'll find out that although the average is high, there might be many or a few groups that are very much below that average. If you don't disaggregate, those students get lost in the averages. You never find out the nature of the achievement gap you've got before you. The purpose behind these subgroups is to make it impossible to hide inadequate student performances.
Or, in President Bush's words (http://www.whitehouse.gov/news/releases/2004/01/20040108-3.html):
we're unwilling to accept the past, where everybody was just kind of measured all together. What we want to know, is we want to know specifically who is succeeding, and who is not.
The only qualification applied to disaggregation in the NCLB Act is that each disaggregated subgroup must constitute a group large enough to support reliable findings.
When this same (very good) logic is applied to the identification of children with learning disabilities, it meets with alarming resistance. Most states have institutionalized the use of composite, i.e., aggregated, scores in the identification process. Full scale IQ (WISC-IV, 2003) is the most aggregated of scores, but the scale scores consisting of verbal comprehension, processing speed, perceptual reasoning and working memory are also aggregates or composites of individual test scores. Achievement scores are aggregates as well. As an example, the Broad Reading score on the Woodcock-Johnson III (Woodcock, McGrew & Mather, 2001) is an aggregate of letter-word identification and passage comprehension. The use of aggregate scores becomes further ingrained in the LD identification process when schools move in the direction of regression analyses that require the use of composite scores (Bergeson, Heuschel, Harmon & Gill, 2003; Cherkes-Julkowski, 2004).
Just as aggregate data hide subgroup performance for accountability purposes under NCLB, so do they hide subtest, subskill, subability performance for measurement purposes when assessing individual students. Let's use Dr. Hickok's formulation, this time substituting the language of LD identification (changes are in italics):
If you look at LD identification, its primary thrust in regards to accountability is to make sure that student performance is not hidden in the averages. I'm sure you can go to an individual student's performance where the average test scores are very high. But when you disaggregate by performance area, you'll find out that although the average is high, there might be many or a few subtest areas that are very much below that average. If you don't disaggregate, those performance areas get lost in the averages. You never find out the nature of the achievement gap you've got before you. The purpose behind these subtest areas is to make it impossible to hide inadequate student performances.
Doing the same with President Bush's explanation:
we're unwilling to accept the past, where all subtest or individual performance areas were just kind of measured all together. What we want to know, is we want to know specifically in what individual performance areas a child is succeeding, and in what individual performance areas s/he is not.
As with NCLB's disaggregation, disaggregation (or, better stated, nonaggregation) of composite, scale, broad or cluster scores is justified only if the contributing subtests are reliable in their own right. As an example, if the Broad Reading score on the Woodcock-Johnson is to be broken back down into its original subcomponents, letter-word identification and passage comprehension, those two subtests must themselves be statistically reliable. No test worth its salt would include a subtest that was not a reliable entity. In our example, the reliability of the letter-word test is .92 and that of passage comprehension is .90. These are highly respectable by any criterion. These reliability figures are representative of most of the subtests currently used for achievement and cognitive abilities measurement.
Let's take the parallel one more step, offering an example of each. Assume that a classroom consists of 50% very high achievers and 50% children who are performing well below grade level. To make the math simple, assume that the high achievers average 2 standard deviations above grade level and that the low achievers average 2 standard deviations below grade level. When you add up all of their achievement scores and divide by the total number of children in the classroom, the average class score will be exactly on grade level, exactly average. But, as Dr. Hickok and President Bush explain, this is unacceptable because it makes it impossible to locate any lack of success: "You never find out the nature of the achievement gap you've got before you. The purpose behind these subgroups is to make it impossible to hide inadequate student performances."
Now, assume that an individual child has very high cognitive abilities in the area of visuospatial functions (as measured possibly by the WISC-IV perceptual reasoning and processing speed subscales) and has very high achievement in math. The same child has very low abilities in language and very low achievement in reading. To make the math simple, assume that high abilities and achievement are 2 standard deviations above average for grade/age. Low abilities and achievement are 2 standard deviations below average. Composite scores will yield an exact average, or close to it, for both ability and achievement: no discrepancy, no learning disability. And, "You never find out the nature of the achievement gap you've got before you. The purpose behind these subtest scores is to make it impossible to hide inadequate student performances."
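The arithmetic behind both examples can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: scores are expressed as hypothetical z-scores (standard deviations from the grade/age average), and a composite is treated as a plain average of its parts, which is not how published tests actually derive composite scores. The function names and the -1.0 cutoff are invented for the sketch.

```python
from statistics import mean

# Hypothetical z-scores from the simplified examples in the text:
# +2.0 is two standard deviations above average, -2.0 is two below.
classroom = {"high-achieving half": 2.0, "low-achieving half": -2.0}
child_ability = {"visuospatial": 2.0, "language": -2.0}
child_achievement = {"math": 2.0, "reading": -2.0}


def aggregate(scores):
    """Collapse component scores into one number by simple averaging.

    Assumption: equal weighting. This matches the classroom example only
    because the two halves are the same size.
    """
    return mean(list(scores.values()))


def hidden_weaknesses(scores, cutoff=-1.0):
    """Name the components at or below an illustrative cutoff."""
    return sorted(name for name, z in scores.items() if z <= cutoff)


class_average = aggregate(classroom)
ability_composite = aggregate(child_ability)
achievement_composite = aggregate(child_achievement)
composite_discrepancy = ability_composite - achievement_composite

print("class average:", class_average)
print("ability composite:", ability_composite)
print("achievement composite:", achievement_composite)
print("ability minus achievement:", composite_discrepancy)
print("hidden in the class average:", hidden_weaknesses(classroom))
print("hidden in the child's composites:",
      hidden_weaknesses(child_ability) + hidden_weaknesses(child_achievement))
```

In both cases the aggregate comes out exactly average and the composite discrepancy is zero, while the disaggregated listing still names the low-achieving half of the class and the child's language and reading weaknesses.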
What is widely and sensibly accepted as sound measurement practice under NCLB is widely, irrationally and sometimes vehemently rejected when it comes to procedures for identifying learning disabilities. Here is a typical example. A third grade child, whom we shall call Mike, came to my consulting and diagnostic practice after 3 years of struggling in school and a number of multidisciplinary team meetings, all of which ended in the team's decision (without the consensus of the parents) that Mike did not have a learning disability. After my testing found that he did in fact have a learning disability, the school performed a complete set of evaluations: educational, psychological, and speech and language.
The school found again that Mike did not have a learning disability. School personnel based this conclusion on aggregated scores for all areas tested and refused to disaggregate even though the statistical differences among contributing subtest scores were consistently extreme and therefore signaled that extremes of student performance surely would be hidden.
The WISC IV results, tabulated below, dramatize the distortions created by aggregation and the distortions resulting from absolute resistance to disaggregation:
Verbal Comprehension Subtests
    Similarities                 8
    Vocabulary                   9
    Comprehension                3
Perceptual Reasoning Subtests
    Block Design                10
    Picture Concepts            11
    Matrix Reasoning            11
Working Memory Subtests
    Digit Span                  10
    Letter-Number Sequence       7
Processing Speed Subtests
    Coding                      13
    Symbol Search               11
Composite Scores
    Verbal Comprehension        81
    Perceptual Reasoning       104
    Working Memory              91
    Processing Speed           112
    Full Scale IQ               94
Nowhere in the psychological evaluation report or in the school's contribution to the team discussion was it mentioned that there were important discrepancies among the contributing scores. The school reported low average ability. School personnel discounted any suggestion that scores should be disaggregated to reveal that Mike had abilities higher than low average and some achievement scores far lower than ability as well as far lower than other achievement scores.
Simply looking at the scores reveals a huge fluctuation among subtest scores, ranging from 3 to 13, a range bridging more than 3 standard deviations and spanning from severely below average to above average. Statistical tests (using the WISC IV standards) found discrepancies at or beyond the .05 level for the following comparisons: verbal comprehension vs perceptual reasoning, verbal comprehension vs processing speed, perceptual reasoning vs working memory and working memory vs processing speed.
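The size of that fluctuation can be checked with a short Python sketch. It assumes the standard Wechsler subtest metric (scaled scores with a mean of 10 and a standard deviation of 3), which is not stated in the text above; the variable and function names are illustrative only.

```python
# Assumed metric for Wechsler subtest scaled scores: mean 10, SD 3.
SCALED_MEAN = 10
SCALED_SD = 3

# Mike's subtest scaled scores, as tabulated above.
subtests = {
    "Similarities": 8,
    "Vocabulary": 9,
    "Comprehension": 3,
    "Block Design": 10,
    "Picture Concepts": 11,
    "Matrix Reasoning": 11,
    "Digit Span": 10,
    "Letter-Number Sequence": 7,
    "Coding": 13,
    "Symbol Search": 11,
}


def to_z(scaled_score):
    """Express a scaled score in standard deviations from the mean."""
    return (scaled_score - SCALED_MEAN) / SCALED_SD


lowest = min(subtests, key=subtests.get)
highest = max(subtests, key=subtests.get)
spread_in_sd = (subtests[highest] - subtests[lowest]) / SCALED_SD

print(f"lowest:  {lowest} ({subtests[lowest]}, z = {to_z(subtests[lowest]):.2f})")
print(f"highest: {highest} ({subtests[highest]}, z = {to_z(subtests[highest]):.2f})")
print(f"spread:  {spread_in_sd:.2f} standard deviations")
```

Under that assumption the gap between Comprehension (3) and Coding (13) is 10 scaled-score points, a little over 3 standard deviations, which is consistent with the range described above.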
Another method to assess the magnitude of these discrepancies is to determine what percentage of the population would in fact have disaggregated scores this widely discrepant, of the kind that NCLB says would falsify the average/composite scores. In Mike's case the data were:
    verbal comprehension vs perceptual reasoning     5.4%
    verbal comprehension vs processing speed         2.9%
    working memory vs processing speed              11.1%
As cited in the Florida state guidelines for identification of learning disability (1996), "Rare differences, on the other hand, would be those that are considered abnormal and occurring in only 15% of the population." In our typical example, the differences among Mike's scores far exceeded this degree of rarity.
Without belaboring the point, it is worth mentioning that this degree of discrepancy existed throughout Mike's disaggregated scores. Achievement scores for individual subtests of the Woodcock-Johnson III (Woodcock, McGrew & Mather, 2001) ranged from 122 to 88. Aggregated cluster scores, of course, muted the differences, but the range was still wide, 121 to 95.
None of this has mattered to the team, which continues to this day to insist there is no basis for a discrepancy because there are no significant negative differences between the full scale (aggregated) IQ score and the composite (aggregated) achievement scores. Mike's family, like many others, has tried to explain how this kind of aggregation means you can never find out about the nature of the achievement gap or the processing gap, but to no avail. In fact, this approach to calculating scores will almost always hide a learning disability.
What is good, logical and just as a measurement principle for NCLB is even more critical, and embodies an even more essential logic, in the identification of children with learning disabilities, whose school success and self-esteem are so very dependent on the identification, not the hiding, of real strengths and weaknesses.
References
Bergeson, T., Heuschel, M.A., Harmon, B. & Gill, D. (2003). Identification of Students with Specific Learning Disabilities: State of Washington Severe Discrepancy Tables. WAC 392-172-130.
Cherkes-Julkowski, M. (2004). Regressing toward the Mean vs Progressing toward Individual Differences. Available at: www.educational-advisor.com
Florida Department of Education, Division of Public Schools, Bureau of Student Services and Exceptional Education (1996). Technical Assistance Paper: The Use of Partial Scores with Tests of Intelligence.
Wechsler, D. (2003). Wechsler Intelligence Scale for Children - Fourth Edition. San Antonio, TX: Harcourt Assessment, Inc.
Woodcock, R., McGrew, K. & Mather, N. (2001). Woodcock-Johnson III. Itasca, IL.