Shattering Logic to Explain the Flynn Effect

Flynn brought world attention to the intriguing fact that IQ test scores rose steadily and rather dramatically throughout much of the Twentieth Century, at least in those countries for which we have good data. Years back, he interpreted such inexplicable increases as evidence that IQ tests must surely be flawed. Now he seems to accept unquestioningly their power to capture changes in human intelligence over the sweep of time. He has become the ultimate IQ-man.

The irony is that intelligence researchers themselves, Flynn’s “g-men,” do not accept IQ scores at face value. Unlike Flynn, they have no interest in debating the proper verbal definition of intelligence, but rather seek to understand a major discovery of the last century: the g factor. g refers to the continuum of differences among individuals in their general capacity to learn and reason, almost regardless of task content and context. IQ tests measure g well, and all mental tests measure mostly g, whatever their content. Only at the psychological-behavioral level is g unitary, however, and various disciplines are currently probing its multiple roots in genes and environments, its physiological manifestations in the brain, and its impact on the lives of individuals and nations (Gottfredson, 1997). If g-men are “obsessed,” it is with getting to the bottom of the phenomenon that is g. IQ tests are merely one tool in that endeavor.

Flynn’s Story

Flynn’s new book, What Is Intelligence?, details more fully the tale he sketches here at Cato Unbound. He first recounts how he discovered that performance on IQ tests was rising each decade despite the high heritability of IQ and then how, according to his account, he and William Dickens have resolved this most baffling mystery ever to confront intelligence researchers. He reports that he succeeded only by overthrowing the “conceptual imperialism” of g, which still leads g-men to deny all facts that threaten its hegemony. Once free of their “blinding obsession,” all became clear to him.

In his explanation, the industrial and scientific revolutions set in motion self-perpetuating feedback loops by which human intelligence not only ratcheted itself up, but also enlisted the power of our genes to do so. What many of us mistake as physiology or genetics at work is actually the imprint of shifting cultural priorities. Although recent generations do little better on IQ subtests such as Vocabulary, Arithmetic Reasoning, and General Information, mankind’s donning of “scientific spectacles” has enabled it to answer many more Raven’s Matrices and Similarities items than did earlier generations.

Flynn argues that we need not leave future advancement in human intelligence to chance: “Interventions that may enhance IQ include the efforts of parents, programs that afford an enriched environment to children at risk, adoption, and university study.” Readers might be perplexed as to how his novel insights point us back to old interventions already known not to raise IQ. He suggests that such socio-educational enrichment might work if “imposed” on us throughout our lives, regardless of our genetic differences. More self-directed individuals can create their own luck by “falling in love with ideas,” thus providing themselves with constant access to a “portable gymnasium that exercises the mind.”

Flynn’s Argument

The chief riddle posed by the Flynn Effect is this: How can something so heritable as IQ change so quickly from one generation to the next? To my mind, this paradox signals that we have yet to learn something fundamental about intelligence or current measures of it. Although Flynn does not discuss the matter, there is no evidence that g itself has increased, let alone by strictly cultural factors. He can make his case for the latter only by denying that the empirical phenomenon of g is relevant, specifically, by seeming to reduce it to a collection of independent components for which he can generate separate explanations. Only in this way can he neuter the incontrovertible evidence for g’s existence as a highly replicable empirical phenomenon, its correlations with many aspects of brain physiology, the distributed nature of g-related brain activity, and the strong genetic basis of both g and brain physiology (Gottfredson, 1997; Jung & Haier, 2007)—all of which undercut a strictly cultural explanation for rising IQ scores.

Flynn (2007) makes his case mostly by appeal to analogy (usually sports), that which is “undoubtedly” true (an historical shift from pre- to post-scientific thinking caused an advance from concrete to formal thinking; p. 32), selected bits of evidence of uncertain relevance or accuracy (the brain “autopsies” for elite London cabbies, which were actually MRIs while they were alive), and a confusion of assumptions and metaphors. The g factor is “shattered” like an “atom” to let different cognitive skills “swim free.” An “imperialistic” g (pp. 55ff) must be restricted to its “proper kingdom” by maintaining a “separation of powers” between the physiological, individual differences, and social levels of intelligence, thereby “giving each dominant construct the potency needed to rebuff the other two,” yet allowing “cross-fertilization” among them. A personified brain is similarly said to “unravel g into its component parts” in order to “pick and choose from the bundle of cognitive abilities wrapped up together by g” (p. 66). Without any empirical referents from him, I don’t know what such claims really mean.

However, with g seemingly now dispersed into “separate components” at both the psychometric and physiological levels, all components at both levels now seem separately malleable: for example, cognitive skills on different IQ subtests may be differentially affected by shifting cultural priorities, and various parts of the brain can be subjected to “cognitive exercises” of different sorts, such as driving around London. Disconnected from their common core, g, these narrower cognitive “skills” can be examined without regard to the vast interlocking network of evidence implicating a cross-domain intelligence of great practical value in the social realm.

Flynn rules out biological influences on brain physiology for explaining rising IQs by appealing to the very sorts of evidence that would seem to confirm their importance. Specifically, he eliminates nutrition as a possible cause of rising IQ test performance by noting that the trends for height do not seem consistent, in his view, with the disproportionate gains in IQ in some countries at the lower end of the IQ distribution. However, the very fact that height and other biological traits have changed in tandem with overall increases in IQ in many countries would seem to exclude the strictly cultural explanation that Flynn favors, no matter how fecund the “social multipliers” that he and William Dickens postulate (Mingroni, 2007, p. 812).

Flynn’s Fallacies

With characteristic understatement, Flynn says that everything became clear to him when he awoke from “the spell of g” (pp. 41-42). The reader, feeling afloat in a rolling sea of images and warm words, might ask whether he succeeds only by loosing himself from the bonds of evidence and logic. More troubling, his core argument rests on logical fallacies that profoundly misinterpret the evidence. I describe three below. To be fair, they are among the common fallacies bedeviling debates over intelligence testing, and most reflect a failure to appreciate the inherent limitations of psychological tests, including tests of intelligence (Gottfredson, in press).

Averages vs. correlations

Taller people tend to weigh more; that is, height and weight are correlated. If everyone gained 10 pounds, this average gain would have no effect whatsoever on the correlation between height and weight. Taller people would still tend to be heavier. Likewise, the fact that average scores on the Similarities subtest have risen over time but average scores on Vocabulary and Arithmetic Reasoning have not says nothing about whether the correlation between them has changed. In fact, it remains very high. The case for having “shattered” g rests precisely on this confusion, however. The g factor is derived, via factor analysis, from the correlations among subtests. Averages do not affect the calculation of correlations. A subtest’s g loading is simply its correlation with the common factor, g, extracted from such correlations. It is an interesting empirical fact that demographic groups (e.g., ages, races, nationalities) yield the same g despite often very different average levels of performance (number of items answered correctly).
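The height and weight example can be checked with a few lines of arithmetic. The sketch below is only an illustration: the heights and weights are invented, and the helper name pearson is my own, not taken from any test manual or dataset. It adds 10 pounds to every weight and compares the correlation before and after.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Only deviations from each mean enter the formula, so shifting
    # every score by the same constant cannot change the result.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data, invented for illustration only.
heights = [60, 63, 66, 69, 72, 75]          # inches
weights = [115, 140, 135, 160, 180, 175]    # pounds
heavier = [w + 10 for w in weights]         # everyone gains 10 pounds

r_before = pearson(heights, weights)
r_after = pearson(heights, heavier)
mean_gain = sum(heavier) / len(heavier) - sum(weights) / len(weights)

print(f"mean gain: {mean_gain:.1f} lb")
print(f"correlation before: {r_before:.4f}, after: {r_after:.4f}")
```

The average rises by 10 pounds while the correlation is unchanged, apart from floating-point rounding. The same holds for subtest averages and the correlations from which g is extracted.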

I agree with Flynn that it is intriguing that subtest averages have not changed in tandem with their g loadings. If g itself were rising over time, one would expect the most g-loaded tests to show the largest increases in raw scores. Because g constitutes the core of all mental abilities, one could construe these contrary results as evidence that it is not g that is increasing, but perhaps one of the subsidiary factors captured by IQ tests but independent of g (e.g., see Carroll’s, 1993, three-stratum hierarchical model, which illustrates how abilities differ primarily in their empirically determined generality of application across task domains).

Relative vs. absolute levels of ability

IQ tests are excellent measures of relative differences in a general proficiency to learn and reason, or g. But it is important to understand that they do so by providing deviation scores. That is, IQ scores are calculated relative to the average number of items answered correctly by everyone in one’s age group (the scores being transformed to have a mean of 100 and standard deviation of 15 or 16 for ease of use). Untransformed raw scores (numbers of items answered correctly) have no meaning by themselves, nor does the average difference between any two sets of raw scores. The best we can do, which Flynn does admirably, is to plug cross-generation differences in raw scores into the formula for calculating deviation IQs for the current generation. As noted above, however, we do not know whether the transported points represent an increase in g rather than something else.
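A minimal sketch of the deviation-score arithmetic, and of plugging an earlier generation’s raw scores into the current generation’s formula, may make this concrete. Everything in it is an assumption made for illustration: the raw scores are invented, the function name deviation_iq is hypothetical, and real norming uses large standardization samples and age-specific tables rather than a five-person list.

```python
from statistics import mean, pstdev


def deviation_iq(raw, norm_scores, sd_units=15):
    """Express a raw score relative to a norm group, rescaled so the
    group mean is 100 and one standard deviation is sd_units points."""
    norm_mean = mean(norm_scores)
    norm_sd = pstdev(norm_scores)
    return 100 + sd_units * (raw - norm_mean) / norm_sd


# Hypothetical raw scores (items answered correctly), invented here.
current_norms = [30, 34, 38, 42, 46]    # current generation's norm group
earlier_scores = [24, 28, 32, 36, 40]   # an earlier generation, same test

# By construction, the norm group's own average maps to 100.
iq_of_current_mean = deviation_iq(mean(current_norms), current_norms)

# The earlier generation's average raw score, scored on current norms.
earlier_mean_on_current_norms = deviation_iq(mean(earlier_scores), current_norms)

print(iq_of_current_mean)
print(round(earlier_mean_on_current_norms, 1))
```

With these invented numbers the earlier generation’s average lands at about 84 on the current scale. The calculation yields a difference in points, but nothing in it indicates which underlying ability, g or something else, produced the difference in raw scores.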

In his book, Flynn thinks it pointless to continue research on elementary cognitive tasks (e.g., reaction time tests, such as with Jensen’s “button box”). But such tests may provide our first opportunity to measure g in absolute terms (on a ratio scale; Jensen, 2005). Performance on reaction time tests is scored in milliseconds. Unlike IQ scores, time has a zero point and equal-size units. Ratio-level measurement would finally allow us to chart patterns and rates of cognitive growth and decline over the life course as well as over decades. The Flynn Effect might have been explained long ago had we historical data of this sort.

Measure vs. the construct being measured

No one would mistake a thermometer for heat, nor try to glean its properties from the device’s superficial appearance. Nor should one do so with IQ tests. But people often confuse the yardstick (IQ scores) with the construct (g) actually measured. The manifest content of ability test items provides no guide to the ability constructs they actually succeed in measuring. The active ingredient in tests of intelligence is the complexity of their items, and it is also the ingredient (“processing complexity”) in functional literacy tasks that makes some more difficult than others (more abstract, more distracting information, more need for inference, etc.). To oversimplify only a bit, as long as two tests have similar g loadings, both will predict the same achievement equally well (or poorly), no matter how different their content might seem (Gottfredson, 2002).

Flynn’s peering into the tea leaves of subtest items is a species of the old specificity hypothesis in personnel selection psychology, which held that each ability test measures a different ability and that different jobs and school subjects call upon quite different abilities. For example, it was once received wisdom (but mistaken) that tests of verbal ability would predict reading but not math achievement, whereas tests of arithmetic reasoning would do the reverse. Specificity theory was falsified decades ago, as can be seen in the large literature on “validity generalization” in employment testing. Professor Flynn may believe that the Similarities subtest measures the ability “to classify” and that Vocabulary assays a different cognitive “skill,” but he needs to provide evidence and not mere belief. Belief did not smash the atom. Belief cannot explain the Flynn Effect.

References

Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge, UK: Cambridge University Press.

Gottfredson, L. S. (Ed.) (1997). Intelligence and social policy (special issue). Intelligence, 24(1), 1-320.

Gottfredson, L. S. (2002). g: Highly general and highly practical. Pages 331-380 in R. J. Sternberg & E. L. Grigorenko (Eds.), The general factor of intelligence: How general is it? Mahwah, NJ: Erlbaum.

Gottfredson, L. S. (in press). Logical fallacies used to impugn intelligence testing. In R. Phelps (Ed.), Anti-testing fallacies. Washington, DC: American Psychological Association.

Jung, R. E., & Haier, R. J. (2007). The Parieto-Frontal Integration Theory (P-FIT) of intelligence: Converging neuroimaging evidence. Behavioral and Brain Sciences (target article), 30, 135-187.

Mingroni, M. A. (2007). Resolving the IQ paradox: Heterosis as a cause of the Flynn effect and other trends. Psychological Review, 114(3), 806-829.

—

Linda S. Gottfredson is Professor of Education at the University of Delaware and co-director of the Delaware-Johns Hopkins Project for the Study of Intelligence and Society.
