Measuring Two Different Things: People and Trends

It is important to recognize that an instrument measures something aside from its own measurements and to be clear about what it is. Take the early thermometers. To say that heat was the readings they provided would be absurd. Heat was quite a separate thing, namely, what you felt on your skin on a warm day. As for what the early thermometers measured, before they were perfected, they confounded measuring heat with registering atmospheric pressure. Later two separate instruments were developed that disentangled the two: mercury thermometers for the heat alone; and barometers for atmospheric pressure alone.

My book develops a simple thesis in three parts. First, IQ tests inclusive of Raven’s and the ten WISC subtests are instruments of measurement. Second, during the 20th century in America, they have been measuring two distinct things. Comparing individuals at any given time, they have recorded a tendency for a high-IQ person to be superior to the average person on all of these tests, which has led us to say they measure a general intelligence factor called g. Comparing generations over time, they have measured something quite different, namely, various cognitive abilities either remaining stable or being enhanced. Third, the concept of g sheds no light on why these trends have occurred or their significance.

The third point is simply a matter of fact. The various cognitive abilities measured by different tests or subtests show differing magnitudes of gain that have nothing to do with their excellence as a measure of g. They reflect social priorities that have shifted over time. Although we have done a better job of teaching children the mechanics of reading, thanks to a visual culture, they have no larger vocabularies and thus the Vocabulary subtests show minimal gains over time. Thanks to the enhanced demand for people who wear scientific spectacles, people are much better today at classifying the world and using logic to analyze it, that is, using logic as a tool that can deal with abstract categories. Therefore, there have been huge gains on the Similarities subtests and Raven’s. The shifting priorities of society do not reflect g-loadings because society does not value cognitive abilities in terms of how much a gifted person beats an average person on them. Why should it? Lumber may be a humbler thing than a symphony but more necessary.

Once we stop using g to try to make sense of cognitive trends over time, each trend becomes interesting in its own right. Why we have no larger vocabularies to deal with everyday life is interesting and why we tend to classify the world rather than merely manipulate it is interesting. A fascinating history emerges. It tells us how our minds have responded to the evolving cognitive demands that evolving industrial society has made. It is not a matter of some fixed cognitive factor trying to do new things; rather cognition itself is evolving to meet new demands. Our minds are not like a baseball bat that has remained unchanged over 100 years, with only fast balls to cope with up to 1950, and the curve ball coming along on that date. Our minds are like cars. Today’s cars have evolved beyond the Model-T Ford because we now expect more than a means of transport. We expect cars to go faster, have a stereo system, a direction finder, and so forth.

There is a certain sense in which g is stable over time. At Time A, high-IQ people are superior to the average person more on one cognitive skill than another and they beat the average person on all of them. This is the kind of inter-correlation of performances on tests that engenders g. Over time, some skills are enhanced and others are not, quite independently of g. But at Time B, a high-IQ person may still be better than the average person on all skills and still be better on the various skills in much the same pattern. So the inter-correlation we call g emerges once again. Therefore, absolute changes in average skills over time are quite consistent with the persistence of correlations calculated at any given time. However, we want to know what happened between the two times and the correlations are not informative.
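A toy simulation can make the point about averages and correlations concrete. The sketch below is only an illustration under stated assumptions, not an analysis of real test data: the two "subtests", the shared component, the cohort size, and the gain values are all hypothetical. It builds two cohorts in which the same kind of person does well on both tests, gives the later cohort a large average gain on one test and a small one on the other, and then compares the within-cohort correlations.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def simulate_cohort(rng, n, gain_similarities, gain_vocabulary):
    """Toy cohort: one shared component per person plus test-specific noise.

    The gains are constants added to everyone in the cohort (hypothetical values).
    """
    similarities, vocabulary = [], []
    for _ in range(n):
        shared = rng.gauss(0, 1)  # what makes one person do well on both tests
        similarities.append(10 + gain_similarities + shared + rng.gauss(0, 1))
        vocabulary.append(10 + gain_vocabulary + shared + rng.gauss(0, 1))
    return similarities, vocabulary

rng = random.Random(0)  # fixed seed so the toy data are reproducible
sim_a, voc_a = simulate_cohort(rng, 5000, gain_similarities=0.0, gain_vocabulary=0.0)
sim_b, voc_b = simulate_cohort(rng, 5000, gain_similarities=3.0, gain_vocabulary=0.2)

r_a = pearson(sim_a, voc_a)  # correlation within the Time A cohort
r_b = pearson(sim_b, voc_b)  # correlation within the Time B cohort
gain_sim = sum(sim_b) / len(sim_b) - sum(sim_a) / len(sim_a)
gain_voc = sum(voc_b) / len(voc_b) - sum(voc_a) / len(voc_a)

print(f"Time A correlation: {r_a:.2f}, Time B correlation: {r_b:.2f}")
print(f"Average gain on Similarities: {gain_sim:.2f}, on Vocabulary: {gain_voc:.2f}")
```

In this toy setup the two correlations come out close to each other even though the averages have moved by very different amounts, which is why the correlations alone say nothing about what changed between the two times.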

There is a difference between measuring things on an absolute or on a relative scale. Psychometricians posit latent traits that do not exist as functional traits in the real world. People cannot actually “do” g or “do” full-scale IQ, so we can only assign them numbers in a pecking order that ranks them. In real-world functional terms, you can only read, speak, calculate, classify, or do logic. Things in the real world divide into entities we can measure on an absolute scale, a quasi-absolute scale, or a relative scale. But even the last kind of scale can sometimes be translated into a scale of rough absolute judgments about real-world competencies – if their links to those competencies are strong enough.

Time, space, and counting things that are the same can all be measured absolutely. A ruler has a zero point and we can use it to measure whether something is one or two or three inches long. We can count whether we have no or one or two or three beans. Measuring climate (as distinct from the temperature of other things) with a thermometer is a quasi-absolute scale. It has a zero point (absolute zero) but the degrees do not mean exactly the same thing in terms of climate, which has to do with human comfort. The degrees as the weather gets to freezing or boiling hot are more significant than those between 15 and 20 degrees centigrade, but it is easy to make allowance for this, so no harm is done. The WISC subtests and Raven’s can of course be used simply to rank people against one another on a relative scale consisting of deviation “IQs”. But if we forget such scores for a moment, we will see that getting the items correct is close enough to prerequisites for competencies in the real world, so that they imply a scale of absolute judgments.

We could have an absolute scale for vocabulary by counting the number of words someone can define, from zero to any number you like. But since our object is to assess how competent someone is in speaking and reading English (in non-specialized speech), it is better to include a sample of the most frequently used words up to, say, 5,000, with less representation of the less frequently used words. The WISC Vocabulary subtest approximates this. We can make a series of judgments. This person cannot even read the Bobbsey Twins, that one could but not Hemingway, that one could but not Huxley. This gives us a scale of absolute judgments running from illiteracy to “can read War and Peace.” The connection between the command of vocabulary and these competencies is strong enough to bridge time. Someone with a 500-word vocabulary could no more have read a serious novel in 1900 than they could today.

As for Similarities and Raven’s, we hypothesize a scale of competencies that links their items to the ability to classify (using dichotomies rather than utilitarian likeness) and to use logic to deal with formal symbols. We posit an absolute scale ranging from “this person lacks even the scientific spectacles to do elementary algebra” to “this person could, but not formal logic” to “this person could, but not tertiary science.” Once again, I posit that the links are strong enough to persist over time. Whether the average person could classify the world only in terms of categories of everyday utilitarian significance, or also classify it using the categories that underpin modern science, is assumed to have persistent real-world significance.

In contrast, when full-scale IQ gives us a relative ranking of people, the link between their score and real-world competencies is not robust enough to persist over time. It simply lumps too many things together that have differing functional significances. We need to know whether the IQ is high because of an unusually large vocabulary or an unusual ability to do three-dimensional jigsaw puzzles. The latent trait called g is equally useless in allowing for absolute judgments over time. In so far as it ranks people functioning in the real world, it merely tells us that full-scale IQ is a pretty good measure of how much better you are than the average person on a whole range of tasks lumped together. So again, we get no information that allows us to establish strong links between test performance and functional competencies.

We may dramatize score gains by using the scales of the past, that is, we may say that the average person today would be at the 84th percentile of people at some past time in terms of Similarities or Raven’s. But that is unnecessary. All we need do is to say that people today are a lot better at one cognitive skill (classifying and detaching logic from the concrete) and only marginally better at another (reading serious novels). We dramatize these trends only to counter anyone who might say they are trivial. But the fact remains that they allow us to derive a rough substitute for absolute measurements. And the fact remains that g is useless in analyzing trends over time because it takes its very meaning from a pattern of relative rankings and it lacks the specificity to shatter that limitation.
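The percentile figure is simple arithmetic on the normal curve. As a rough sketch, assuming scores are normally distributed with the same spread in both generations (an idealization adopted here, not something the tests guarantee), an average gain of one standard deviation places today's average person at roughly the 84th percentile of the earlier cohort. The function name below is hypothetical, and the 15-point standard deviation is the usual convention for deviation IQs, used only to translate the gain into IQ points.

```python
from statistics import NormalDist

IQ_SD = 15  # conventional standard deviation of deviation IQs (assumed here)

def percentile_in_past_cohort(gain_in_sd):
    """Percentile of today's average person within the past cohort.

    Assumes both cohorts are normal with equal spread and that the average
    has risen by `gain_in_sd` standard deviations of the past cohort.
    """
    return 100 * NormalDist().cdf(gain_in_sd)

pct_one_sd = percentile_in_past_cohort(1.0)   # gain of one standard deviation
gain_in_iq_points = 1.0 * IQ_SD               # the same gain expressed in IQ points

print(f"A gain of {gain_in_iq_points:.0f} IQ points puts today's average person "
      f"at about the {pct_one_sd:.0f}th percentile of the earlier cohort.")
```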

Now to deal with the three commentaries in turn.

Linda Gottfredson

I will stick to the usual language of scholarly discourse because that is the best way forward. This is not difficult because I have met, like, and respect Gottfredson. It should now be clear that I commit none of the fallacies she alleges. I do not confuse averages and correlations. The fact that there is an “absolute” gain in average performances over time does not affect the fact that performances at a given time inter-correlate (that some people do well on everything) and thus engender g. I do not confuse absolute and relative scales but rather use scales with rough absolute significance. I do not confuse the instrument of measurement with what is being measured. The only one who has done that in this ongoing debate is Jensen (1972, p. 76) when he said that intelligence is what IQ tests measure. Thanks to his high level of intelligence (capacity to learn), he quickly abandoned that position (Jensen, 1979). Absolutely nothing in my case as to why g is useless in understanding trends over time depends on such confusions.

That case in no way detracts from the efforts of some psychologists to understand g, its roots in genes and environment, and hypotheses that different full-scale IQs in different nations will causally interact with their economic development. My point is simply that those lines of research will not illuminate the cognitive history of the United States. Gottfredson speculates that while gains over time are not caught by g, they might constitute gains on some subsidiary factor revealed by factor analysis of IQ tests. That is a dead end. IQ trends over time are not factor invariant on ANY of the factors that performance at a given time reveals, that is, not on verbal, spatial, or memory factors (Wicherts et al., 2004). The way forward is not factor analysis of static performance but sociology, which can show us society and its demands evolving in all of their complexity.

It is good that Gottfredson agrees that the theory of intelligence must address the problems posed by massive IQ gains over time. Those who think g theory adequate should come forward with their solutions. The best way to replace a defective theory is by way of a better one. I will bet that whatever emerges will depend on sociology rather than g and, since it addresses a historical problem, will best me by greater plausibility rather than hard evidence.

A few other points:

Gottfredson implies that I am sanguine about the success of interventions designed to raise IQ. The opposite is the case as the Dickens-Flynn model of cognitive development entails. I see nothing wrong about the speculation that those who fall in love with demanding cognitive pursuits will make the most of their potential.
She rejects my attempt to show that nutrition does not account for IQ gains over the last half century in America and that hybrid vigor (the Mingroni hypothesis) does not account for those of the last century. But since she does not rehearse my evidence, she says nothing to refute it. The reader can read Chapter 5 of my book and judge for himself (Flynn, 2007).
She is unimpressed by the Dickens-Flynn model but offers no alternative explanation of the puzzle it attempts to solve: how environment is so feeble in the twin studies but so potent over time.
To understand Blair’s results on the decentralization of brain functions, she should read Blair if my account is insufficient.
Reaction time studies may make absolute assessments of trends over time, but they are theoretically bankrupt because they do not measure brain physiology or neural speed. We could measure whether people can stuff more beans up their nose today on an absolute scale, but that would not advance our understanding of cognition.
Anyone who doubts that the Similarities subtest measures the ability to classify should simply take the test and look at the scoring directions. If anyone doubts that Raven’s measures the ability to use logic to make sense of the sequences of shapes, they should do the same.

Eric Turkheimer

He takes issue with only one assertion in my book. I believe that g may well signal something “real” when we compare the cognitive performance of one person to another at a given time. Beyond dispute is that some people do better than the average person at a whole range of cognitive tasks. That poses three distinct questions.

In the realm of individual differences, does g have predictive validity, that is, if someone gets a high full-scale IQ on a test with cognitively complex tasks, can we use that score to predict their fate? Turkheimer does not dispute this. Your full-scale WISC IQ gives significant predictions as to whether you will get good grades or qualify for an elite profession. As I point out, that does not mean it cannot be improved on. Sternberg has gotten better predictions by supplementing conventional IQ tests with more creative tasks such as writing an essay on the octopus’s sneakers. Jim Heckman has shown that non-cognitive measures of social skills, self-control, etc., are equally powerful predictors.

Does g show that different real-world cognitive tasks have something functional in common? I argue that it does not. The same kind of person may do well at the high jump and sprints. But increasing your sprint speed will not improve your high jump performance. Similarly, the same person may do well on Raven’s and the Arithmetic subtest of the WISC. But enhancing your ability to solve matrix problems will not improve your ability to do arithmetic. The same people are good at them but they have little functional in common.

Does the fact that some people do better than others on a variety of cognitive tasks have a cause in brain physiology? Turkheimer shows that the fact that g emerges from factor analysis, and what we know about what correlates with g, such as that it is heritable and so forth, leaves this an open question. I agree and did not explore this in my book because I felt I had quite enough new things to sell. But also my suspicion is that progress in brain physiology may show that certain individual differences underlie generally high performance. Some people best me at all sports due to a better cardiovascular system and a faster reflex arc. Some people may beat me at all cognitive skills because of a better blood supply to the brain and better dopamine sprayers (dopamine strengthens whatever neural connections are in use when we learn things).

Stephen Ceci

Steve Ceci has done me the service of making an important point that is in danger of getting lost. Unless massive IQ gains over time are dismissed as mere test sophistication, and there is conclusive evidence that they are not, their mere existence means a re-evaluation of theories of intelligence based on g. G was supposed to be so robust as to bridge even differences between the species (Jensen, 1980). Some explanation must be offered as to how the malleability of IQ and the persistence of g are compatible. Talk about one being an instrument and the other being the thing measured is just saying something has gone wrong but I know not what. My theory offers a solution but at the price of confining the relevance of g to individual differences and rendering it irrelevant to cognitive history. Once again, I will wager that any better solution will have the same effect.

References

Flynn, J.R. (2007). What is intelligence? Beyond the Flynn effect. Cambridge: Cambridge University Press.

Jensen, A.R. (1972). Genetics and education. London: Methuen.

Jensen, A.R. (1979). The nature of intelligence and its relation to learning. Journal of Research and Development in Education, 12: 79-85.

Jensen, A.R. (1980). Bias in mental testing. London: Methuen.

Wicherts, J.M., Dolan, C.V., Hessen, D.J., Oosterveld, P., van Baal, G.C.N., Boomsma, D.I., and Span, M.M. (2004). Are intelligence tests measurement invariant over time? Investigating the Flynn effect. Intelligence, 32: 509-538.
