Research Methods Implicitify Research Team

A Short History of Motive Content Coding: From Murray's TAT to LLM Scoring

Motive content coding is a narrow, technical craft inside personality psychology: the practice of inferring a person's motivational dispositions — chiefly the need for achievement, the need for affiliation, and the need for power — from samples of imaginative or spoken text. It is older than most of the constructs that currently dominate the field, and its history is, for the most part, a history of attempts to drag a fundamentally interpretive activity into something that behaves like measurement. The story is worth telling carefully, because the same problems keep recurring under new vocabulary.

Murray and the architecture of needs and presses

The lineage starts with Henry A. Murray and the staff of the Harvard Psychological Clinic. Explorations in Personality (1938) proposed a taxonomy of roughly twenty manifest needs (achievement, affiliation, dominance, nurturance, succorance, and so on) interacting with environmental presses — the situational forces that elicit, frustrate, or shape those needs. Murray's framework was deliberately catholic. It tried to do justice to the obvious fact that motivated behavior arises at the intersection of persons and situations, and it took as given that motives are not always available to introspection.

To get at those harder-to-reach motives, Murray and Christiana Morgan introduced the Thematic Apperception Test (TAT) in 1935. The TAT presents ambiguous pictures and asks the respondent to tell a story about each one. The clinical conceit is straightforward: in the absence of a determinate stimulus, the structure of the story will reflect the storyteller. The procedural conceit is that one can read those stories systematically rather than impressionistically.

Murray's own scoring was rich, idiographic, and difficult to reproduce. As a clinical instrument the TAT survives to this day; as a measurement instrument it would have died young if no one had imposed a stricter discipline on the coding side.

McClelland and the move to empirical scoring

The discipline arrived with David McClelland and his collaborators in the late 1940s and 1950s. The crucial methodological turn was experimental rather than theoretical: rather than declare a priori what a "story full of achievement motivation" looks like, McClelland's group manipulated the motivational state of subjects — for example, by inducing achievement arousal through a competitive task — and then asked which features of subsequently written TAT stories actually differed between aroused and control conditions. Categories that discriminated were retained. Categories that did not were discarded.
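The retain-or-discard logic described above can be sketched in a few lines. This is an illustration only, not McClelland's actual procedure: the category names, the toy stories, and the `min_diff` threshold are all hypothetical, and a real derivation would rest on proper statistical tests across many protocols rather than a raw difference in proportions.

```python
# Minimal sketch, assuming each story has already been hand-coded as a set of
# candidate category labels. All names and data below are hypothetical.

def presence_rate(stories, category):
    """Proportion of stories in which the category was coded as present."""
    return sum(category in story for story in stories) / len(stories)


def retain_discriminating(aroused, control, categories, min_diff=0.2):
    """Keep categories that appear more often under arousal than under control.

    min_diff is an illustrative cutoff on the difference in presence rates.
    """
    kept = {}
    for category in categories:
        diff = presence_rate(aroused, category) - presence_rate(control, category)
        if diff >= min_diff:
            kept[category] = diff
    return kept


# Toy data: four stories per condition.
aroused = [
    {"standard_of_excellence", "obstacle"},
    {"standard_of_excellence"},
    {"standard_of_excellence", "weather"},
    {"obstacle"},
]
control = [
    {"weather"},
    {"standard_of_excellence"},
    {"obstacle", "weather"},
    set(),
]
categories = ["standard_of_excellence", "obstacle", "weather"]

kept = retain_discriminating(aroused, control, categories)
print(kept)
```

In this toy sample the "weather" category is dropped because it is, if anything, more common in the control stories. That is the point of the design: a category earns its place by tracking the manipulated motive, not by sounding plausible.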

Out of this work came the first generation of empirically derived coding manuals for the so-called "Big Three" social motives. The need for achievement (nAch) — concern with a standard of excellence, whether as competition with that standard, unique accomplishment, or long-term involvement in attaining a goal — was operationalized in McClelland, Atkinson, Clark, and Lowell's The Achievement Motive (1953). The need for affiliation (nAff), a concern with establishing, maintaining, or restoring positive affective relationships, received its scoring system from Heyns, Veroff, and Atkinson, who worked under arousal manipulations involving sociometric and rejection conditions. The need for power (nPow), a concern with having impact, control, or influence over others, was a line Veroff initiated and Winter (1973, The Power Motive) substantially revised and extended.

Two features of this period deserve emphasis. First, the manuals were long, detailed, and operational: they specified scoring categories, illustrative passages, and decision rules. Second, training was empirical in a strict sense — coders were not certified by reading the manual but by reaching prespecified levels of agreement with expert-coded practice materials.

The proliferation problem

By the 1960s the field had a problem of its own making. Each motive had its own manual, often with multiple competing versions. Atkinson's edited Motives in Fantasy, Action, and Society (1958) collected many of them but did not unify them. Different manuals used different scoring units (the story, the sentence, the thought), different category sets, and different conventions for handling negation, subjunctive mood, or attributed (rather than enacted) motives. Scoring nAch, nAff, and nPow on the same protocol therefore meant running it through three largely independent coding regimes, each with its own training materials and its own inter-rater reliability budget.

The practical consequence was that motive scoring was expensive. A trained coder represented weeks of supervised practice, and holding the conventional agreement targets (typically category-by-category percentage agreement in the high 80s or above, or a corrected agreement coefficient of .85 or above) required ongoing calibration. Reliability slippage between studies, and across the boundary between TAT-style imaginative material and naturalistic running text (speeches, letters, interviews), was a chronic complaint. None of this is a scandal; it is what construct validation under operational scoring looks like when the construct is defined by a manual rather than by a self-report scale. But it limited adoption.
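To make the agreement targets concrete, here is a minimal sketch of three statistics a coder might be checked against. The text above does not say which corrected coefficient was used, so two choices here are assumptions: Cohen's kappa stands in for a chance-corrected coefficient, and the category agreement index takes the form 2 × agreements / (units scored by coder A + units scored by coder B), a form commonly reported in this literature. The codes and data are invented for illustration.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of scoring units on which two coders assigned the same code."""
    if len(a) != len(b) or not a:
        raise ValueError("coders must score the same non-empty set of units")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Chance-corrected agreement (Cohen's kappa) for two coders."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if each coder assigned codes independently
    # at their own marginal rates.
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_expected == 1.0:
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


def category_agreement(a, b, absent="none"):
    """2 * joint positive agreements / (positives by A + positives by B).

    Ignores units that both coders left unscored, so agreement on
    'nothing here' cannot inflate the index.
    """
    joint = sum(x == y and x != absent for x, y in zip(a, b))
    total = sum(x != absent for x in a) + sum(y != absent for y in b)
    return 2 * joint / total if total else 1.0


# Hypothetical codes for ten sentences ("none" = no motive imagery scored).
coder_a = ["ach", "none", "pow", "aff", "none", "ach", "none", "pow", "none", "none"]
coder_b = ["ach", "none", "pow", "none", "none", "ach", "none", "aff", "none", "none"]

print(percent_agreement(coder_a, coder_b))
print(cohens_kappa(coder_a, coder_b))
print(category_agreement(coder_a, coder_b))
```

On this toy sample raw agreement is 0.80, but both corrected figures are lower (about 0.68 and 0.67), because most of the raw agreement comes from sentences that neither coder scored. A trainee could look acceptable on simple percentage agreement and still fall well short of a .85 target on the stricter index.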

Winter's integrated running-text system

The pivot in this history is David Winter's integrated coding system for running text, published in its mature form in the early 1990s and revised through the mid-1990s. Winter did three things at once.

First, he unified the scoring of achievement, affiliation/intimacy, and power into a single manual with a common scoring unit and a common set of conventions. A coder trained once could code all three motives on the same pass through a text.

Second, he generalized the procedure beyond TAT-style imaginative protocols. Winter showed that the same categories — adjusted for the linguistic surface of naturalistic speech — could be applied to inaugural addresses, letters, interview transcripts, and other archival text. This opened the door to the at-a-distance studies of political leaders for which Winter became best known, and to large historical comparisons that no TAT-based program could have supported.

Third, the manual was tight enough that inter-rater reliability was attainable without exotic effort. The conventional benchmark, category agreement of .85 or higher with expert-coded practice materials, became a routine training target rather than a research aspiration.

It is worth being precise about what Winter's system did and did not accomplish. It did not resolve the long-standing question of why TAT-derived motive scores correlate so weakly with self-report measures of ostensibly the same constructs (the implicit/explicit dissociation that McClelland, Koestner, and Weinberger formalized in 1989). It did not eliminate the labor of training human coders. What it did was give the field a single, defensible, reasonably efficient instrument for content-coding motives in any sufficiently large sample of free text. That, for roughly two decades, was the working state of the art.

State of the art: dictionaries, classifiers, language models

Computerized text analysis enters this story late, and it enters as a partial substitute rather than a clean successor. Three families of methods are worth distinguishing.

Dictionary methods. The General Inquirer (Stone and colleagues, 1966) was the prototype: a program that counts occurrences of words assigned to theory-driven categories. Its descendants in current use are DICTION (Hart, 1984 and onward), oriented toward political and rhetorical text, and LIWC (Pennebaker and colleagues, original 2001, with major revisions through LIWC-22), the most widely used general-purpose dictionary in psychology. Dictionary methods are fast, fully reproducible, and trivially scalable to corpora that no human coder could touch. Their limitation is structural rather than incidental: they count words, not the propositional structure in which words appear. A sentence that negates a power-related verb, or attributes a power motive to a third party, contributes the same dictionary hits as one that asserts the motive of the speaker. For motive coding, where the manuals turn precisely on who wants what, with what intensity, in what direction, the gap between dictionary output and trained-coder output is real and well-documented. Dictionary scores are best understood as features correlated with the underlying construct, not as substitutes for it.
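The structural limitation is easy to demonstrate. The sketch below uses a tiny, hypothetical word list (not the actual categories of the General Inquirer, DICTION, or LIWC) and a bare tokenizer to show that an asserted, a negated, and a third-party-attributed power statement all produce identical counts.

```python
import re

# Hypothetical toy dictionary for illustration; real dictionaries are far
# larger and are not reproduced here.
MOTIVE_WORDS = {
    "power": {"control", "dominate", "influence", "command"},
    "achievement": {"excel", "win", "improve", "succeed"},
    "affiliation": {"friend", "together", "love", "reunite"},
}


def dictionary_counts(text):
    """Count word hits per category, ignoring all sentence structure."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = {category: 0 for category in MOTIVE_WORDS}
    for token in tokens:
        for category, words in MOTIVE_WORDS.items():
            if token in words:
                counts[category] += 1
    return counts


asserted = "I want to control the committee."
negated = "I do not want to control the committee."
attributed = "She said he wants to control the committee."

for sentence in (asserted, negated, attributed):
    print(sentence, dictionary_counts(sentence))
```

All three sentences register one power hit and nothing else. A trained coder working from a manual would treat them differently, and that difference is exactly what a word count cannot see.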

Supervised classifiers. Beginning in the 2000s, machine-learning approaches tried to close that gap by training models directly on human-coded protocols. Pang and Pennebaker's work on linguistic features, and a scattered literature on supervised motive scoring, demonstrated that classifiers trained on Winter-coded material could recover much of the human signal — provided the training corpus was large enough and the test corpus was distributionally similar. The familiar caveats apply: supervised models inherit the labels they are trained on, including the labels' errors and the labels' implicit definition of the construct, and they generalize uneasily across genre and era.

Large language model scoring. The current frontier is the use of general-purpose large language models, prompted with the relevant scoring manual or with curated exemplars, to score text directly for motive content. Several research groups have published comparisons of LLM-produced motive scores against expert-coded benchmarks; the results are encouraging in the sense that agreement statistics approach, and are sometimes competitive with, agreement between human coders on the same materials. This is consistent with the broader pattern in computational social science, where LLMs have proved capable of imitating trained-rater behavior on a variety of constrained annotation tasks. It is also subject to the construct-validation tradition's longest-standing warning: agreement with a reference standard is a necessary condition for valid scoring, not a sufficient one. The reference standard is itself an operational definition. A scoring system (human, dictionary, classifier, or model) can agree closely with that operational definition while drifting from the construct the manual was originally intended to operationalize.

If you want to see this lineage applied to your own narrative protocols, the Picture Story Exercise (PSE) on this site estimates your standing on achievement, affiliation, and power motives using an automated, deterministic lexical content-analysis approximation. It follows the tradition of Winter's running-text categories but involves no LLM and no manual coding. Readers interested in adjacent self-report instruments may also want to compare with the SADT schizoid–avoidant battery, which sits on the explicit side of the explicit/implicit divide discussed above.

Selected references

Atkinson, J. W. (Ed.). (1958). Motives in fantasy, action, and society. Van Nostrand.
Hart, R. P. (1984). Verbal style and the presidency: A computer-based analysis. Academic Press.
McClelland, D. C., Atkinson, J. W., Clark, R. A., & Lowell, E. L. (1953). The achievement motive. Appleton-Century-Crofts.
McClelland, D. C., Koestner, R., & Weinberger, J. (1989). How do self-attributed and implicit motives differ? Psychological Review, 96(4), 690–702.
Morgan, C. D., & Murray, H. A. (1935). A method for investigating fantasies: The Thematic Apperception Test. Archives of Neurology and Psychiatry, 34(2), 289–306.
Murray, H. A. (1938). Explorations in personality. Oxford University Press.
Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The development and psychometric properties of LIWC2015. University of Texas at Austin.
Stone, P. J., Dunphy, D. C., Smith, M. S., & Ogilvie, D. M. (1966). The General Inquirer: A computer approach to content analysis. MIT Press.
Winter, D. G. (1973). The power motive. Free Press.
Winter, D. G. (1994). Manual for scoring motive imagery in running text (4th ed.). Department of Psychology, University of Michigan.
