Comparing Implicit Motive Models
Part of my research involves using machine learning and natural language processing to automate content coding. Social science has long recognized that humans make inferences about the world based on information from a variety of sources. Much of what we infer about other people is based on what they say or how they say it. Whole subfields (e.g. sociolinguistics) are devoted to studying how what we say reflects larger constructs (such as social categories like class).
Content coding and scale development
Content coding is one way of measuring variables in text that can be related to theoretical constructs, with many such scales being developed over the years for various purposes. I would note that while it is relatively easy to develop a scale (see some recent work on tools for scale development here), it is much more difficult to develop a reliable scale that consistently measures a construct of interest and can also be applied systematically.
In psychology, there has been a good deal of work on developing scales for measuring personality constructs. Most of these scales are based on self-report questionnaires, and while they can be effective (Big Five, MBTI) and are widely used by practitioners, they are subject to the general problems of self-reported data (e.g. social desirability, recall, and reference biases). In contrast, content coding can offer a more valid measure, as participants are unaware of how what they write or say could affect the outcome.
In light of this, some early work on personality used content coding to derive personality profiles for individuals. This started primarily with the Thematic Apperception Test (TAT), which gets participants to generate stories based on prompts or images. The TAT was developed more systematically for personality assessment as the Picture Story Exercise (PSE) which is a well-validated means of measuring personality.
Implicit motives and automation
The PSE is a standardized set of pictures and related prompts that request respondents to write stories about the pictures they are presented with. Each story is then coded according to the theory of Implicit Motives, following the Winter (1994) manual, to give a motive score in three categories: Achievement, Affiliation, and Power. The motive profile of an individual is considered to be the relative total amount of Achievement, Affiliation, and Power imagery in their responses (often corrected for wordcount), and correlates with a host of behavioural outcomes.
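To make the wordcount correction concrete, the snippet below is a minimal sketch (the `motive_profile` helper is hypothetical, not part of any coding package) of how raw imagery counts are conventionally expressed as a density, here as images per 1,000 words:

```python
def motive_profile(counts, word_count):
    """Convert raw imagery counts into images per 1,000 words (rounded)."""
    return {motive: round(n / word_count * 1000, 2) for motive, n in counts.items()}

# e.g. a participant whose PSE stories total 450 words, coded with
# 3 Achievement, 1 Affiliation, and 5 Power images
profile = motive_profile({'Ach': 3, 'Aff': 1, 'Pow': 5}, word_count=450)
print(profile)  # {'Ach': 6.67, 'Aff': 2.22, 'Pow': 11.11}
```

This relative profile (here Power-dominant) is what is typically correlated with behavioural outcomes, rather than the raw counts.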
Content coding of variables is an extremely time-consuming process and requires training, which becomes more burdensome with the complexity of the coding system. In the case of TATs and implicit motive coding, automation attempts have been made since the 1960s, but most approaches tried to do this with dictionaries. Since imagery does not depend solely on the meaning of individual words, such approaches largely failed to achieve the kind of reliable correlations that would allow them to replace the manual human-annotated method.
However, in the last few years, with the advent of more complex language models and related tools, automation of implicit motive coding seems to be within reach. As a case in point, there are now three (!!) models that can code text for implicit motives. There are some necessary caveats regarding the interpretation of implicit motive codes for non-PSE texts, which I will address at the end of this post. Most of the rest of this post introduces the three models and walks through Python code that illustrates how they can be used to code for motives. We use the Winter (1994) benchmark dataset to compare their output and assess how well they match human-coded scores.
Implicit Motive coding models
There are three recent models that apply machine learning and NLP techniques to classify text for implicit motives, using more recent transformer architectures and improving on efforts by Pang & Ring (2020). Brede et al trained a model using the setfit library, where sentence embeddings serve as features for a model trained on PSE data to classify text. Nilsson et al use a similar approach, finetuning a RoBERTa model on PSE data to classify text for individual motives. Pang et al use ELECTRA, trained on a combination of PSE and other data, to classify text for all three motives.
Each of the models is available for use by researchers, and all can be loaded with the transformers library, hosted on Huggingface (a community platform for AI models). Nilsson et al's model is available via the theharmonylab organization and can also be found as part of the text library for R. Brede et al's model is part of the automatedMotiveCoder organization, and Pang et al's ELECTRA model is available from encodingai.
The remainder of this post illustrates how the models can be used easily via Python. While I program in R as well as Python, Python is my primary programming language, so for ease of illustration it's what we'll be using below. Additionally, we'll be accessing the benchmark dataset from Winter (1994), as recommended by the Pang & Ring (2020) paper and used for comparison in the Pang et al paper.
Loading the models in Python
The first thing we need to do is set up an environment for running the code, and to download the models themselves, which may require a (free) Huggingface account. I’ll assume that you have a virtual environment running Python 3.9 or higher (I’ve tested this code on both Python 3.9 & 3.10), and that you’re somewhat familiar with running terminal commands and Python scripts. As you follow along with this post, you should be able to copy/paste each code block into the same Python script sequentially, in order to run it in the terminal.
To make sure things are set up properly, with the virtual environment active, ensure that you have all the necessary libraries installed with the following command in the terminal.
pip install "transformers<5.0" setfit pandas openpyxl
The double quotes surround the transformers package name and a version constraint that ensures we get a release of the library compatible with the setfit library (used for the Brede et al model). The pandas library is for handling dataframes, and the openpyxl library allows us to work with Excel spreadsheets. Using transformers may also require you to create a (free) account on huggingface.co in order to download the models. Once you have ensured that you have the necessary libraries, we can get the actual models, which for all three can be done with the following code.
from transformers import pipeline # load transformers pipeline
# load the electra model (~520mb on first download)
model1 = "encodingai/electra-base-discriminator-im-multilabel-V3"
# instantiate the classifier with the text-classification pipeline
classifier1 = pipeline("text-classification", model=model1)
name1 = "electra" # store the name for print statements
from setfit import SetFitModel # load the setfit pipeline
# load the setfit model (~2.25gb on first download)
model2 = "automatedMotiveCoder/setfit"
# instantiate the classifier with the setfit pipeline
classifier2 = SetFitModel.from_pretrained(model2)
name2 = "amc-setfit" # store the name for print statements
# load the RoBerta models (~1.42gb for each model, ~4.26gb total on first download)
# and instantiate each classifier with the text-classification pipeline
model3 = "theharmonylab/implicit-motives-achievement-roberta-large"
classifier3ach = pipeline("text-classification", model=model3)
model3 = "theharmonylab/implicit-motives-affiliation-roberta-large"
classifier3aff = pipeline("text-classification", model=model3)
model3 = "theharmonylab/implicit-motives-power-roberta-large"
classifier3pow = pipeline("text-classification", model=model3)
name3 = "roberta" # store the name for print statements
Note that the ELECTRA and RoBERTa models both use the transformers pipeline for text classification, while the AutomatedMotiveCoder (AMC) model uses the setfit pipeline. There is also a difference in size and format: ELECTRA is a single 0.5gb model for multilabel classification, the AMC is a single 2.25gb model comprised of 4 "one-vs-rest" classifiers, and the RoBERTa model is actually 3 separate models (one for each motive) of 1.42gb each (4.26gb total).
Running this code will allow you to download the models. Once they are downloaded they will be stored in your cache and used for future inference. Now that we have the models, let’s test them on a single sentence to make sure everything is working properly. The sentence we will use is taken from the Winter (1994) training manual, and is scored for both Affiliation and Power.
# this is a sentence from the Winter manual that is double-scored for Power and Affiliation
sentence = """The recollection of skating on the Charles, and the time she had
pushed me through the ice, brought a laugh to the conversation; but
it quickly faded in the murky waters of the river that could no
longer freeze over."""
# predict on the test sentence using the electra model and return probabilities
result = classifier1(sentence, top_k=3) # we want all three labels, not just the most likely
scores1 = {x['label']: x['score'] for x in result}
print(f'Probabilities for each label (test sentence, {name1}): {scores1}')
rounded1 = {k: int(round(v)) for k, v in scores1.items()} # round the scores
print(f'Rounded scores for each label (test sentence, {name1}): {rounded1}')
# Probabilities for each label (test sentence, electra): {'Aff': 0.9999632835388184, 'Pow': 0.9999274015426636, 'Ach': 2.489300641173031e-06}
# Rounded scores for each label (test sentence, electra): {'Aff': 1, 'Pow': 1, 'Ach': 0}
# predict on the test sentence using the setfit model and return probabilities
# the setfit model returns a score for each of the three motives and 'null'
# since we only want Ach, Aff, Pow, we only get the first three indices
result = classifier2.predict_proba(sentence).numpy() # ach, aff, pow, null
scores2 = {'Ach': float(result[0]), 'Aff': float(result[1]), 'Pow': float(result[2])}
print(f'Probabilities for each label (test sentence, {name2}): {scores2}')
rounded2 = {k: int(round(v)) for k, v in scores2.items()} # round the scores
print(f'Rounded scores for each label (test sentence, {name2}): {rounded2}')
# Probabilities for each label (test sentence, amc-setfit): {'Ach': 0.018920687518636643, 'Aff': 0.04094580107598384, 'Pow': 0.6419628438800001}
# Rounded scores for each label (test sentence, amc-setfit): {'Ach': 0, 'Aff': 0, 'Pow': 1}
# predict on the test sentence using the roberta model and return probabilities
# the roberta models return two probabilities - 'LABEL_0' is the probability of no classification
# and 'LABEL_1' is the probability of a classification
# since we only want the likelihood of classification, we get just the score for 'LABEL_1'
resultach = [{x['label']: x['score'] for x in classifier3ach(sentence, top_k=2)}['LABEL_1']] # achievement
resultaff = [{x['label']: x['score'] for x in classifier3aff(sentence, top_k=2)}['LABEL_1']] # affiliation
resultpow = [{x['label']: x['score'] for x in classifier3pow(sentence, top_k=2)}['LABEL_1']] # power
result = [dict(zip(['Ach', 'Aff', 'Pow'], item)) for item in list(zip(resultach, resultaff, resultpow))]
scores3 = result[0]
print(f'Probabilities for each label (test sentence, {name3}): {scores3}')
rounded3 = {k: int(round(v)) for k, v in scores3.items()} # round the scores
print(f'Rounded scores for each label (test sentence, {name3}): {rounded3}')
# Probabilities for each label (test sentence, roberta): {'Ach': 0.0005982170696370304, 'Aff': 0.47605016827583313, 'Pow': 0.38252514600753784}
# Rounded scores for each label (test sentence, roberta): {'Ach': 0, 'Aff': 0, 'Pow': 0}
If you have loaded everything correctly, you should see that all three models are able to score the sentence, with some differences in classification. While the ELECTRA and AMC models score the sentence for Power, only ELECTRA double-scores the sentence for Affiliation as well. We can also observe that the RoBERTa model differs in its probability scores from the other models, approaching a correct double-classification but ultimately not reaching the 0.5 probability that would round up to 1.
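Rounding at 0.5 is just one possible decision rule. As an illustration, the sketch below (the `apply_threshold` helper is hypothetical, not part of any of the model packages) binarizes label probabilities at an arbitrary cutoff; lowering the cutoff makes the coder more liberal, trading precision for recall. The scores are the RoBERTa probabilities for the test sentence, truncated to four decimals:

```python
def apply_threshold(probs, cutoff=0.5):
    """Binarize label probabilities at a chosen decision threshold."""
    return {label: int(p >= cutoff) for label, p in probs.items()}

scores = {'Ach': 0.0006, 'Aff': 0.4761, 'Pow': 0.3825}
print(apply_threshold(scores))               # {'Ach': 0, 'Aff': 0, 'Pow': 0}
print(apply_threshold(scores, cutoff=0.35))  # {'Ach': 0, 'Aff': 1, 'Pow': 1}
```

Whether a non-default threshold is appropriate would need to be validated against human-coded data rather than tuned by eye.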
Getting the Winter 1994 benchmark dataset and testing models
This is hardly a fair comparison, however, and test datasets can certainly affect the performance of a model, particularly depending on whether or not the test data has a similar distribution to the training data. To compare the models on a larger dataset we can make use of the benchmark dataset suggested by Pang & Ring (2020), which is available via this OSF link.
The data is from Winter (1994), the manual used to train implicit motive researchers in the coding system for identifying Achievement, Affiliation, and Power imagery in text. Pang & Ring split the training sets from the manual into sentences (1,358) and checked/rescored the data, ignoring the “second sentence rule” (see the paper for more details). Human coders are expected to achieve over 0.85 agreement with this dataset in order to be considered trained in the coding system and able to score other texts. We would expect an automated model to achieve similar agreement in order to be able to use it for research.
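To make the agreement criterion concrete, simple percentage agreement between two sets of binary codes can be sketched as below (`agreement` is a hypothetical helper, not part of the coding manual or any of the model packages):

```python
def agreement(coder_a, coder_b):
    """Proportion of matching binary codes between two coders."""
    assert len(coder_a) == len(coder_b), "coders must score the same sentences"
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# toy example: sentence-level codes from a human coder and a model
human = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
model = [1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
print(agreement(human, model))  # 0.8 - below the 0.85 training criterion
```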
The following code assumes that you have downloaded the excel spreadsheet and renamed it as Winter1994_sentences.xlsx. Here we read the data into a dataframe, get the list of sentences, and classify the sentences using each model that we previously loaded. We then store each set of predictions in its own dataframe and check correlations with the original dataset.
import pandas as pd # import pandas to work with excel data
# Winter benchmark dataset: https://osf.io/6fnz5
dataset = "Winter1994_sentences.xlsx" # downloaded and renamed dataset
df = pd.read_excel(dataset) # read the data
texts = df['Text'].tolist() # get the sentences as a list
# predict probabilities for each sentence using electra model
result = classifier1(texts, top_k=3)
predictions = [] # store rounded predictions in this list
for res in result:
    predictions.append({x['label']: int(round(x['score'])) for x in res}) # rounded
edf = pd.DataFrame.from_records(predictions) # make a new dataframe to store the results
# print(edf.head())
# predict probabilities for each sentence using setfit model
result = classifier2.predict_proba(texts).numpy() # ach, aff, pow, null
predictions = [] # store rounded predictions in this list
for res in result:
    predictions.append({'Ach': int(round(res[0])), 'Aff': int(round(res[1])), 'Pow': int(round(res[2]))}) # rounded
setdf = pd.DataFrame.from_records(predictions) # make a new dataframe to store the results
# print(setdf.head())
# predict probabilities for each sentence using RoBerta model
resultach = [{x['label']: x['score'] for x in res}['LABEL_1'] for res in classifier3ach(texts, top_k=2)] # achievement
resultaff = [{x['label']: x['score'] for x in res}['LABEL_1'] for res in classifier3aff(texts, top_k=2)] # affiliation
resultpow = [{x['label']: x['score'] for x in res}['LABEL_1'] for res in classifier3pow(texts, top_k=2)] # power
result = [dict(zip(['Ach', 'Aff', 'Pow'], item)) for item in list(zip(resultach, resultaff, resultpow))]
predictions = []
for res in result:
    predictions.append({k: int(round(v)) for k, v in res.items()}) # rounded
robdf = pd.DataFrame.from_records(predictions)
# print(robdf.head())
# check correlations
print(f'Winter 1994 Pearson correlations for {name1}:\n{df.corrwith(edf, drop=True)}\n')
print(f'Winter 1994 Pearson correlations for {name2}:\n{df.corrwith(setdf, drop=True)}\n')
print(f'Winter 1994 Pearson correlations for {name3}:\n{df.corrwith(robdf, drop=True)}\n')
# Winter 1994 Pearson correlations for electra:
# Pow 0.784608
# Ach 0.855252
# Aff 0.770717
#
# Winter 1994 Pearson correlations for amc-setfit:
# Pow 0.606628
# Ach 0.593990
# Aff 0.639362
#
# Winter 1994 Pearson correlations for roberta:
# Pow 0.652232
# Ach 0.597013
# Aff 0.711778
Here we can see that the ELECTRA model shows higher Pearson correlations with this dataset. Arguably, however, Pearson is not the right metric to use for agreement, so let's see if we can observe the relationship a bit differently.
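One common chance-corrected agreement statistic is Cohen's kappa. As a minimal sketch, the hypothetical helper below implements the binary two-coder case from scratch on toy data (scikit-learn's `cohen_kappa_score` computes the same statistic, if you prefer a library function):

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement between two binary coders (minimal sketch)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    p1 = (sum(a) / n) * (sum(b) / n)             # chance both code 1
    p0 = (1 - sum(a) / n) * (1 - sum(b) / n)     # chance both code 0
    pe = p1 + p0                                 # expected agreement by chance
    return (po - pe) / (1 - pe)

actual    = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0]
print(round(cohens_kappa(actual, predicted), 3))  # 0.583
```

In this post you could pass e.g. `df['Pow']` and `edf['Pow']` instead of the toy lists.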
Comparing results with visualization
The Pearson correlations can easily be returned using the built-in function from pandas, as illustrated above. The library also has a way of visualizing actual counts of correspondences in a table. The code below flattens the Winter (1994) dataset and the predictions for each model, then calculates how many match. First we set up a Python dict to easily iterate through the dataframes (we will reuse this later); then we get the raw counts of correspondences, i.e. the matches and discrepancies.
dfdict = {name1: edf, name2: setdf, name3: robdf} # set up a dict to easily compare dataframes
cols = ['Ach', 'Aff', 'Pow'] # the columns we're comparing
# flatten the two dataframes with 3 columns: 'Pow', 'Ach', 'Aff'
actual_flat = df[cols].melt(var_name='column', value_name='actual')['actual']
# for each dataframe of results, get a confusion matrix
for pred in dfdict.keys():
    pred_flat = dfdict[pred][cols].melt(var_name='column', value_name='predicted')['predicted']
    # get a confusion matrix of the correspondences using `crosstab`
    confusion_matrix = pd.crosstab(actual_flat, pred_flat,
                                   rownames=['Actual'],
                                   colnames=['Predicted'])
    print(f'Confusion matrix for {pred} (totals):\n', confusion_matrix) # visualize
# Confusion matrix for electra (totals):
# Predicted 0 1
# Actual
# 0 3537 64
# 1 100 373
#
# Confusion matrix for amc-setfit (totals):
# Predicted 0 1
# Actual
# 0 3456 145
# 1 168 305
#
# Confusion matrix for roberta (totals):
# Predicted 0 1
# Actual
# 0 3414 187
# 1 115 358
You might notice that this doesn't differentiate between the different labels ('Ach', 'Aff', 'Pow'), and it might be helpful to visualize in a different way. To get the correspondences by motive code, we need a slightly more complex approach. For this we can use the scikit-learn library, and for visualization we can use matplotlib and seaborn. Ensure these libraries are installed in your virtual environment with the following terminal command:
pip install scikit-learn seaborn matplotlib
The following code imports the necessary libraries and functions. It then defines a function to create a figure with subplots for each motive.
# import libraries and functions
from sklearn.metrics import multilabel_confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt
def plot_confusion_matrix(confusion_matrix, axes, class_label, class_names, fontsize=12):
    """
    Plot a confusion matrix for each class label, taking the computed matrix, axes, names as arguments.
    """
    # make a dataframe for the confusion matrix results
    df_cm = pd.DataFrame(
        confusion_matrix, index=class_names, columns=class_names,
    )
    try:
        # use the heatmap plotting function to visualize
        heatmap = sns.heatmap(df_cm, annot=True, fmt="d", cbar=False, ax=axes)
    except ValueError:
        raise ValueError("Confusion matrix values must be integers.")
    # populate the labels
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    axes.set_ylabel('True label')
    axes.set_xlabel('Predicted label')
    axes.set_title("Confusion Matrix for " + class_label)
The code below puts this all together. We first create a list of the actual codes (the Winter 1994 benchmark data), and then for each model's results we construct a multilabel matrix with the actual codes and the predicted codes (using the dict that we instantiated earlier). Finally, we plot each result as a heatmap using the plot_confusion_matrix function defined above. Here we are simply showing the plot as a pop-up window, but you can also save the figure to a file by uncommenting the relevant line.
# make a list of the actual (expected) codes
y_expected = [list(map(float, z)) for z in zip(df['Pow'], df['Ach'], df['Aff'])]
# go through each model
for pred in dfdict.keys():
    # make a list of the predicted codes
    y_pred = [list(map(float, z)) for z in zip(dfdict[pred]['Pow'], dfdict[pred]['Ach'], dfdict[pred]['Aff'])]
    # make a multi label confusion matrix
    matrix = multilabel_confusion_matrix(y_expected, y_pred)
    # visualize in the terminal
    print(f'Confusion matrix {pred} model')
    confusion_matrix_A = matrix[0]
    print(confusion_matrix_A)
    confusion_matrix_B = matrix[1]
    print(confusion_matrix_B)
    confusion_matrix_C = matrix[2]
    print(confusion_matrix_C)
    # make a subplot for each motive code
    fig, ax = plt.subplots(1, 3, figsize=(9, 4))
    # plot the results for each motive (label order matches y_expected: Pow, Ach, Aff)
    for axes, cfs_matrix, label in zip(ax.flatten(), matrix, ['Pow', 'Ach', 'Aff']):
        plot_confusion_matrix(cfs_matrix, axes, label, ["0", "1"])
    # make a title
    fig.suptitle(f'Predictions on Winter dataset for {pred} model')
    fig.tight_layout() # tighten the layout
    # plt.savefig(f'Winter_confusion_matrix_{pred}.png') # save the plot
    plt.show() # show the plot
    ## uncomment the code below to print a classification report for each model
    # print(classification_report(y_expected, y_pred, output_dict=False, target_names=['Pow', 'Ach', 'Aff']))
# Confusion matrix electra model
# [[1148 21]
# [ 46 143]]
# [[1226 20]
# [ 11 101]]
# [[1163 23]
# [ 43 129]]
#
# Confusion matrix amc-setfit model
# [[1105 64]
# [ 64 125]]
# [[1223 23]
# [ 52 60]]
# [[1128 58]
# [ 52 120]]
#
# Confusion matrix roberta model
# [[1086 83]
# [ 42 147]]
# [[1217 29]
# [ 48 64]]
# [[1111 75]
# [ 25 147]]
A common set of metrics in machine learning are 'Precision' [true pos / (true pos + false pos)], 'Recall' [true pos / (true pos + false neg)], and 'F1 score' [(2 * Precision * Recall) / (Precision + Recall)]; scores closer to 1.0 are better. These metrics can be computed using sklearn's built-in classification_report. Uncommenting the code in the last line of the loop above produces the following reports for our datasets.
# Classification report electra model
# precision recall f1-score support
#
# Pow 0.87 0.76 0.81 189
# Ach 0.83 0.90 0.87 112
# Aff 0.85 0.75 0.80 172
# Classification report amc-setfit model
# precision recall f1-score support
#
# Pow 0.66 0.66 0.66 189
# Ach 0.72 0.54 0.62 112
# Aff 0.67 0.70 0.69 172
# Classification report roberta model
# precision recall f1-score support
#
# Pow 0.64 0.78 0.70 189
# Ach 0.69 0.57 0.62 112
# Aff 0.66 0.85 0.75 172
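As a sanity check on the report, the metrics can be recomputed by hand from the ELECTRA Power confusion matrix printed earlier ([[1148, 21], [46, 143]]), where 143 sentences are true positives, 21 false positives, and 46 false negatives:

```python
# recompute precision/recall/F1 from the ELECTRA Power confusion matrix
tp, fp, fn = 143, 21, 46
precision = tp / (tp + fp)                          # 143 / 164
recall = tp / (tp + fn)                             # 143 / 189
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.87 0.76 0.81
```

These match the Pow row of the ELECTRA classification report above.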
Another common procedure for identifying correspondence between coders/datasets is the Intra-class Correlation Coefficient (ICC). The pandas library does not have a built-in function for this, but the metric can be computed using the pingouin library. Since this post is already quite long, I won't illustrate the process here, but may write another post in the future to show how it can be computed.
Conclusion
Here I have illustrated how you can automatically code text for implicit motives using three existing models. On my MacBook Pro, running the complete code (loading all three models and running inference on the benchmark dataset) takes about 2 minutes, once the models have been downloaded. There are a few caveats regarding interpretation that we should be aware of, however.
The main issue is that the text we're using for our benchmark was generated using the PSE task. This also reflects the kind of data that the models were trained on, and means that the test data is likely to reflect the distribution (more or less) of the training data. Pang et al (2026) also highlight this, showing that while these models perform decently on text generated via the PSE, text generated in a different context (e.g. political speeches) results in lower correlations for the various models.
Since most of the research on implicit motives uses the PSE task, these models should be quite useful for automating PSE coding. Still, the fact that not all models perform to the same degree on a particular PSE dataset highlights the need for independent evaluation of a given model on such datasets. Pang et al (2026) recommend human-coding a portion of any new dataset and assessing the model's performance against a human coder before continuing to (auto-) code the remainder. Beyond the context of the PSE, more research is needed. Human coders are quite good at adjusting to implicit motive imagery in different kinds of texts, but existing automated implicit motive coding models still need some work to do so consistently.