When conducting comparative studies on linguistic features, we are essentially observing patterns. In a previous post I mentioned one problem we need to deal with, related to how we represent the features we want to compare. For comparison based on lexical items, our gold standard is transcription in the IPA. Related to this, we can make the observation that patterns in a given language have developed over time, whether patterns in sound structures and related word forms, or patterns in word classes and grammatical structures. While transcriptions and grammatical descriptions (and corpora more generally) reflect a snapshot of language at a given timepoint, the development of these forms may follow relatively idiosyncratic paths.
Since languages (and features in languages) develop at different rates under different conditions, one of the challenges becomes how to identify which changes are consistent with universal properties and which are specific to a particular context. We can also observe that changes in a given language may be due to language-internal or family-internal structural properties, or that the changes may be due to influence from other languages spoken nearby. The potential effect of language family and/or geographical area on language (and cultural) change is known as Galton’s problem, and has long been a concern for linguists and others who research aspects of human culture and behavior. How do we disentangle apparent developments from influences brought about largely by the context in which they occur?
Controlling for descent and area
Because of this concern regarding possible autocorrelation, it is now relatively common to attempt to control for effects of descent (language family) and for geography (language area) in typological studies. Early typological studies did this mainly by sampling (see Greenberg 1963/1990, Dryer 1989, De Garbo et al 2023), in part due to lack of data. The potential for statistical autocorrelation is less of an issue with large datasets, but given that most cultural/linguistic comparison has been (and in many cases continues to be) conducted with small sample sizes, various approaches have been suggested to deal with this. The most comprehensive discussion I have seen regarding different approaches is in Guzman Naranjo and Becker (2022). The general observation is that any linear relationship or correlation between linguistic variables may be confounded by the simple fact of descent from a common ancestor or of the languages being spoken in the same geographical area. One relatively simple way to control for this is by including language family and geographical area as variables in a linear regression via a mixed model.
The rest of this post illustrates one such approach in Python. Here I walk through how to choose variables, how to read and format the data using pandas, and how to conduct both standard linear regressions and linear mixed-effects regressions using the statsmodels library. I show that this approach can replicate findings from current typological research.
Choosing variables and getting data
In order to observe the potential effect of language family and area on particular typological features, we need some features to test. Fortunately for us, a recent paper (Verkerk et al 2025) has investigated a large number of purported universals using data from Grambank, which contains information on grammatical structures in a large number of languages. The dataset they used for testing each universal is available on Github, and contains numbering for each universal as well as datasheets containing observed languages and codes that correspond to presence/absence of the relevant variables.
For the purpose of illustration, we will select the purported universal #0005KA (“If a language has dominant SOV order and the genitive follows the governing noun, then the adjective likewise follows the noun.”), which is an “implicational universal”, a subset of statistical universals in linguistic typology. In the Verkerk et al study, this was found to be significant in a Bayesian analysis without family/area controls, but when language family and area were controlled for, the effect was non-significant. The datasheet for this feature contains a list of Glottocodes for languages, followed by two columns coded for presence/absence based on the Grambank data: “GB133_GB065” (whether a language is SOV and the genitive follows the noun), and “GB193” (whether the adjective follows the noun).
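Before any modeling, it can help to see what a violation of an implicational universal looks like under this kind of binary coding. A minimal sketch with invented glottocodes and values (not the actual Verkerk et al data): a claim of the form “if A then B” is contradicted exactly by rows where the antecedent holds but the consequent does not.

```python
import pandas as pd

# Toy illustration (glottocodes and values invented): an implicational
# universal "if A then B" is violated exactly by rows with A == 1, B == 0.
toy = pd.DataFrame({
    "glottocode": ["aaaa1234", "bbbb1234", "cccc1234", "dddd1234"],
    "A": [1, 1, 0, 0],  # antecedent (e.g. SOV + noun > genitive)
    "B": [1, 0, 1, 0],  # consequent (e.g. noun > adjective)
})

# counterexample languages: antecedent true, consequent false
violations = toy[(toy["A"] == 1) & (toy["B"] == 0)]
print(len(violations))  # → 1 (only "bbbb1234")
```

A strict (absolute) universal would require zero such rows; a statistical universal only requires them to be rarer than chance would predict, which is what the regressions below assess.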
The following Python code assumes you have downloaded the data from the linked Github repository and renamed it as 0005KA-BT_data.txt, and uses pandas to read it into a dataframe.
import pandas as pd # import pandas for tabular data
featfile = "0005KA-BT_data.txt" # tab delimited file with relevant features
# read the tab-delimited file and name the columns
fdf = pd.read_csv(featfile, delimiter="\t", header=None, names=["glottocode", "GB133_GB065", "GB193"])
print(fdf.head()) # visualize the data in the terminal
# glottocode GB133_GB065 GB193
# 0 abun1252 0 1
# 1 area1240 0 1
# 2 axam1237 0 1
# 3 cash1251 1 0
# 4 chuk1273 0 0
# ...
We then want to match up our dependent variable (SOV order + [noun > genitive]) with the independent variable ([noun > adj]) for each language in the dataset. In the case of the Verkerk et al data, the relevant tags have already been coded, which makes it easy to extract. If you are working with other variables, this might take a bit more work.
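If you did need to derive a conjoined antecedent like GB133_GB065 yourself from separate Grambank columns, it amounts to a logical AND of the two binary codings. A hedged sketch with invented values (real Grambank sheets may also contain missing-value codes that would need handling first):

```python
import pandas as pd

# Hypothetical raw codings (glottocodes and values invented)
raw = pd.DataFrame({
    "glottocode": ["aaaa1234", "bbbb1234", "cccc1234"],
    "GB133": [1, 0, 1],  # dominant order is SOV
    "GB065": [1, 1, 0],  # genitive follows the noun
})

# the conjoined antecedent: SOV AND noun > genitive
raw["GB133_GB065"] = (raw["GB133"] & raw["GB065"]).astype(int)
print(raw["GB133_GB065"].tolist())  # → [1, 0, 0]
```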
The next step is to get family and area information for each language in the dataset. For this we can use the language classifications and macroarea categories from Glottolog, which are updated regularly and can be accessed programmatically using the pyglottolog library. This data is also available on Github for the dataset that we are working with, which makes it easier for us. The code below assumes you have downloaded this file into the same directory.
glottofile = "Glottolog_Languages.csv" # comma-separated file with glottocodes and other info
gdf = pd.read_csv(glottofile) # read in the data
print(gdf.head()) # visualize
# glottocode name isocodes ... longitude Family_ID Family_name
# 0 3adt1234 3Ad-Tekles NaN ... NaN afro1255 Afro-Asiatic
# 1 aala1237 Aalawa NaN ... NaN aust1307 Austronesian
# 2 aant1238 Aantantara NaN ... NaN nucl1709 Nuclear Trans New Guinea
# 3 aari1239 Aari aiw ... 36.5721 sout2845 South Omotic
# 4 aari1240 Aariya aay ... NaN book1242 Bookkeeping
# ...
This is more data than we need, so the following code reduces our dataset to only what we are interested in: the relevant languages with the variables of interest and the columns with language family and macroarea information. Note that the data contains roughly 50 isolates (languages with no known relatives) which share NaN as their “Family_name” identifier. In the code below we give each isolate a unique identifier to avoid creating a spurious grouping of unrelated languages.
# reduce the Glottolog data to only grouping factors
gdf = gdf[['glottocode', 'macroarea', 'Family_name']]
# merge this with the feature data in a new dataframe
df = pd.merge(fdf, gdf, on='glottocode', how='inner')
# replace NaN values under 'Family_name' (isolates) with unique identifiers
df['Family_name'] = df['Family_name'].fillna(df['glottocode'])
print(df.head()) # visualize
# glottocode GB133_GB065 GB193 macroarea Family_name
# 0 abun1252 0 1 Papunesia abun1252
# 1 area1240 0 1 Papunesia Austronesian
# 2 axam1237 0 1 Papunesia Austronesian
# 3 cash1251 1 0 South America Pano-Tacanan
# 4 chuk1273 0 0 Eurasia Chukotko-Kamchatkan
# ...
Linear regression (OLS)
We can then set up a standard linear regression with our dependent and independent variables using the statsmodels Python library. The standard linear regression (ordinary least squares) is our base case, without controls. Here we see that the correlation between the dependent and independent variables is significant (p < 0.001).
import statsmodels.formula.api as smf # import statsmodels with the formula api
dv = 'GB133_GB065' # dependent variable (SOV order + [noun > genitive])
iv = 'GB193' # independent variable (noun > adj)
formula = f'{dv} ~ {iv}' # final formula
print("standard linear regression (OLS)")
## model without grouping
model = smf.ols(formula, df) # OLS model
result = model.fit().summary() # run the linear model and get the summary
print(result) # visualize
# standard linear regression (OLS)
# OLS Regression Results
# ==============================================================================
# Dep. Variable: GB133_GB065 R-squared: 0.028
# Model: OLS Adj. R-squared: 0.028
# Method: Least Squares F-statistic: 52.19
# Date: Mon, 16 Mar 2026 Prob (F-statistic): 7.43e-13
# Time: 08:26:43 Log-Likelihood: -1126.2
# No. Observations: 1786 AIC: 2256.
# Df Residuals: 1784 BIC: 2267.
# Df Model: 1
# Covariance Type: nonrobust
# ==============================================================================
# coef std err t P>|t| [0.025 0.975]
# ------------------------------------------------------------------------------
# Intercept 0.4011 0.017 23.713 0.000 0.368 0.434
# GB193 -0.1584 0.022 -7.224 0.000 -0.201 -0.115
# ==============================================================================
# Omnibus: 1764.316 Durbin-Watson: 2.034
# Prob(Omnibus): 0.000 Jarque-Bera (JB): 302.212
# Skew: 0.804 Prob(JB): 2.37e-66
# Kurtosis: 1.785 Cond. No. 2.91
# ==============================================================================
Linear mixed model with grouping
Because we want to observe multiple possible influences on the data, we then use the “linear mixed model” implementation. This requires a “groups” argument that accounts for the effect of grouped observations on possible correlations. To control for family and area, we can take two approaches. The first is to include either family or area individually as a grouping factor, as in the following code.
## models using each group separately as controls
groups = ['Family_name', 'macroarea'] # a list of grouping factors to use as random effects
for g in groups:
    print("grouping variable:", g)
    ## model a single group with a mixed linear model
    model = smf.mixedlm(formula, df,
                        groups=df[g],
                        )
    result = model.fit().summary() # run the model and get the summary
    print(result) # visualize
# grouping variable: Family_name
# Mixed Linear Model Regression Results
# =========================================================
# Model: MixedLM Dependent Variable: GB133_GB065
# No. Observations: 1786 Method: REML
# No. Groups: 232 Scale: 0.0917
# Min. group size: 1 Log-Likelihood: -576.9150
# Max. group size: 391 Converged: Yes
# Mean group size: 7.7
# ----------------------------------------------------------
# Coef. Std.Err. z P>|z| [0.025 0.975]
# ----------------------------------------------------------
# Intercept 0.442 0.031 14.437 0.000 0.382 0.502
# GB193 -0.026 0.018 -1.454 0.146 -0.062 0.009
# Group Var 0.139 0.065
# =========================================================
#
# grouping variable: macroarea
# Mixed Linear Model Regression Results
# =========================================================
# Model: MixedLM Dependent Variable: GB133_GB065
# No. Observations: 1786 Method: REML
# No. Groups: 6 Scale: 0.1813
# Min. group size: 55 Log-Likelihood: -1023.9012
# Max. group size: 511 Converged: Yes
# Mean group size: 297.7
# ----------------------------------------------------------
# Coef. Std.Err. z P>|z| [0.025 0.975]
# ----------------------------------------------------------
# Intercept 0.344 0.074 4.625 0.000 0.198 0.490
# GB193 -0.063 0.022 -2.890 0.004 -0.105 -0.020
# Group Var 0.031 0.048
# =========================================================
Here we see that when we control for language family, the correlation is not significant (p = 0.146). But when controlling only for macroarea, the correlation is still significant (p = 0.004).
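Rather than reading p-values off the printed summaries, you can also pull them from the fitted result object: both OLS and mixed-model results in statsmodels expose a `pvalues` attribute indexed by term name. A small self-contained sketch with synthetic data (the values are random, not the Grambank codings):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic binary data standing in for the real features
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "y": rng.integers(0, 2, 100),
    "x": rng.integers(0, 2, 100),
})

fit = smf.ols("y ~ x", toy).fit()
p = fit.pvalues["x"]  # p-value for the slope term, as a float
print(f"p = {p:.3f}")
```

This is handy when looping over many universals, since significance checks can then be done programmatically instead of by inspecting each summary table.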
The second approach is to treat all languages/observations as a single group, and include language family and macroarea as random effects in the linear regression.
print("random effects: Language Family + Area")
## mixed linear model with multiple random effects
vc_form = {g: f"0 + C({g})" for g in groups} # variance-component formula for each grouping factor
df['group'] = 1 # put all observations in a single group so both factors enter as random effects
model = smf.mixedlm(formula, df,
groups=df['group'],
vc_formula=vc_form,
)
result = model.fit().summary() # run the linear model and get the summary
print(result) # visualize
# random effects: Language Family + Area
# Mixed Linear Model Regression Results
# ==========================================================
# Model: MixedLM Dependent Variable: GB133_GB065
# No. Observations: 1786 Method: REML
# No. Groups: 1 Scale: 0.0917
# Min. group size: 1786 Log-Likelihood: -568.6317
# Max. group size: 1786 Converged: Yes
# Mean group size: 1786.0
# ----------------------------------------------------------
# Coef. Std.Err. z P>|z| [0.025 0.975]
# ----------------------------------------------------------
# Intercept 0.402 0.076 5.307 0.000 0.254 0.551
# GB193 -0.029 0.018 -1.610 0.107 -0.064 0.006
# Family_name Var 0.117 0.056
# macroarea Var 0.029 0.075
# ==========================================================
Once family and area are included together as controls, the independent variable (noun > adj) is not significantly correlated (p = 0.107) with the dependent variable (SOV + [noun > gen]). This suggests that such developments are not independent of descent or language contact, and thus do not reflect a universal property of language.
Some thoughts
The goal of this post has been to highlight some of the concerns relevant to linguistic typology when it comes to understanding how languages change and evolve over time, and to provide some Python code to implement controls for language family and area with a simple example. Galton’s problem illustrates a more general property of linguistic data related to collinearity/multicollinearity, where multiple features may all be correlated with the same variables, making it difficult to disentangle the primary cause or driver of change.
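The confound is easy to reproduce with simulated data: if two features are each inherited from a family-level baseline but are otherwise independent, they will look correlated once families are pooled. A minimal sketch, with all numbers invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# 2000 "languages" from 20 "families"; each family has its own baseline
# probability, and both features follow it but are otherwise independent
n, n_fam = 2000, 20
family = rng.integers(0, n_fam, size=n)
base = rng.uniform(0.1, 0.9, size=n_fam)
feat_a = (rng.random(n) < base[family]).astype(int)
feat_b = (rng.random(n) < base[family]).astype(int)

# pooled across families, the features appear correlated...
pooled_r = np.corrcoef(feat_a, feat_b)[0, 1]

# ...but within each family they are independent by construction,
# so the within-family correlations hover around zero
within = [np.corrcoef(feat_a[family == f], feat_b[family == f])[0, 1]
          for f in range(n_fam)
          if feat_a[family == f].std() > 0 and feat_b[family == f].std() > 0]
within_r = float(np.mean(within))

print(f"pooled r: {pooled_r:.2f}, mean within-family r: {within_r:.2f}")
```

This is exactly the pattern the mixed models above are designed to catch: a pooled correlation that evaporates once group structure is taken into account.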
I should note that there are various assumptions we are making here about the data, the primary one being that the model residuals are normally distributed. This kind of analysis is supported in part because we have a large number of observations (1,786); if we had fewer, given the large number of covariates, it is possible that our models wouldn’t converge. We can also observe that this is essentially categorical data, and if we were working with continuous variables (gradients or proportions of some kind) we would likely want to scale the values to avoid certain statistical pitfalls. For individual problems, a slightly different approach might be necessary, and there is some debate about what measure or grouping criteria should be used in linguistic typology.
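For reference, scaling a continuous predictor to z-scores before fitting is a one-liner in pandas; a minimal sketch with made-up proportion values:

```python
import pandas as pd

# hypothetical continuous predictor (values invented)
s = pd.Series([0.10, 0.35, 0.62, 0.80, 0.48])

# standardize to mean 0, standard deviation 1 (z-scores)
z = (s - s.mean()) / s.std()
print(z.mean(), z.std())  # mean ≈ 0, sd ≈ 1
```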
But hopefully this exercise has at least illustrated that implementing controls for possible effects from language family or area is not a difficult task. And by using such techniques, particularly with freely available data, we can verify findings and get closer to understanding the mechanisms of language change that drive the typological similarities and differences we find in the world’s languages.