The taggedPBC - a massive dataset for crosslinguistic investigations

2025-05-22

About a year ago I was looking for annotated texts to investigate some questions I had about word order. My main question was whether it was possible to extract word order from corpora, but I realized that none of the datasets I was looking at contained the information that would allow for this kind of query. To identify word order in a language, you have a couple of options: you can rely on expert opinions regarding the basic word order of a language, or you can examine corpora yourself. The latter option means you have to learn the language to some degree, and with 7,000+ languages in the world that is a tall order.

Finding data

However, what if there were a way to extract this information from a corpus? To get close to extracting word order, you would at least need to be able to identify nouns and verbs in a language. But most of the language data available for the world’s languages is essentially monolingual, with little to no annotation. The exceptions are datasets produced by linguists, such as interlinearized texts or linguistic examples, which contain some additional information about grammar or word classes. Unfortunately, individual datasets (like the Pnar corpus) are difficult to find, and aggregated data of this type largely consists of individual words or short examples. One exception to this general rule is the Universal Dependencies Treebanks project, but even this dataset is fairly small (only 160 or so languages with more than 200 annotated sentences).
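To make the idea concrete, here is a minimal sketch (not the taggedPBC pipeline itself) of how basic word order could be counted from dependency-annotated sentences. It assumes UD-style annotation, with hypothetical token fields `id`, `head`, `deprel`, and `upos`; real CoNLL-U parsing and non-transitive clauses are left out for brevity:

```python
from collections import Counter

def basic_order(sentence):
    """Return 'SVO', 'SOV', etc. for a simple transitive clause,
    or None if a root verb, subject, and object can't all be found.
    sentence: list of dicts with 'id', 'head', 'deprel', 'upos' (UD-style)."""
    verb = next((t for t in sentence
                 if t["deprel"] == "root" and t["upos"] == "VERB"), None)
    if verb is None:
        return None
    subj = next((t for t in sentence
                 if t["head"] == verb["id"] and t["deprel"] == "nsubj"), None)
    obj = next((t for t in sentence
                if t["head"] == verb["id"] and t["deprel"] == "obj"), None)
    if subj is None or obj is None:
        return None
    # Sort the three slots by their linear position in the sentence.
    slots = sorted([(subj["id"], "S"), (verb["id"], "V"), (obj["id"], "O")])
    return "".join(label for _, label in slots)

# Toy example: "Cats chase mice", annotated UD-style.
sent = [
    {"id": 1, "head": 2, "deprel": "nsubj", "upos": "NOUN"},
    {"id": 2, "head": 0, "deprel": "root",  "upos": "VERB"},
    {"id": 3, "head": 2, "deprel": "obj",   "upos": "NOUN"},
]
corpus = [sent]
order_counts = Counter(o for o in map(basic_order, corpus) if o)
```

Tallying `basic_order` over a whole treebank and taking the most frequent pattern gives one rough operationalization of "basic word order" for that language.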

It just so happens that there is a large parallel dataset of Bible texts (the Parallel Bible Corpus) with data for over 1,500 languages. However, it is not annotated, so we still need a way of annotating the data. I wondered whether it might be possible to automate this process with computational tools, and my attempts to do so led to the development of the taggedPBC.
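One common way to automate annotation over a parallel corpus is cross-lingual projection: tag the text in a resource-rich language, then carry those tags across word alignments to each target language. The sketch below illustrates the general idea only, under simplifying assumptions (precomputed alignments, one tag per aligned token); the function name `project_tags` and the toy data are mine, not the taggedPBC's actual method:

```python
def project_tags(src_tags, alignment, tgt_len):
    """Project POS tags from a tagged source verse onto an unannotated
    target verse via word alignments.
    src_tags: one POS tag per source token.
    alignment: (src_idx, tgt_idx) pairs, 0-based.
    tgt_len: number of target tokens.
    Target tokens with no alignment are left as None."""
    projected = [None] * tgt_len
    for s, t in alignment:
        # Keep the first projected tag for each target token.
        if 0 <= s < len(src_tags) and 0 <= t < tgt_len and projected[t] is None:
            projected[t] = src_tags[s]
    return projected

# A tagged source verse aligned to a hypothetical verb-initial target verse.
src_tags = ["NOUN", "VERB", "NOUN"]   # e.g. "God created heaven"
alignment = [(0, 1), (1, 0), (2, 2)]  # source word 1 (the verb) comes first
tgt_tags = project_tags(src_tags, alignment, 3)
```

In practice the hard parts are producing the alignments and handling many-to-one and unaligned tokens, which is where the real computational work lies.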

The taggedPBC

The taggedPBC is based on the Parallel Bible Corpus, expanded with additional language data, with a portion of verses automatically annotated for parts of speech. The baseline dataset is sufficiently annotated for broad comparison, but most of the languages would benefit from much more detailed annotation. More details are at the linked GitHub repo, and I’ve written it up as a paper (The taggedPBC: Annotating a massive parallel corpus for crosslinguistic investigations).

This dataset covers about 25% of the world’s languages, which is still just a drop in the bucket in terms of language data. And there is still a lot of work to be done on annotation for specific languages. But I’m excited to see what kinds of research can be done with this dataset.