Comparative data and localization

2026-03-10

When working with actual language data, one major concern is how to represent that data. Often, the things we as researchers want to compare are not strictly equivalent, so we must use the best approximations available. This introduces noise into any study, which can be mitigated to some degree by increasing the size of a dataset, but how do you ensure the data is comparable to begin with?

Comparing linguistic data

Much of the comparative data linguists traditionally work with has been oriented toward understanding language history: how sound patterns change and how languages are related to each other. Comparing the sounds of cognate words to group related languages (and reconstruct their ancestral states or “proto-languages”) has been a central concern of linguists, pursued perhaps most famously by the Brothers Grimm. It’s only relatively recently that linguists have even considered applying similar comparative approaches to other units of analysis like syntax.

To compare words between languages, linguists developed the International Phonetic Alphabet (IPA), a systematic framework for notating the actual sounds that humans (and presumably other animals) can produce. The IPA took time and effort to develop, with various revisions over the years (a topic deserving its own blog post), but by now it is extremely rare to encounter a speech sound that can’t be transcribed. As anyone who has taken an introductory phonetics course will tell you, once you master the transcription system it is hard to forget, but it does require consistent practice to train your ear to recognize the various sounds and their places/manners of articulation.

Dealing with different scripts

Transcription in IPA is basically the gold standard for comparative data, but even at this fine-grained level noise can get introduced. This is because different linguists will hear slightly different sounds. In general the differences tend to be minor, and cognate sets can still be readily established, but it’s something to keep in mind. It also depends on what you’re intending to compare - if your interest is mainly syntax, then the comparative data may not need to be as detailed or equivalent at the phonetic/phonemic level, but you might need more detail about morpheme breaks, for example.

In the case of the taggedPBC we have a large set of data available to us, but there are a number of different challenges in terms of making it comparable. One concern was that although the original Parallel Bible Corpus (PBC) had a lot of parallel verses, many of the texts used different scripts. The challenge here was to represent each of the texts in a similar fashion so that they could be compared and, ultimately, annotated for additional comparison.

In order to achieve this for the taggedPBC, I decided to convert all scripts to a romanized form. While roman characters are not as precise as a phonemic transcription, they still get you closer to representing the actual sounds people say than a pictographic or “abjad” representation would. The former might correspond to a complete syllable/word, while the latter traditionally represents only consonants. In the case of Arabic and other scripts that started as abjads, the use of diacritics does allow for nearly phonemic transliteration. There has also been some work toward romanizing various scripts, and there’s even a Python library - uroman - that implements computer transliteration.
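At its core, this kind of transliteration is character-to-character mapping through a lookup table (tools like uroman cover far more scripts and add contextual rules on top). Here’s a minimal, stdlib-only sketch of the idea, using a tiny hand-made Greek-to-Latin table purely for illustration - the mapping is my own toy subset, not uroman’s actual tables:

```python
# Toy sketch of table-based transliteration - the core idea behind
# tools like uroman, which cover far more scripts plus context rules.
# This mapping is a tiny illustrative subset, not a full scheme.

GREEK_TO_LATIN = {
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e",
    "κ": "k", "λ": "l", "ν": "n", "ο": "o", "π": "p",
    "ρ": "r", "σ": "s", "τ": "t", "ά": "a", "έ": "e",
}

def romanize(text: str) -> str:
    """Map each character through the table; pass unknown characters through."""
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in text.lower())

print(romanize("Νεπάλ"))  # → nepal
```

The pass-through for unknown characters matters in practice: parallel texts mix scripts, punctuation, and numerals, and you want those to survive untouched rather than raise errors.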

The case of Chinese

One particularly tricky case, however, is that of the Chinese varieties. Many different “dialects” are spoken across the major regions of China, united by a common writing system. Although Chinese characters can be read by most educated speakers, local pronunciations differ considerably - so much so that a person from one region may find it difficult to understand someone from a faraway region if they each speak their local variety. This is similar to the situation in Germany, where a local from the north might find conversing with a southerner rather difficult unless they speak to each other in Standard (High) German, which is learned in school.

For the PBC there are two Bible translations written in Chinese characters - one in Mandarin and one in Cantonese. But both sets of characters are treated by the uroman tool as “Chinese” and are transliterated accordingly: despite the pronunciations being rather different in the original, the resulting texts end up looking much more similar than they should. To address this I found a different tool - pycantonese - to romanize the Cantonese text, with a much better result.
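To see why a single “Chinese” transliteration loses information, compare Mandarin pinyin and Cantonese Jyutping readings for the same characters. The readings below are common citation forms I’m supplying for illustration; in an actual pipeline you’d pull the Cantonese side from a tool like pycantonese rather than a hand-built table:

```python
# Same characters, different romanizations: Mandarin pinyin vs.
# Cantonese Jyutping (tone numbers). Readings here are common
# citation forms given for illustration only - a real pipeline
# would use a dedicated tool (e.g. pycantonese) for Cantonese.

READINGS = {  # char: (Mandarin pinyin, Cantonese Jyutping)
    "我": ("wo3", "ngo5"),
    "你": ("ni3", "nei5"),
    "學": ("xue2", "hok6"),
}

def compare(text: str) -> list[tuple[str, str, str]]:
    """Return (char, mandarin, cantonese) for each character in the table."""
    return [(ch, *READINGS[ch]) for ch in text if ch in READINGS]

for ch, man, can in compare("我學"):
    print(ch, man, can)
```

A script-level transliterator that picks one reading per character collapses exactly this distinction, which is why the two texts came out looking artificially similar.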

Need for better NLP tools & localization

This highlights the need for better localization efforts for many languages. While some work has been done along these lines (see my paper for a few links to libraries supporting word segmentation for various scripts), there is still more to be done. Imagine collecting data in rural China, for example - would you transcribe in IPA or in Chinese characters? If the latter, how would you then be able to compare varieties? For the purpose of linguistic comparison it is important to have a common framework. Hopefully as we continue to work on the many spoken varieties of the world’s languages, we’ll be able to collect this kind of fine-grained data in order to gain further insight into how languages work.