Things have been quite busy in the year or so since I last posted. My wife and I moved to Zurich with our 4-month-old, I helped organize a workshop on word order here at UZH, a proceedings volume from the last ICAAL that I co-edited got published in December, we attended a bunch of conferences (AAS in Denver, SEALS in Tokyo, ICHL in Canberra), I lost my passport on the way to ICAAL in Chiang Mai (I've since applied for another), and various other things happened.
In the meantime we have also been hard at work digitizing, transcribing, and annotating data from multiple Austroasiatic languages. Alongside this effort we have been developing semi-automated ways of comparing clauses and identifying possible correspondences for syntactic reconstruction. The field of syntactic reconstruction has been gaining traction over the past decade as a viable area for study in historical linguistics (see here, here, and here for some work), and it's exciting to be working on ways that computers can help us in this task.
One interesting observation we can make is that our methodology does actually identify crosslinguistic structural similarities. We can see this in the following plot, which compares the number of clauses deemed 'similar' by our method in two datasets (thanks to Damian Blasi for suggesting this means of assessing our method). The first dataset is our current dataset with over 9,000 clauses annotated. Across 10 languages in 5 subgroups, this results in over 23 million pairwise comparisons. The second dataset is composed of the same clauses, but with the elements in each clause randomized by language.
The plotted lines are the distribution of similarity judgments across each dataset. We can see that, using our method for clause comparison, the randomized dataset shows a normal distribution, which is what we expect from unstructured data. With the same method, however, the dataset of annotated clauses in Austroasiatic languages shows a non-normal distribution.
This tells us that the real language data is structured AND that our method for measuring similarity picks up on this structure, identifying a higher degree of similarity between clauses in languages that we know are related.
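To make the logic of that check concrete, here is a minimal sketch in Python. It is not our actual pipeline: the `similarity` function is a stand-in (a generic sequence-matching ratio from the standard library), the clause labels and language names are invented toy data, and "randomized" is assumed here to mean shuffling the order of elements within each clause. The idea is simply to score every cross-language pair of clauses twice, once on the real data and once on the shuffled copy, and then compare the two sets of scores.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in similarity score between two clauses (0.0 to 1.0).

    Assumption: a generic sequence-matching ratio, not the project's measure.
    """
    return SequenceMatcher(None, a, b).ratio()


def shuffle_clauses(clauses, seed=0):
    """Return a copy of the data with the element order in each clause shuffled."""
    rng = random.Random(seed)
    shuffled = []
    for lang, elements in clauses:
        elements = list(elements)  # copy so the original data is untouched
        rng.shuffle(elements)
        shuffled.append((lang, elements))
    return shuffled


def cross_language_scores(clauses):
    """Score every pair of clauses that come from different languages."""
    return [
        similarity(a, b)
        for (lang_a, a), (lang_b, b) in combinations(clauses, 2)
        if lang_a != lang_b
    ]


# Hypothetical toy data: (language, ordered clause elements).
clauses = [
    ("lang_a", ["S", "V", "O"]),
    ("lang_a", ["S", "AUX", "V", "O"]),
    ("lang_b", ["S", "V", "O"]),
    ("lang_b", ["S", "V", "O", "ADV"]),
    ("lang_c", ["V", "S", "O"]),
]

real_scores = cross_language_scores(clauses)
random_scores = cross_language_scores(shuffle_clauses(clauses))
```

With a real corpus, the two score lists could then be plotted as histograms or density curves to see whether the annotated data departs from the shuffled baseline.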
This raises a lot of new questions and highlights the need for more testing to identify the best way of assessing similarity between clauses in a systematic and linguistically appropriate manner. Fortunately our project is not yet over!
At the end of last year the ICAAL 7 proceedings volume was published by University of Hawai'i Press as a special issue of the Journal of the South-East Asian Linguistics Society (JSEALS). The 8th ICAAL was just held in Chiang Mai, and so I think it's worth writing a bit about the 7th ICAAL proceedings, even at such a late date. The issue was edited by Felix Rau (University of Cologne) and me, and I wrote more details about it in a Twitter thread when it was first released. It was my first attempt at (co-)editing an issue/volume, and it was a good experience, made more so by an excellent co-editor, generally timely responses from authors and reviewers alike, and the support and advice of Mark Alves, Paul Sidwell, and Mathias Jenny. It was such a good experience, in fact, that Felix and I have agreed to edit the proceedings from ICAAL 8.
I won't go into great detail about the papers, since the issue is open-access, and so anyone interested can follow the link above and download/read the abstracts/papers at their leisure. But I do want to highlight a few general points about the encouraging direction it shows for Austroasiatic studies. There is also an extensive backstory to the International Conference on Austro-Asiatic Linguistics (ICAAL) that provides a bit more context. One take on at least part of that backstory can be found here, and more can be found here.
The seventh International Conference on Austro-Asiatic Linguistics (attendees pictured above) was held in Kiel, Germany in 2017. One point that we note in the introduction to these papers is that this is only the fourth published proceedings volume since the conference's inception in 1973. Over a span of 40+ years, 7 ICAAL meetings have been held, and proceedings have been published for just over half. There are various reasons for this, but we hope that this special issue is part of a trend.
Bolstering this trend is the fact that the majority of the papers in this special issue are by relatively young linguists. In the field of Austroasiatic linguistics there are well-known and well-cited scholars such as Harry Shorto, Gerard Diffloth, Philip Jenner, Eugenie Henderson, Norman Zide, Geoffrey Benjamin, Ilia Peiros, Patricia Donegan and Michel Ferlus, but the majority of their work was done in the 60s-90s. Some of these scholars have passed on, and only a few scholars such as Niclas Burenhult, Nicole Kruspe, Paul Sidwell, Greg Anderson, Mark Alves, Nathan Badenoch and Mathias Jenny have 'carried the torch', as it were, and worked to extend and expand our knowledge of the Austroasiatic languages, especially in the last 10 years. Thanks to their efforts, however, and especially to their mentorship, there is a growing number of young scholars who are working on these languages, providing important insights and datasets for other scholars.
The focus on data and the attempt to make primary data accessible is a particularly heartening feature of this issue. The data is accessible either through online, open-access repositories or through included examples, tables, or appendices. While previous work on Austroasiatic languages included such data, the inclusion of online repositories follows a growing trend in the social sciences where underlying data can be assessed and results can be replicated by other scientists, or an analysis can be contradicted or refined. The main benefit of this 'open science' approach is that the focus is taken off the individual and whether they argue well for a position, and is placed instead on what the best interpretation of the data is, and whether the data supports the individual's argument. This is only possible when the data is accessible.
The last post was a bit of a brain dump to make sure I didn't forget a few lessons I learned, in part because I knew I was quitting the job that involved doing ML type things.
While I was working there of course I learned a lot and (I think) acquitted myself pretty well, but language processing and machine learning are not really what I spent 4 years doing for my PhD. Python is a programming language that I picked up to make my work in grammatical description and syntax easier, and while I find ML (and programming) pretty interesting, my main interest lies in understanding how languages work through comparison, with the ultimate goal of reconstructing linguistic structures and (hopefully) prehistory.
A year and a half ago or so I started working on a grant proposal for that exact thing with some researchers at the University of Zurich. This is a relatively young department that is doing some really cool research in typology, processing, and language acquisition from a corpus-based perspective on multiple languages (both Indo-European and non-IE families/phyla). At the same time historical linguistics is a huge focus in the department, as is modeling language change.
This is super exciting because I take the perspective that language is spoken by individuals in communities who acquire language from their forbears (history), use it as a tool for communication (processing), which gives rise to statistical tendencies that all languages share (typology). Since it is individuals using language, this is done in an idiosyncratic way, but since language is learned and guided by principles of processing, the only way to get at both the commonality and the uniqueness of language is by investigating actual language corpora (recordings, transcriptions, etc). Of course the story of how languages change is much more complex and involves many more factors, but be that as it may, this is a great place to be.
Picture: Lake Zurich from the hill above the university
So, long story short, we found out last October that the grant had been funded, and the family and I started making plans to move to Zurich. More on that later, perhaps.
With this project, our goal at the moment is to build a database of Austroasiatic language corpora that we can then investigate for all sorts of interesting phenomena, but focusing (initially at least) on word order. By comparing word order in multiple languages of the same family we intend to make an effort toward reconstructing the form of the parent languages from which the present-day spoken languages diverged, and also to identify language contact and interaction effects to contribute to discussions about the development of word order patterns cross-linguistically.
I've been here only a few weeks, and our first year of the project involves a lot of data collection, so I'll be traveling quite a bit and having to learn some more languages (working on Swiss-German and Burmese right now). But even in the first few weeks we've made some progress and I'm excited to share more as the research continues. I'm definitely doing Python programming, and it looks like I'll learn some JavaScript for various tools we intend to build. Maybe I'll even get to use machine learning at some point.