The past week has been quite exciting, as we've been in the final stages of preparing a manuscript on machine learning. I've been rather quiet about this collaboration since the end of 2017, what with the new focus of the project in Zurich, a growing family, and the general pressures of life, not to mention the difficulty of developing a text classifier using neural networks. This particular classification problem is made harder by the fact that motive content is much more diverse in its linguistic manifestation than, for example, sentiment (positive/negative polarity).
This equal-authored paper, however, marks the first phase of what we hope will be a long-running and ultimately successful collaborative research focus on developing a neural network model that can automate the time-consuming task of assessing implicit motive imagery (a personality measure) in written text. If you're interested in reading about it (we've tried to make the research fields of Implicit Motives, Machine Learning, and Natural Language Processing somewhat accessible), the citation for the online publication is below. And if you follow the link, you should be able to read the paper in its entirety (without needing a Springer access account).
If you're interested in checking out how a text classifier for implicit motives might work, you can visit the web app that we built, which uses an underlying CNN model for classification - keep in mind that not all text content has implicit motive imagery, and that this classifier does not yet perform classifications on par with trained human coders. However, it is a start. I may highlight some of the niceties that this particular neural network model can/can't handle in a future post.
In addition to the paper and links above, we provide trained models, data, and some dataset descriptives on the Open Science Framework website, which can be cited and linked to below.
I am hoping to write up an even more accessible summary of the paper in the coming weeks, but we'll see how it goes. Since the family is currently under COVID 'lockdown' in Singapore with two kids under 2, the only opportunities for doing anything not 'kid-wrangling' related come when the kids are sleeping, which is also when parents have to eat/shower/clean and do housework. This doesn't leave much time for anything else, which is another reason it was exciting to have finished the paper!
Things have been quite busy over the last year or so since I last posted. My wife and I moved to Zurich with our 4-month-old, I helped organize a workshop on word order here at UZH, a proceedings volume from the last ICAAL that I co-edited got published in December, we attended a bunch of conferences (AAS in Denver, SEALS in Tokyo, ICHL in Canberra), I lost my passport on the way to ICAAL in Chiang Mai (I've since applied for another), and various other things happened.
In the meantime we have also been hard at work digitizing, transcribing, and annotating data from multiple Austroasiatic languages. Alongside this effort we have been developing semi-automated ways of comparing clauses and identifying possible correspondences for syntactic reconstruction. The field of syntactic reconstruction has been gaining traction over the past decade as a viable area for study in historical linguistics (see here, here, and here for some work), and it's exciting to be working on ways that computers can help us in this task.
One interesting observation we can make is that our methodology does actually identify crosslinguistic structural similarities. We can see this in the following plot, which compares the number of clauses deemed 'similar' by our method in two datasets (thanks to Damian Blasi for suggesting this means of assessing our method). The first dataset is our current dataset with over 9,000 clauses annotated. Across 10 languages in 5 subgroups, this results in over 23 million pairwise comparisons. The second dataset is composed of the same clauses, but with the elements in each clause randomized by language. The plotted lines are the distribution of similarity judgments across each dataset.
We can see that, using our method for clause comparison, the randomized dataset shows a normal distribution - which is what we expect from unstructured data. With the same method, however, the dataset of annotated clauses in Austroasiatic languages shows a non-normal distribution. This tells us that the real language data is structured AND that our method for measuring similarity picks up on this structure, identifying a higher degree of similarity between clauses in languages that we know are related.
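To make the logic of that comparison concrete, here is a minimal sketch in Python. It is not our actual method: the similarity measure (a sequence-matching ratio from the standard library), the element labels, and the tiny corpus are all assumptions made up for illustration. The only point is the shape of the test, which is to score cross-language clause pairs once on the annotated data and once on a copy where the elements of every clause have been shuffled.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def similarity(clause_a, clause_b):
    """Stand-in similarity (an assumption): ordered overlap of elements, 0.0 to 1.0."""
    return SequenceMatcher(None, clause_a, clause_b).ratio()


def randomize(corpus, seed=0):
    """Return a copy of the corpus with the elements of every clause shuffled."""
    rng = random.Random(seed)
    randomized = {}
    for language, clauses in corpus.items():
        shuffled_clauses = []
        for clause in clauses:
            elements = list(clause)  # copy, so the original corpus is untouched
            rng.shuffle(elements)
            shuffled_clauses.append(elements)
        randomized[language] = shuffled_clauses
    return randomized


def cross_language_scores(corpus):
    """Score every pair of clauses that come from two different languages."""
    scores = []
    for (_, clauses_a), (_, clauses_b) in combinations(corpus.items(), 2):
        for a in clauses_a:
            for b in clauses_b:
                scores.append(similarity(a, b))
    return scores


# Hypothetical toy data: each clause is a list of element labels.
toy_corpus = {
    "language_1": [["S", "V", "O"], ["S", "V", "O", "ADV"], ["S", "NEG", "V", "O"]],
    "language_2": [["S", "V", "O"], ["S", "V", "O", "ADV"], ["NEG", "S", "V", "O"]],
}

annotated = cross_language_scores(toy_corpus)
baseline = cross_language_scores(randomize(toy_corpus))
print("annotated mean similarity: ", round(mean(annotated), 3))
print("randomized mean similarity:", round(mean(baseline), 3))
```

With a real dataset one would compare the full distributions of the two score lists (for example as histograms or density plots) rather than just the means, since the shape of the distribution is what carries the argument.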
This raises a lot of new questions and highlights the need for more testing to identify the best way of assessing similarity between clauses in a systematic and linguistically appropriate manner. Fortunately our project is not yet over!
At the end of last year the ICAAL 7 proceedings volume was published by University of Hawai'i Press as a special issue of the Journal of the South-East Asian Linguistics Society (JSEALS). The 8th ICAAL was just held in Chiang Mai, and so I think it's worth writing a bit about the 7th ICAAL proceedings, even at such a late date. The issue was edited by myself and Felix Rau (University of Cologne), and I wrote more details about it in a Twitter thread when it was first released. It was my first attempt at (co-)editing an issue/volume, and it was a good experience, made more so by an excellent co-editor, timely responses generally from authors and reviewers alike, and the support/advice of Mark Alves, Paul Sidwell, and Mathias Jenny. It was such a good experience, in fact, that Felix and I have agreed to edit the proceedings from ICAAL 8.
I won't go into great detail about the papers, since the issue is open-access, and so anyone interested can follow the link above and download/read the abstracts/papers at their leisure. But I do want to highlight a few general points about the encouraging direction it shows for Austroasiatic studies. There is also an extensive backstory to the history of the International Conference on Austro-Asiatic Linguistics (ICAAL) that provides a bit more context. One take on at least part of that backstory can be found here, and more can be found here.
The seventh International Conference on Austro-Asiatic Linguistics (attendees pictured above) was held in Kiel, Germany in 2017. One point that we note in the introduction to these papers is that this is only the fourth published proceedings volume since the conference's inception in 1973. Over a span of 40+ years, 7 ICAAL meetings have been held, and proceedings have been published for just over half. There are various reasons for this, but we hope that this special issue is part of a trend.
Bolstering this trend is the fact that the majority of the papers in this special issue are by relatively young linguists. In the field of Austroasiatic linguistics there are well-known and well-cited scholars such as Harry Shorto, Gerard Diffloth, Philip Jenner, Eugenie Henderson, Norman Zide, Geoffrey Benjamin, Ilia Peiros, Patricia Donegan and Michel Ferlus, but the majority of their work was done in the 60s-90s. Some of these scholars have passed on, and only a few scholars such as Niclas Burenhult, Nicole Kruspe, Paul Sidwell, Greg Anderson, Mark Alves, Nathan Badenoch and Mathias Jenny have 'carried the torch', as it were, and worked to extend and expand our knowledge of the Austroasiatic languages, especially in the last 10 years. Thanks to their efforts, however, and especially to their mentorship, there is a growing number of young scholars who are working on these languages, providing important insights and datasets for other scholars.
The focus on data and the attempt to make primary data accessible is a particularly heartening feature of this issue. The data is accessible either through online, open-access repositories or through included examples, tables, or appendices. While previous work on Austroasiatic languages included such data, the inclusion of online repositories follows a growing trend in the social sciences where underlying data can be assessed and results can be replicated by other scientists, or an analysis can be contradicted or refined. The benefits of this 'open science' approach are mainly that the focus is taken off of the individual and whether they argue well for a position, and instead the focus is placed on what the best interpretation of the data is, and whether the data supports the individual's argument. This is only possible when the data is accessible.
I just returned from fieldwork in Myanmar, where we spent several days training staff and students at two different universities (Mandalay University and Yangon University of Foreign Languages) in fieldwork and data-gathering techniques. I also spent a week with the team in a village where a Palaung language is spoken (in the mountains of Shan State), conducting fieldwork with staff and students from Mandalay University's Department of Anthropology. We were able to get lots of great recordings of stories, histories, and explanations of cultural practices in the local language, with translations in Burmese.
This will be part of an ongoing collaboration between the University of Zurich and these universities in Myanmar, where we are helping to build capacity and share our different kinds of expertise between these departments. There is still a lot of descriptive work that needs to be done among these language groups, and regarding their connections to larger Myanmar society. I'm still learning a lot about how Myanmar culture and language work (I'll be taking a Burmese language course at UZH in the coming semester), and I'm looking forward to working more with the universities in Myanmar in the future.
On some of our 'off' days, we went sightseeing in Yangon and in Bagan. While I'd been to the Shwedagon pagoda before, Bagan was a pretty incredible place (although possibly the hottest place I've ever been) simply because there are SO MANY temples and pagodas. Literally everywhere you look, if you have any kind of a view above the trees, you see a temple or a pagoda, or multiples of both. This time of year is not the best for visiting (so hot), and apparently you used to be able to climb more temples a few years ago, before the earthquakes made this dangerous, but it was still pretty amazing.
The last post was a bit of a brain dump to make sure I didn't forget a few lessons I learned, in part because I knew I was quitting the job that involved doing ML type things. While I was working there of course I learned a lot and (I think) acquitted myself pretty well, but language processing and machine learning are not really what I spent 4 years doing for my PhD. Python is a programming language that I picked up to make my work in grammatical description and syntax easier, and while I find ML (and programming) pretty interesting, my main interest lies in understanding how languages work through comparison, with the ultimate goal of reconstructing linguistic structures and (hopefully) prehistory.
A year and a half ago or so I started working on a grant proposal for that exact thing with some researchers at the University of Zurich. This is a relatively young department that is doing some really cool research in typology, processing, and language acquisition from a corpus-based perspective on multiple languages (both Indo-European and non-IE families/phyla). At the same time historical linguistics is a huge focus in the department, as is modeling language change. This is super exciting because I take the perspective that language is spoken by individuals in communities who acquire language from their forbears (history), use it as a tool for communication (processing), which gives rise to statistical tendencies that all languages share (typology). Since it is individuals using language, this is done in an idiosyncratic way, but since language is learned and guided by principles of processing, the only way to get at both the commonality and the uniqueness of language is by investigating actual language corpora (recordings, transcriptions, etc). Of course the story of how languages change is much more complex and involves many more factors, but be that as it may, this is a great place to be.
Picture: Lake Zurich from the hill above the university
So, long story short, we found out last October that the grant had been funded, and the family and I started making plans to move to Zurich. More on that later, perhaps. With this project, our goal at the moment is to build a database of Austroasiatic language corpora that we can then investigate for all sorts of interesting phenomena, but focusing (initially at least) on word order. By comparing word order in multiple languages of the same family we intend to make an effort toward reconstructing the form of the parent languages from which the present-day spoken languages diverged, and also to identify language contact and interaction effects to contribute to discussions about the development of word order patterns cross-linguistically.
I have been quite lax with posting here mainly because I have been working very hard on a difficult problem. My current job involves (among other things) trying to automate a psychological coding system for text. This is pretty similar to research in Sentiment Analysis (SA, see this brief introduction), a task in Natural Language Processing (NLP) which attempts to predict the 'positivity' or 'negativity' of sentences (i.e. classification), such as from Twitter. Companies find this useful for (among other things) getting a broad understanding of how consumers respond to their products and services.
The coding system I am trying to automate is a bit more complex, but I am using techniques similar to those used in SA. I started initially with SVM and other classifiers using the scikit-learn library, and I have now moved on to using Neural Networks. These are essentially Machine Learning models that allow a computer to generalize patterns in data that correspond to particular outputs. Humans do this pretty naturally - we recognize patterns in language, for example, that allow us to parse sounds into words and words into units of meaning that, when combined, help us communicate.
A data-driven approach
The idea with neural networks is that given enough labeled data, the computer's statistical model can capture the patterns that correspond to the labels and predict what the labels should be for data it has never seen before. This is essentially how character recognition (OCR) works - enough data has been fed to the computer that it has 'learned' what an "A" character looks like in a variety of images. But the model can only make good generalizations if it has a lot of data and a good diversity of data.
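As a rough illustration of "learning from labeled data", here is a minimal sketch in Python. It is a single perceptron over word counts, which is far simpler than the neural networks I am actually working with, and the four training sentences are invented for the example. The idea is the same, though: the model adjusts its weights whenever it mislabels a training example, and is then asked to label text it has not seen.

```python
from collections import defaultdict


def features(text):
    """Very crude features (an assumption): lowercased whitespace-separated words."""
    return text.lower().split()


def train(examples, epochs=10):
    """Train a perceptron on (text, label) pairs, where label is +1 or -1."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            score = sum(weights[word] for word in features(text))
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Nudge the weight of every word in the text toward the correct label.
                for word in features(text):
                    weights[word] += label
    return weights


def predict(weights, text):
    """Label unseen text with the learned weights; unknown words contribute nothing."""
    score = sum(weights.get(word, 0.0) for word in features(text))
    return 1 if score > 0 else -1


# Hypothetical labeled data: +1 for positive, -1 for negative.
labeled = [
    ("great product love it", 1),
    ("terrible service hate it", -1),
    ("love the great design", 1),
    ("hate this terrible thing", -1),
]

model = train(labeled)
print(predict(model, "love this great phone"))
print(predict(model, "terrible phone hate it"))
```

Words the model never saw during training have no weight at all, which is a small-scale version of the problem described here: without enough data, and enough variety in it, the model has nothing to generalize from.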
It is a common issue, especially in the field of Linguistics, that the amount of data available for a particular problem is limited (e.g. description of languages is often based on a 5-hour recorded corpus supplemented with elicitation, psycholinguistic experiments are usually conducted with sample sizes of 5-20 participants, though larger sample sizes would be better). This is partly because of how time-consuming data collection can be - combine this with the fact that you might be dealing with multiple languages and the issue is compounded. But even within a single language, particular problems only have limited datasets, and if we want to automate a system that has typically required trained humans (consider that PhD students who describe a language usually train/read for at least a year before fieldwork, after having completed an MA), there might be even less data than usual available for training, particularly data in the correct format.
Addressing problems in the neural network
This lack of data is a significant problem that we have been working to overcome by creating more coded data. In building ML models I have continued working with the existing data to achieve decent results on a small dataset, the hope being that once more coded data is available it can be incorporated and give a better result. Along the way I have come across several important learning points that I thought I would write here as a reference of sorts, to add to other great posts that have been helpful to me.
Final thoughts and further links
These are just a few of the things that I have learned in the past few months while trying to sort out my classification problems. Papers and online forums have been extremely helpful in developing my understanding of the issues involved, and I have benefitted particularly from this blog and this one (among others) and examples of models on the Keras GitHub repository. For a good (though brief) discussion of trying to implement state of the art text classification models, see this post. Ultimately, as one person noted on a forum, developing neural networks is as much an art as a science, requiring experimentation and intuition to figure out how to apply a particular model architecture to solve a particular problem.
One of the concerns that has occupied my mind for the past few years is the question of data accessibility in the field of Linguistics. I am happy to announce that the data that underpins my grammatical description of Pnar is now freely available as a downloadable archive in audio and text form (anonymized where requested by participants). You can find the link to the dataset at the bottom of this post, but in the meantime I'd like to explain my views surrounding data access and give a brief explanation of the tool I've used to make my linguistic data accessible.
Why accessible data?
Those who are familiar with linguistics understand that traditional descriptions of language are often based on recorded, transcribed and translated interviews and stories by speakers of the language. Although some theoretical work may be based on a few utterances or a single example, most linguistic work is based on many actual examples from utterances that real speakers produce.
One issue here is that there is such between- and within-speaker variation in speech that unless the data you use is actually accessible to other linguists, one can easily question the veracity of a particular analysis. In the interest of scientific enquiry, then, it is incumbent on the analyst-linguist to make their actual data accessible to other researchers in at least some form, whether in an archive or in a database. Having the data accessible to multiple researchers may lead to disagreements about analysis (there may be more than one way of analyzing a particular linguistic structure, for example), but ultimately such disagreements are healthy because they expand our knowledge.
This touches on a larger issue in the world of science, that of verifiability and reproducibility of research, which has galvanized the larger scientific community towards Open Science (see this blog post for an explanation, and check out this OSF paper), and in some fields such as Psychology, has actually resulted in a whole journal devoted to "replication studies". These kinds of studies are aimed at trying to replicate results and findings of a particular study by following the same procedure as the original researchers. When replication studies uphold a particular result, it makes it more likely that the original study's findings were not the result of a statistical anomaly or falsification of data, which is a very serious problem that can lead to erroneous claims requiring retraction. For more on this visit RetractionWatch.com.
What this means in the case of linguistic data is that the recordings, transcriptions, and translations that underlie a grammatical description or other study, whenever possible, should be made accessible to other linguists. Data sharing can be a touchy issue simply because of a) the ethical concerns of the providers of the data, b) potential cultural taboos, and c) the interests of the linguist who initially made and processed the data.
With proper permissions sought and precautions taken, these concerns can be minimized or dealt with appropriately. A linguist needs to (minimally) communicate to participants about how the data will be used, take the time to anonymize recordings and annotations when necessary, and create a license that constrains how the data can be used in the future. Ideally, if you are doing your research correctly, your university's Institutional Review Board will have already helped you to think through these things. There are also some excellent books, papers and chapters that deal (at least somewhat) with this subject, and there is a set of standards for social science research (with human subjects) and specifically for linguistics that researchers should be aware of.
Some reasons linguists don't share data
The final point (C, the interests of the linguist) is really the sticking point for most people. The reality is that many linguists do not want to release data for several reasons:
This brings me back to the earlier ruminations that started this post, namely that data produced by a linguist and which underpins their work ought to be accessible to other scientists and linguists. When I first submitted my PhD at NTU, I took a look at some of the options for data archiving, and I approached the university library (which keeps digital copies of all theses submitted at the university) to see if they could also store my audio and transcription data (over 1GB). About a year ago, they contacted me to let me know that they were developing such a service, something called DataVerse, and wanted to know if they could use my dataset to test it. I was happy to have them do so, and after some tweaking and time, this tool is now available for use.
DataVerse is a database/archive tool developed at Harvard University that allows researchers to store datasets that other researchers can download for use and testing. It supports the Open Science initiative by making data accessible and open. It also solves one of the problems I noted above by creating a unique URL identifier and citation for the dataset. You can check out my dataset at its DOI here and download it for research and non-commercial purposes.
As I was thinking about this previously, I realized that what I wanted was not really an archive but a database that would allow me to develop and annotate my data further. Unfortunately DataVerse is not that - it is basically just a storage tool. What is nice is that it provides versioning, so the curator of the dataset can upload and publish changes. I think I may have to create my own database if I want something that will let me explore the data better. But for now, the data is freely accessible for other linguists (even though my analysis isn't perfect), which is a bit of a load off my mind.
In a previous post I discussed some of the benefits I discovered in using LaTeX with LyX as a front-end. Another extremely useful tool to learn is a bibliography manager. If you are like I was, and often develop a new bibliography for each paper you write, this is something you might want to consider. On the other hand, you might be used to Microsoft applications and already be familiar with bibliography managers (such as Zotero) that integrate pretty well with the MS Office family.
As I started my PhD and began writing more papers, I noticed that many of my citations were the same. Rather than copy-pasting from previous papers and then adjusting the formatting for each submission, I realized it would be much easier to have a centralized location for all the papers I wanted to cite, and have the computer deal with the formatting according to the style sheet I needed to use.
Fortunately, support for this is ‘baked-in’ to LaTeX/LyX. There are various good tools that integrate with BibTeX (as the bibliography manager in LaTeX is called), but the one that I found to be most useful for my purposes is BibDesk. Rather than explain how I do it, I’ll point you to this excellent tutorial for Mac, which describes how to set it up. In the rest of this post I’ll simply give my reasons for using BibDesk.
Each of these reasons alone is worth getting your act together and creating a single bibliography repository. You can likely think of other good reasons, which just means there is no excuse not to do so.
Another issue that I am thinking about is how to make my library of citations/documents available on any computer with internet access via the cloud. This would ensure at the very least that I wouldn't worry as much about losing it if my computer dies (though I'm still going to make backups regularly). A fellow academic and friend of mine has managed to integrate all of his citations and PDFs with Zotero, and make it available on his phone in an app like Google Drive, so that at conferences he can remember a publication or search for one in a conversation and pull up the reference and/or associated document to show people. This is super cool and super useful - I’ll write about it if I can figure out how to pull it off on my own, or maybe I’ll get him to write about it.
One of the first things I did after passing my PhD confirmation exercise (like a qualifying exam in the USA) was to research the best way to write my thesis. As a side note, I use the word 'thesis' to refer to any large written work, including a PhD, while other English speakers might use the word 'dissertation' to refer specifically to the work that a PhD student produces. In any case, the relevant information here is the term "large", since I knew I was going to be writing a lot. I now consider the tools I'm writing about here to be essential for a productive workflow, and so this post continues the theme of an earlier post on linguistic tools.
In researching how to write my thesis, I asked friends and fellow linguists who had written grammatical descriptions. Most of them had used MS Word, and told me horror stories of lost work, un-editable documents due to the sheer size of their files, difficulties formatting and printing the thing, etc. So that was out of the question for me, at least at the time (2011; I think more recent versions of MS Office may have fixed some of these issues). But one of them mentioned a program called LaTeX (the funny capitalization is actually part of the name), and that it made typesetting and organization a breeze. And it's free! Which is pretty important to students (if not everyone).
So I checked it out, and ended up spending the next few months learning how to set it up on my computer and how to use it (I use MacTeX as the backend). I am fortunate that I have a little background in coding, because LaTeX is essentially a markup language. You write the text of what you want, formatting parts of it by using special combinations of characters and commands (or 'tags') that tell a program how to format them. Then you run a 'compiler' that outputs everything in the correct layout in a PDF. This is pretty brilliant, because it lets you (the writer) worry mostly about the content rather than the format. But learning how to fiddle with the code is rather time-consuming, so if you're not a hardcore programmer (and I still don't really consider myself one of the hardcore types) there is quite a learning curve. Worth it, but steep.
This is where a visual editor like LyX comes in. LyX is, pretty much out of the box, a simple way of interacting with your LaTeX code. It hides most of the code and offers formatting options, similar to MS Word or other word processors. Unlike them, however, you choose the general formatting parameters and let the backend handle the layout. You can also fiddle directly with the code if you need to, or add code to the front of the document for particular use cases, like a PhD cover page, interlinearized glossed text (IGT) examples, and more. Basically anything you need to add has probably been coded or figured out by someone, and if you're a troubleshooter like me you can run a Google search and find forums (and contribute to some yourself) that deal with your particular problem or at least something similar. And the assistance you get can be pretty phenomenal.
LyX does take a bit of configuration, and I might write another post that explains how I set it up for my use case(s). But for now, I’ll just say that using LaTeX/LyX was one of the best decisions I made as a PhD student. It really simplified my writing process and allowed me to do so much more. Rather than spending the final month on formatting my thesis, I was writing and making final changes all the way up to the deadline. I probably wrote more, and re-organized the structure more, in the last month than I had in the previous three. And the text file that contains my 700+ pages of analysis, examples, and appendices is only ~6 MB. Possibly the greatest benefit was that LyX kept track of all my linked example sentences, and formatted them all properly. Once I got it set up this saved me days and weeks of man-hours. The learning curve was totally worth it.
In closing, if you are seriously considering using LaTeX/LyX, there are lots of good articles about this online. Here’s one, and here’s a discussion on the topic, to get you started.
My next post was intended to be about streamlining a workflow by using various LaTeX authoring tools and a bibliography manager, but last week my Mac died. It is an older computer admittedly (from 2011), but it just had the graphics/motherboard replaced a year and a half ago due to a recall and I had high hopes that it would last at least until the end of the year. Unfortunately that didn’t happen - while at a conference over the weekend it gave up the ghost, leaving me with just a USB drive that had my presentation, and the last-minute changes to a Praat script that I had managed to upload to GitHub the night before.
Fortunately I have been saving up towards a new computer, since I knew this would probably happen. It’s had a good run, but when I got back to Singapore, replacing the mother/logic board turned out not to be worth it - I’d be better off putting the money I would spend fixing it towards a new computer.
What I find remarkable, however, is that I literally lost nothing except some emails that are stored on the email server anyhow, and maybe some intermediate changes to the Praat script. This is partly because I saw that the computer was starting to fail and backed up all my files with Time Machine on an external hard drive, but partly also because my workflow lends itself to backing up.
As per my previous post regarding Git, it is important to have a workflow that encourages you to snapshot changes. This is what the Git “commit” command does - fail to integrate it into your Git workflow at your peril. Another way to ensure that your files are backed up is to use online cloud storage like Google Drive and Dropbox. If you’re an academic you probably use both these services, but maybe your backup system is a bit ad-hoc. In my case, the death of my computer has suggested to me that I need to use multiple cloud storage services.
So here’s what I am currently doing using a temporary user account on my wife’s computer:
The main reason this works for me is that I generally don’t operate with Dropbox and Google Drive on - I only start these services manually when I want to sync with the online server. In some of my online research, it seems that there MAY be issues with running a Git repo or editing files when cloud storage is syncing, so I get around this potential issue by simply doing my offline edits, and then syncing cloud storage manually.
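For anyone who wants to script the "edit offline, then sync" idea, here is a minimal sketch in Python. It is only one hypothetical way of doing it and not a description of my exact setup: the function name, the folder names, and the assumption that each cloud service watches a local folder are all made up for illustration. It copies a working folder into each sync folder while the sync clients are switched off, so that starting them afterwards uploads the fresh copies.

```python
import shutil
import tempfile
from pathlib import Path


def mirror(source, sync_folders):
    """Copy the working folder into each (hypothetical) cloud-sync folder."""
    source = Path(source)
    copied = []
    for folder in sync_folders:
        target = Path(folder) / source.name
        # dirs_exist_ok (Python 3.8+) lets repeated runs overwrite an earlier copy.
        shutil.copytree(source, target, dirs_exist_ok=True)
        copied.append(target)
    return copied


if __name__ == "__main__":
    # Demonstration in a throwaway directory rather than real cloud folders.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        work = root / "praat-scripts"
        work.mkdir()
        (work / "segment.praat").write_text("# last-minute changes\n")
        for path in mirror(work, [root / "Dropbox", root / "GoogleDrive"]):
            print("copied to", path)
```

Keeping a full copy in each sync folder is exactly what makes the files take up twice the space on the hard drive, so this sketch shares that drawback.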
So as part of my workflow:
As I go, I am also discovering more files that need to be in a cloud server to avoid data loss. This is mitigated to some extent by regular Time Machine backups, but I’m beginning to wonder if I also need to invest in expanded cloud storage. The only other drawback of this process is that since I’m backing up in multiple cloud services, the storage takes up twice as much space on my hard drive as it otherwise would. I’m still thinking about how to deal with that issue…
I'm a linguist and singer-songwriter. I write about life, travel, language and technology.