The last post was a bit of a brain dump to make sure I didn't forget a few lessons I learned, in part because I knew I was quitting the job that involved doing ML-type things. While I was working there I of course learned a lot and (I think) acquitted myself pretty well, but language processing and machine learning are not really what I spent four years doing for my PhD. Python is a programming language that I picked up to make my work in grammatical description and syntax easier, and while I find ML (and programming) pretty interesting, my main interest lies in understanding how languages work through comparison, with the ultimate goal of reconstructing linguistic structures and (hopefully) prehistory.
A year and a half ago or so I started working on a grant proposal for that exact thing with some researchers at the University of Zurich. This is a relatively young department that is doing some really cool research in typology, processing, and language acquisition from a corpus-based perspective on multiple languages (both Indo-European and non-IE families/phyla). At the same time historical linguistics is a huge focus in the department, as is modeling language change. This is super exciting because I take the perspective that language is spoken by individuals in communities who acquire it from their forebears (history) and use it as a tool for communication (processing), and that this use gives rise to statistical tendencies that all languages share (typology). Since it is individuals using language, this is done in an idiosyncratic way, but since language is learned and guided by principles of processing, the only way to get at both the commonality and the uniqueness of language is by investigating actual language corpora (recordings, transcriptions, etc.). Of course the story of how languages change is much more complex and involves many more factors, but be that as it may, this is a great place to be.
Picture: Lake Zurich from the hill above the university
So, long story short, we found out last October that the grant had been funded, and the family and I started making plans to move to Zurich. More on that later, perhaps. With this project, our goal at the moment is to build a database of Austroasiatic language corpora that we can then investigate for all sorts of interesting phenomena, but focusing (initially at least) on word order. By comparing word order in multiple languages of the same family we intend to make an effort toward reconstructing the form of the parent languages from which the present-day spoken languages diverged, and also to identify language contact and interaction effects to contribute to discussions about the development of word order patterns cross-linguistically.
I have been quite lax with posting here mainly because I have been working very hard on a difficult problem. My current job involves (among other things) trying to automate a psychological coding system for text. This is pretty similar to research in Sentiment Analysis (SA, see this brief introduction), a task in Natural Language Processing (NLP) which attempts to predict the 'positivity' or 'negativity' of sentences (i.e. classification), such as from Twitter. Companies find this useful for (among other things) getting a broad understanding of how consumers respond to their products and services.
The coding system I am trying to automate is a bit more complex, but I am using techniques similar to those in SA. I started initially with SVMs and other classifiers using the scikit-learn library, and I have now moved on to using Neural Networks. These are essentially Machine Learning models that allow a computer to generalize patterns in data that correspond to particular outputs. Humans do this pretty naturally - we recognize patterns in language, for example, that allow us to parse sounds into words and words into units of meaning that, when combined, help us communicate.
A data-driven approach
The idea with neural networks is that given enough labeled data, the computer's statistical model can capture the patterns that correspond to the labels and predict what the labels should be for data it has never seen before. This is essentially how character recognition (OCR) works - enough data has been fed to the computer that it has 'learned' what an "A" character looks like in a variety of images. But the model can only make good generalizations if it has a lot of data and a good diversity of data.
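To make the "learn from labeled data, then predict on unseen data" idea a bit more concrete, here is a minimal sketch in plain Python. It is not the model I actually use: it is a tiny bag-of-words perceptron (a single-layer ancestor of neural networks), and the sentences, labels, and function names are all made up for illustration.

```python
from collections import defaultdict

# Made-up labeled data: +1 for 'positive', -1 for 'negative'.
TRAIN = [
    ("i love this phone", 1),
    ("what a great service", 1),
    ("great battery and i love it", 1),
    ("i hate this phone", -1),
    ("what a terrible service", -1),
    ("terrible battery and i hate it", -1),
]


def featurize(text):
    """Count each lowercased, whitespace-separated token in the text."""
    counts = defaultdict(int)
    for token in text.lower().split():
        counts[token] += 1
    return counts


def train_perceptron(examples, epochs=10):
    """Learn one weight per word from (text, label) pairs."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            feats = featurize(text)
            score = sum(weights[w] * c for w, c in feats.items())
            predicted = 1 if score > 0 else -1
            # Only adjust the weights when the current guess is wrong.
            if predicted != label:
                for w, c in feats.items():
                    weights[w] += label * c
    return weights


def predict(weights, text):
    """Label a sentence using the learned weights; unknown words count as 0."""
    score = sum(weights.get(w, 0.0) * c for w, c in featurize(text).items())
    return 1 if score > 0 else -1


weights = train_perceptron(TRAIN)
print(predict(weights, "love the great screen"))
print(predict(weights, "hate the terrible screen"))
```

Neither test sentence appears in the training data, but both contain words the model has picked up on. A sentence made up entirely of unseen words gives a score of zero, so the model has nothing to go on, which is the small-scale version of needing a lot of data and a good diversity of data.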
It is a common issue, especially in the field of Linguistics, that the amount of data available for a particular problem is limited (e.g. description of languages is often based on a 5-hour recorded corpus supplemented with elicitation, and psycholinguistic experiments are usually conducted with sample sizes of 5-20 participants, though larger samples are preferable). This is partly because of how time-consuming data collection can be - combine this with the fact that you might be dealing with multiple languages and the issue is compounded. But even within a single language, particular problems only have limited datasets. If we want to automate a system that has typically required trained humans (consider that PhD students who describe a language usually train/read for at least a year before fieldwork, after having completed an MA), there might be even less data than usual available for training, particularly data in the correct format.
Addressing problems in the neural network
This lack of data is a significant problem that we have been working to overcome by creating more coded data. In building ML models I have continued working with the existing data to achieve decent results on a small dataset, the hope being that once more coded data is available it can be incorporated and give a better result. Along the way I have come across several important learning points that I thought I would write here as a reference of sorts, to add to other great posts that have been helpful to me.
Final thoughts and further links
These are just a few of the things that I have learned in the past few months while trying to sort out my classification problems. Papers and online forums have been extremely helpful in developing my understanding of the issues involved, and I have benefitted particularly from this blog and this one (among others) and examples of models on the Keras GitHub repository. For a good (though brief) discussion of trying to implement state-of-the-art text classification models, see this post. Ultimately, as one person noted on a forum, developing neural networks is as much an art as a science, requiring experimentation and intuition to figure out how to apply a particular model architecture to solve a particular problem.
Well, now that I've submitted my thesis I have been happily doing things other than writing. Among them: moving house (I'm staying with a friend for a couple of months), planning adventures, hanging out with friends, and applying for jobs. It's been good to catch up with people and not have something hanging over my head any longer.
Last week I also tidied up my desk and tied up some loose ends, one of which was a Praat script. As I have mentioned before, online communities are a fabulous place to learn about various things, and one that is particularly active with regard to Praat can be found here on Yahoo. Praat is a free program to visualize and manipulate sound files (particularly for linguistic purposes), and scripts that automate various features can be extremely useful. I tried to write one that would plot a series of vowels with an ellipse encompassing each vowel space and a single character of the vowel in the center of each ellipse, but I was having trouble. José was able to rewrite the script for a cleaner and better result, for which I am extremely grateful.
For other linguists interested in a script that plots a series of vowels (F1, F2) from a CSV file, check out the link above or the file below.
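For anyone who would rather poke at the numbers outside Praat first, here is a rough sketch in Python (not the Praat script itself, and not José's version) of the step that comes before plotting: reading F1/F2 values from a CSV and finding the mean position of each vowel, which is where the vowel label would sit. The column names `vowel`, `F1`, `F2` and the formant values are assumptions I made up for the example.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up formant measurements in Hz; a real file would come from your own data.
SAMPLE_CSV = """vowel,F1,F2
i,280,2250
i,300,2300
a,750,1300
a,770,1250
u,310,870
u,330,850
"""


def vowel_centroids(csv_file):
    """Return {vowel: (mean F1, mean F2)} from rows with vowel, F1, F2 columns."""
    formants = defaultdict(list)
    for row in csv.DictReader(csv_file):
        formants[row["vowel"]].append((float(row["F1"]), float(row["F2"])))
    return {
        vowel: (mean(f1 for f1, _ in values), mean(f2 for _, f2 in values))
        for vowel, values in formants.items()
    }


# io.StringIO stands in for an open file; open("vowels.csv", newline="") would work too.
centroids = vowel_centroids(io.StringIO(SAMPLE_CSV))
for vowel, (f1, f2) in centroids.items():
    print(f"{vowel}: F1={f1:.0f} Hz, F2={f2:.0f} Hz")
```

When plotting these, vowel charts conventionally reverse both axes (F2 on the horizontal, F1 on the vertical) so that the picture lines up with the familiar vowel quadrilateral.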
Travel plans are coming together nicely, so I'll be updating a bit more often in the coming weeks, unless I get too busy having fun!
As I've been working with code to try and do some programming to get the computer to format my text properly, I've run into some issues. It's got me thinking... You know how computers think... wait, you do?! No you don't! Computers don't think, unfortunately, that's the problem. Computers aren't good at connecting the dots or making inferences like humans are. All they can do is connect the dots that a human tells them to. There's the rub. The computer is only as smart as you are.
Fortunately, when I'm writing a program to go through my 80,000+ words of text (times 6, since there are 4 lines of interlinearization plus one of free translation = 480,000), which it parses in an instant, the computer tells me when it fails. Or rather, since I'm writing the code, when I FAIL. You know exactly where you stand with a computer, because there's only one right way for code to run, and that's if all the processes are logical and well-formed according to the rules of the code's architecture.
I must say I'm glad that life isn't that way. Yes, there are principles that can be recognized and lived. You generally receive from life based on what you put into relationships, study, work, etc... But there's no single perfect way to run. It's not like the world is a giant piece of code architecture and your life is a logical process from one thing to another. Life is dynamic. It can change and be changed by a small movement in one direction or another. And failure is just the beginning of a new direction.
On the way back to the office from dinner the other night (see how much time this coding takes if I go back to the office after dinner!) I was talking with one of my friends about job prospects and how life changes. There's a lot of uncertainty, but I said that one thing I've learned is to figure out what is important to you and make it part of your life. I guess I'm still figuring...
I'm a linguist and singer-songwriter. I write about life, travel, language and technology.