The past week has been quite exciting, as we've been in the last stages of preparing a final manuscript on machine learning. I've been rather quiet about this collaboration since the end of 2017, what with the new focus of the project in Zurich, a growing family, and the general pressures of life, not to mention the difficulty of developing a text classifier using neural networks. This particular classification problem is made harder by the fact that implicit motive content is much more diverse in its linguistic manifestation than, for example, sentiment (positive/negative polarity).
This equal-authored paper, however, marks the first phase of what we hope will be a long-running and ultimately successful collaborative research focus: developing a neural network model that can automate the time-consuming task of assessing implicit motive imagery (a personality measure) in written text. If you're interested in reading about it (we've tried to make the research fields of Implicit Motives, Machine Learning, and Natural Language Processing somewhat accessible), the citation for the online publication is below. And if you follow the link, you should be able to read the paper in its entirety (without needing a Springer access account).
If you're interested in checking out how a text classifier for implicit motives might work, you can visit the web app that we built, which uses an underlying CNN model for classification. Keep in mind that not all text content has implicit motive imagery, and that this classifier does not yet perform classifications on par with trained human coders. However, it is a start. I may highlight some of the niceties that this particular neural network model can and can't handle in a future post.
In addition to the paper and links above, we provide trained models, data, and some dataset descriptives on the Open Science Framework website, which can be cited and linked to below.
I am hoping to write up an even more accessible summary of the paper in the coming weeks, but we'll see how it goes. Since the family is currently under COVID 'lockdown' in Singapore with two kids under 2, the only opportunities for doing anything not 'kid-wrangling' related come when the kids are sleeping, which is also when parents have to eat/shower/clean and do housework. This doesn't leave much time for anything else, which is another reason it was exciting to have finished the paper!
I have been quite lax with posting here, mainly because I have been working very hard on a difficult problem. My current job involves (among other things) trying to automate a psychological coding system for text. This is pretty similar to research in Sentiment Analysis (SA, see this brief introduction), a task in Natural Language Processing (NLP) which attempts to predict the 'positivity' or 'negativity' of sentences, such as tweets from Twitter (i.e. a classification task). Companies find this useful for (among other things) getting a broad understanding of how consumers respond to their products and services.
The coding system I am trying to automate is a bit more complex, but I am using techniques similar to those in SA. I started with SVM and other classifiers from the scikit-learn library, and I have now moved on to using Neural Networks. These are essentially Machine Learning models that allow a computer to generalize patterns in data that correspond to particular outputs. Humans do this pretty naturally - we recognize patterns in language, for example, that allow us to parse sounds into words and words into units of meaning that, when combined, help us communicate.
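To give a rough idea of what an SVM starting point with scikit-learn can look like, here is a minimal sketch. It is not the actual setup from my project: the four training sentences and their 'pos'/'neg' labels are invented for illustration, and the choice of TF-IDF features with a linear SVM is just one common baseline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, invented data: real coding systems need far more labeled text.
train_texts = [
    "i love this phone",
    "what a great day",
    "i hate this phone",
    "what a terrible day",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Turn each text into TF-IDF word weights, then fit a linear SVM on them.
baseline = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
baseline.fit(train_texts, train_labels)

# Ask the fitted pipeline to label sentences it was not trained on.
predictions = baseline.predict(["a great phone", "a terrible phone"])
print(list(predictions))
```

With so little data the predictions themselves mean very little; the point is only the shape of the workflow (vectorize, fit, predict).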
A data-driven approach
The idea with neural networks is that given enough labeled data, the computer's statistical model can capture the patterns that correspond to the labels and predict what the labels should be for data it has never seen before. This is essentially how character recognition (OCR) works - enough data has been fed to the computer that it has 'learned' what an "A" character looks like in a variety of images. But the model can only make good generalizations if it has a lot of data and a good diversity of data.
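The 'learn from labeled data, then predict on unseen data' idea can be shown with something much simpler than a real neural network. The sketch below is a single perceptron over bag-of-words features, written in plain Python. The function names (`tokenize`, `train_perceptron`, `predict`) and the toy sentences are my own inventions for illustration, not anything from the project.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train_perceptron(examples, epochs=10):
    """Learn one weight per word from (text, label) pairs, label in {+1, -1}."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            tokens = tokenize(text)
            score = bias + sum(weights[t] for t in tokens)
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Nudge the weights of the words in this text towards the label.
                for t in tokens:
                    weights[t] += label
                bias += label
    return weights, bias


def predict(model, text):
    """Label a text with the learned weights; unknown words count as 0."""
    weights, bias = model
    score = bias + sum(weights.get(t, 0.0) for t in tokenize(text))
    return 1 if score > 0 else -1


# Toy, invented training data: +1 for positive, -1 for negative.
labeled = [
    ("i love this phone", 1),
    ("what a great day", 1),
    ("i hate this phone", -1),
    ("what a terrible day", -1),
]
model = train_perceptron(labeled)

# Neither sentence appears in the training data.
print(predict(model, "a great phone"))     # prints 1
print(predict(model, "a terrible phone"))  # prints -1
```

The model gets these right only because 'great' and 'terrible' showed up in training. A sentence made up entirely of unseen words scores 0 and falls through to -1, which is the small-scale version of why a lot of data, and a good diversity of it, matters.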
It is a common issue, especially in the field of Linguistics, that the amount of data available for a particular problem is limited (e.g. descriptions of languages are often based on a 5-hour recorded corpus supplemented with elicitation, and psycholinguistic experiments are usually conducted with sample sizes of 5-20 participants, though larger samples would be better). This is partly because of how time-consuming data collection can be, and the issue is compounded if you are dealing with multiple languages. But even within a single language, particular problems only have limited datasets. If we want to automate a system that has typically required trained humans (consider that PhD students who describe a language usually train/read for at least a year before fieldwork, after having completed an MA), there might be even less data than usual available for training, particularly data in the correct format.
Addressing problems in the neural network
This lack of data is a significant problem that we have been working to overcome by creating more coded data. In building ML models I have continued working with the existing data to achieve decent results on a small dataset, the hope being that once more coded data is available it can be incorporated and give a better result. Along the way I have come across several important learning points that I thought I would write here as a reference of sorts, to add to other great posts that have been helpful to me.
Final thoughts and further links
These are just a few of the things that I have learned in the past few months while trying to sort out my classification problems. Papers and online forums have been extremely helpful in developing my understanding of the issues involved, and I have benefitted particularly from this blog and this one (among others) and examples of models on the Keras GitHub repository. For a good (though brief) discussion of trying to implement state-of-the-art text classification models, see this post. Ultimately, as one person noted on a forum, developing neural networks is as much an art as a science, requiring experimentation and intuition to figure out how to apply a particular model architecture to a particular problem.
I'm a linguist and singer-songwriter. I write about life, travel, language and technology.