I have been quite lax with posting here mainly because I have been working very hard on a difficult problem. My current job involves (among other things) trying to automate a psychological coding system for text. This is pretty similar to research in Sentiment Analysis (SA, see this brief introduction), a task in Natural Language Processing (NLP) which attempts to predict the 'positivity' or 'negativity' of sentences (i.e. classification), such as from Twitter. Companies find this useful for (among other things) getting a broad understanding of how consumers respond to their products and services.
The coding system I am trying to automate is a bit more complex, but I am using similar techniques as in SA. I started initially with SVM and other classifiers using the SciKit-Learn library, and I have now moved on to using Neural Networks. These are essentially Machine Learning models that allow a computer to generalize patterns in data that correspond to particular outputs. Humans do this pretty naturally - we recognize patterns in language, for example, that allow us to parse sounds into words and words into units of meaning that, when combined, help us communicate.
A data-driven approach
The idea with neural networks is that given enough labeled data, the computer's statistical model can capture the patterns that correspond to the labels and predict what the labels should be for data it has never seen before. This is essentially how character recognition (OCR) works - enough data has been fed to the computer that it has 'learned' what an "A" character looks like in a variety of images. But the model can only make good generalizations if it has a lot of data and a good diversity of data.
It is a common issue, especially in the field of Linguistics, that the amount of data available for a particular problem is limited (i.e. description of languages is often based on a 5-hour recorded corpus supplemented with elicitation, psycholinguistic experiments are usually conducted with sample sizes of 5-20 participants, though larger sample sizes are more ideal). This is partly because of how time-consuming data collection can be - combine this with the fact that you might be dealing with multiple languages and the issue is compounded. But even within a single language, particular problems only have limited datasets, and if we want to automate a system that has typically required trained humans (consider that PhD students who describe a language usually train/read for at least a year before fieldwork, after having completed an MA), there might be even less data than usual available for training, particularly data in the correct format.
Addressing problems in the neural network
This lack of data is a significant problem that we have been working to overcome by creating more coded data. In building ML models I have continued working with the existing data to achieve decent results on a small dataset, the hope being that once more coded data is available it can be incorporated and give a better result. Along the way I have come across several important learning points that I thought I would write here as a reference of sorts, to add to other great posts that have been helpful to me.
Final thoughts and further links
These are just a few of the things that I have learned in the past few months while trying to sort out my classification problems. Papers and online forums have been extremely helpful in developing my understanding of the issues involved, and I have benefitted particularly from this blog and this one (among others) and examples of models on the Keras GitHub repository. For a good (though brief) discussion of trying to implement state of the art text classification models, see this post. Ultimately, as one person noted on a forum, developing neural networks is as much an art as a science, requiring experimentation and intuition to figure out how to apply a particular model architecture to solve a particular problem.
I'm a linguist and singer-songwriter. I write about life, travel, language and technology.