Hiram Ring
  • Posts
  • Music
    • Projects
    • Downloads
    • Bio/Press
    • Music Photos
    • Music Links
    • Store
  • Linguistics
    • Travel Photos
    • Useful Linguistic Links

Neural Networks: Some lessons from text classification

12/3/2017

I have been quite lax about posting here, mainly because I have been working hard on a difficult problem. My current job involves (among other things) trying to automate a psychological coding system for text. This is quite similar to research in Sentiment Analysis (SA, see this brief introduction), a task in Natural Language Processing (NLP) that attempts to predict the 'positivity' or 'negativity' of sentences (i.e. classification), such as tweets. Companies find this useful for (among other things) getting a broad sense of how consumers respond to their products and services.

The coding system I am trying to automate is a bit more complex, but I am using techniques similar to those in SA. I started with SVMs and other classifiers from the SciKit-Learn library, and I have since moved on to neural networks. These are essentially machine learning models that allow a computer to generalize patterns in data that correspond to particular outputs. Humans do this quite naturally - we recognize patterns in language, for example, that allow us to parse sounds into words and words into units of meaning that, when combined, help us communicate.
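As a rough sketch, the kind of SA-style baseline I started with looks something like this: TF-IDF features feeding a linear SVM via a SciKit-Learn Pipeline. The sentences and labels below are invented toy data for illustration only, not my actual coding data.

```python
# A TF-IDF + linear SVM text-classification baseline with SciKit-Learn.
# The texts and labels are toy examples for illustration only.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = [
    "I love this product, it works great",
    "Fantastic service, very happy with it",
    "Terrible experience, it broke after a day",
    "Awful support, very disappointed",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),  # turn text into weighted word counts
    ("svm", LinearSVC()),                        # linear classifier over those features
])
model.fit(texts, labels)

print(model.predict(["great product, love it"])[0])
```

The nice thing about a pipeline like this is that the same vectorizer is applied to training and prediction data, so you can't accidentally featurize them differently.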

A data-driven approach
  The idea with neural networks is that, given enough labeled data, the computer's statistical model can capture the patterns that correspond to the labels and predict what the labels should be for data it has never seen before. This is essentially how optical character recognition (OCR) works - enough images have been fed to the computer that it has 'learned' what an "A" looks like in a wide variety of forms. But the model can only generalize well if it has a large and diverse dataset.

It is a common issue, especially in Linguistics, that the amount of data available for a particular problem is limited: descriptions of languages are often based on a 5-hour recorded corpus supplemented with elicitation, and psycholinguistic experiments are usually run with 5-20 participants, even though larger samples would be ideal. This is partly because of how time-consuming data collection is - and if you are dealing with multiple languages, the issue is compounded. Even within a single language, particular problems have only limited datasets. And if we want to automate a system that has typically required trained humans (consider that PhD students who describe a language usually train and read for at least a year before fieldwork, after having completed an MA), there may be even less data than usual available for training, particularly data in the correct format.

Addressing problems in the neural network
  This lack of data is a significant problem, which we have been working to overcome by creating more coded data. In the meantime I have continued building ML models with the existing data to achieve decent results on a small dataset, the hope being that once more coded data is available it can be incorporated for a better result. Along the way I have come across several important lessons that I thought I would write down here as a reference of sorts, adding to other great posts that have been helpful to me.
  1. Check your code for bugs.
      I am coding in Python simply because it has lots of great libraries for machine learning (Keras, TensorFlow, Theano), data transformation and storage (Pandas, NumPy), and language processing (NLTK, spaCy). Small bugs can silently feed your model the wrong kind of data to generalize from, so it learns nothing. Garbage in, garbage out, unfortunately. Printing to the console at key points helps you identify these bugs.
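      For instance, a few cheap checks right before training can catch exactly this kind of bug. The data and names here are invented for illustration:

```python
# Cheap sanity checks before training: confirm the model receives
# what you think it does. Toy data for illustration only.
from collections import Counter

texts = ["good service", "bad service", "great product", ""]
labels = [1, 0, 1, 1]

# Inputs and labels must stay aligned
assert len(texts) == len(labels), "texts and labels out of sync"

# Does the label distribution look like the coding scheme?
print("label counts:", Counter(labels))

# Spot-check a few raw examples next to their labels
for text, label in zip(texts[:3], labels[:3]):
    print(repr(text), "->", label)

# Flag suspicious inputs: empty strings silently become zero vectors
empty = [i for i, t in enumerate(texts) if not t.strip()]
print("empty inputs at indices:", empty)  # [3]
```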
  2. Check your data for errors.
      I am not a specialist in the coding system I am trying to automate, but I am a linguist, which helps me recognize when the input text is wrong. For the people labeling the data, checking each other's work is probably the most time-consuming part of the process. With linguistic data, I have found it helpful to normalize the text, but not too much. Basic steps like lowercasing, removing punctuation, and simple spelling correction can help, but part-of-speech tagging and lemmatizing or stemming may not help at all. Part of this has to do with how accurately NLP tools can lemmatize or stem words. For example, WordNet is commonly used for lemmatizing, but if the POS tag is incorrect it returns entirely the wrong word.
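      As a sketch, the 'light' normalization I mean is roughly this (the function name is my own), deliberately stopping short of stemming or lemmatizing:

```python
# Light text normalization: lowercase, strip punctuation, collapse
# whitespace - and stop there, leaving word forms intact.
import string

def normalize(text):
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())  # collapse runs of whitespace

print(normalize("Hello,  World!!  It's 2017."))  # hello world its 2017
```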
  3. Balance your data.
      This is probably the most important pre-processing step before feeding data into your model, especially when working with small datasets. Your data may be highly skewed toward one label: in one of the binary classification problems I'm working on, the ratio of zeros to ones is roughly 10:1. If you are splitting your data into training and validation sets (which you should be!), the proportions can change depending on how the split is done. With a random split, your training data could end up with a label ratio of 15:1 while your validation data is 6:1, which makes it much harder for the model to generalize to unseen data. Just fixing the proportions in the training and validation sets (and then shuffling each set independently) improved my correlations with unseen data by about 0.20 (Pearson's r) on average.
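      A dependency-free sketch of such a stratified split (the names and numbers are my own; libraries like SciKit-Learn offer the same idea via the stratify parameter of train_test_split): split each label group proportionally so both sets keep the same label ratio, then shuffle each set independently.

```python
# Stratified train/validation split that preserves the label ratio.
import random
from collections import defaultdict

def stratified_split(samples, labels, val_fraction=0.2, seed=42):
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append((sample, label))
    train, val = [], []
    for group in by_label.values():
        rng.shuffle(group)                       # randomize within each label
        n_val = int(len(group) * val_fraction)   # proportional share per label
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    rng.shuffle(train)  # shuffle each set independently
    rng.shuffle(val)
    return train, val

# A 10:1 skew like the one described above
data = [f"s{i}" for i in range(110)]
labels = [0] * 100 + [1] * 10
train, val = stratified_split(data, labels)

train_ratio = sum(1 for _, y in train if y == 0) / sum(1 for _, y in train if y == 1)
val_ratio = sum(1 for _, y in val if y == 0) / sum(1 for _, y in val if y == 1)
print(train_ratio, val_ratio)  # both 10.0
```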
  4. Predict on completely unseen data.
      Many of the papers I have been reading (and I've been reading a lot!) report the accuracy of their model on validation data, which is usually a subset of the overall dataset (between 10% and 50%, depending on the problem). This is fine, but validation data is typically used to tune the model from epoch to epoch - for early stopping or hyper-parameter choices - so it is not truly unseen. To my mind, 94% accuracy (which is what some papers report for their SA models) only tells you how well the model has learned the dataset, not how it performs on real-world instances. Ideally, you should hold a separate test set out of training and validation altogether and predict its labels with your model. The correlation between the predicted and actual labels gives a much better idea of the model's accuracy on unseen data.
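      That final check can be as simple as computing Pearson's r between predictions and held-out labels. Here it is computed by hand to stay dependency-free (in practice scipy.stats.pearsonr does the same); the toy numbers are invented.

```python
# Pearson's r between held-out labels and model predictions.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

actual    = [0, 1, 1, 0, 1, 0, 0, 1]  # held-out labels (toy)
predicted = [0, 1, 0, 0, 1, 0, 1, 1]  # model output (toy)
print(round(pearson_r(actual, predicted), 3))  # 0.5
```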

Final thoughts and further links
  These are just a few of the things I have learned in the past few months while trying to sort out my classification problems. Papers and online forums have been extremely helpful in developing my understanding of the issues involved, and I have benefited particularly from this blog and this one (among others), as well as the example models in the Keras GitHub repository. For a good (though brief) discussion of implementing state-of-the-art text classification models, see this post. Ultimately, as one person noted on a forum, developing neural networks is as much an art as a science, requiring experimentation and intuition to apply a particular model architecture to a particular problem.

