Hiram Ring

Automating the coding of implicit motives (Paper announcement)

5/14/2020


 
The past week has been quite exciting, as we've been in the last stages of preparing a final manuscript on machine learning. I've been rather quiet about this collaboration since the end of 2017, what with the new focus of the project in Zurich, a growing family, and the general pressures of life, not to mention the difficulty of developing a text classifier using neural networks. This particular classification problem is made harder by the fact that motive content is much more diverse in its linguistic manifestation than, for example, sentiment (positive/negative polarity).

This equal-authored paper, however, marks the first phase of what we hope will be a long-running and ultimately successful collaborative research focus on developing a neural network model that can automate the time-consuming task of assessing implicit motive imagery (a personality measure) in written text. If you're interested in reading about it (we've tried to make the research fields of Implicit Motives, Machine Learning, and Natural Language Processing somewhat accessible), the citation for the online publication is below. And if you follow the URL, you should be able to read the paper in its entirety (without needing a Springer access account).
  • Pang, Joyce S. & Hiram Ring. 2020. Automated coding of implicit motives: A machine-learning approach. Motivation and Emotion, pp. 1–19. doi:10.1007/s11031-020-09832-8. URL https://rdcu.be/b38pm.
If you're interested in checking out how a text classifier for implicit motives might work, you can visit the web app that we built, which uses an underlying CNN model for classification. Keep in mind that not all text content has implicit motive imagery, and that this classifier does not yet perform classifications on par with trained human coders. However, it is a start. I may highlight some of the niceties that this particular neural network model can/can't handle in a future post.

In addition to the paper and links above, we provide trained models, data, and some dataset descriptives on the Open Science Framework website, which can be cited and linked to below.
  • Pang, Joyce S. & Hiram Ring. 2020. Automating implicit motive coding: Replication data and descriptives. OSF. doi:10.17605/OSF.IO/AURWB. URL https://osf.io/aurwb/.
I am hoping to write up an even more accessible summary of the paper in the coming weeks, but we'll see how it goes. Since the family is currently under COVID 'lockdown' in Singapore with two kids under 2, the only opportunities for doing anything not 'kid-wrangling' related come when the kids are sleeping, which is also when parents have to eat/shower/clean and do housework. This doesn't leave much time for anything else, which is another reason it was exciting to have finished the paper!

Open Science, Data, and Linguistics

8/18/2017


 
One of the concerns that has occupied my mind for the past few years is the question of data accessibility in the field of Linguistics. I am happy to announce that the data that underpins my grammatical description of Pnar is now freely available as a downloadable archive in audio and text form (anonymized where requested by participants). You can find the link to the dataset at the bottom of this post, but in the meantime I'd like to explain my views surrounding data access and give a brief explanation of the tool I've used to make my linguistic data accessible.

Why accessible data?
  Those who are familiar with linguistics understand that traditional descriptions of language are often based on recorded, transcribed and translated interviews and stories by speakers of the language. Although some theoretical work may be based on a few utterances or a single example, most linguistic work is based on many actual examples from utterances that real speakers produce.

One issue here is that there is so much between- and within-speaker variation in speech that, unless the data you use is actually accessible to other linguists, the veracity of a particular analysis can easily be questioned. In the interest of scientific enquiry, then, it is incumbent on the analyst-linguist to make their actual data accessible to other researchers in at least some form, whether in an archive or in a database. Having the data accessible to multiple researchers may lead to disagreements about analysis (there may be more than one way of analyzing a particular linguistic structure, for example), but ultimately such disagreements are healthy because they expand our knowledge.

Research verifiability/reproducibility
  This touches on a larger issue in the world of science, that of verifiability and reproducibility of research, which has galvanized the larger scientific community towards Open Science (see this blog post for an explanation, and check out this OSF paper), and in some fields such as Psychology, has actually resulted in a whole journal devoted to "replication studies". These kinds of studies are aimed at trying to replicate the results and findings of a particular study by following the same procedure as the original researchers. When replication studies uphold a particular result, it becomes more likely that the original study's findings were not the result of a statistical anomaly or falsification of data, which is a very serious problem that can lead to erroneous claims requiring retraction. For more on this visit RetractionWatch.com.

What this means in the case of linguistic data is that the recordings, transcriptions, and translations that underlie a grammatical description or other study should, whenever possible, be made accessible to other linguists. Data sharing can be a touchy issue because of a) the ethical concerns of the providers of the data, b) potential cultural taboos, and c) the interests of the linguist who initially collected and processed the data.

With proper permissions sought and precautions taken, these concerns can be minimized or dealt with appropriately. A linguist needs to (minimally) communicate to participants how the data will be used, take the time to anonymize recordings and annotations when necessary, and create a license that constrains how the data can be used in the future. Ideally, if you are doing your research correctly, your university's Institutional Review Board will have already helped you to think through these things. There are also some excellent books, papers and chapters that deal (at least somewhat) with this subject, and there is a set of standards for social science research (with human subjects) and specifically for linguistics that researchers should be aware of.

Some reasons linguists don't share data
  The final point (c, the interests of the linguist) is really the sticking point for most people. The reality is that many linguists do not want to release data, for several reasons:
  1. They haven't had time to go through it themselves to their satisfaction.
      This is often the main reason data doesn't get shared. It is common in fieldwork linguistics to collect many hours of recordings that a linguist never has time to annotate. In my case I recorded something like 11 hours of stories and conversations, but I was only able to transcribe and translate (and annotate) around 8 hours or so. The other 3 hours just didn't get processed, and this is one of my future tasks - to sit with a speaker and spend the time transcribing and translating. Consider, for example, that 5 minutes of recorded speech often takes something like one hour of time to transcribe and annotate with the help of a native speaker. This is very time-consuming and laborious work, which means that often there are recordings that remain with very little annotation and no transcription.
  2. They are worried that their analysis will be critiqued.
      Many linguists who do fieldwork are essentially apprentices, and are just starting to learn how to analyze linguistic structure on their own (i.e. PhD students). It can be extremely intimidating to know that you have lots of questions about how a language works, even after working on it for several years, and to know at the same time that people who have 20+ years of experience on multiple languages may be critiquing your data collection and analysis. I think this happens in many different disciplines, and it can be a barrier to making data public simply because of the personal fear that individuals can have.
  3. They are worried that their work will be 'stolen' or repackaged.
      The fact that this is even a concern is telling about both the field of linguistics and the way data is treated. In the field of linguistics, it is really incumbent on senior linguists to honor data 'ownership' or curation by the primary data collector, by citing data properly.
      Ownership is a bit of an issue to work out sometimes, as beginning field linguists are often paid or supported in their work by a supervisor's grant. I think the best way is to follow a 'time-spent' principle. That is, the person who has spent the most time with the data has the largest share of ownership of the final form of the dataset. This is strictly regarding the annotation, transcription, and translation of the dataset (the speakers who speak on the recordings obviously have a different kind of ownership).
      Other kinds of ownership or use ought to be negotiated very early on by interested parties - in my case, for example, I and my supervisor agreed that the data I created would be available to him for research use, and that I would share it relatively freely with others for non-commercial use. Regarding data ownership, linguists can honor this by citing the source or providing proper attribution, but it may be the case that the data cited is not readily available. It may be printed in the back of a grammatical description (or some portions of it may be), but more often it is located in a collection of notebooks on the shelf of the linguist somewhere, gathering dust.
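To put the transcription estimate from the first reason above into numbers, here is a rough sketch in Python. It assumes the ratio I mentioned (about 5 minutes of recorded speech per hour of work with a native speaker) holds across the whole corpus, which is a simplification; the function name and its parameter are just illustrative.

```python
def annotation_hours(recorded_hours, minutes_per_work_hour=5):
    """Estimate the working hours needed to transcribe and annotate recordings.

    Assumes a constant rate: `minutes_per_work_hour` minutes of recorded
    speech are processed in one hour of work (5 is the rough figure
    quoted in the post, not a measured value).
    """
    recorded_minutes = recorded_hours * 60
    return recorded_minutes / minutes_per_work_hour


# The roughly 8 hours already transcribed, and the roughly 3 hours still unprocessed
already_spent = annotation_hours(8)
still_needed = annotation_hours(3)

print(f"Estimated work so far: {already_spent:.0f} hours")
print(f"Estimated work remaining: {still_needed:.0f} hours")
```

Under that assumption, the processed portion represents something like 96 hours of joint work and the remaining recordings would need around 36 more, which is why unannotated recordings are so common.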

My Data
  This brings me back to the ruminations that started this post, namely that the data a linguist produces, and which underpins their work, ought to be accessible to other scientists and linguists. When I first submitted my PhD at NTU, I took a look at some of the options for data archiving, and I approached the university library (which keeps digital copies of all theses submitted at the university) to see if they could also store my audio and transcription data (over 1GB). About a year ago, they contacted me to let me know that they were developing such a service, something called DataVerse, and wanted to know if they could use my dataset to test it. I was happy to have them do so, and after some tweaking and time, this tool is now available for use.

DataVerse is a database/archive tool developed at Harvard University that allows researchers to store datasets that other researchers can download for use and testing. It supports the Open Science initiative by making data accessible and open. It also solves one of the problems I noted above, that cited data is often not readily available, by creating a unique URL identifier and citation for the dataset. You can check out my dataset at its DOI here and download it for research and non-commercial purposes.
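For those who prefer scripts to web pages, a dataset's persistent identifier can also be used with the Dataverse web API. The sketch below only builds the request URLs and does not contact any server. The server address and DOI in it are placeholders (not my dataset's actual identifier), the helper name is made up for illustration, and whether the bulk-download endpoint is available depends on the Dataverse version a given installation runs.

```python
from urllib.parse import urlencode


def dataverse_dataset_url(server, doi, download=False):
    """Build a Dataverse API URL for a dataset identified by its DOI.

    With download=False the URL points at the dataset's JSON metadata;
    with download=True it points at the data access endpoint, which
    returns the dataset's files as a zip archive.
    """
    if download:
        path = "/api/access/dataset/:persistentId/"
    else:
        path = "/api/datasets/:persistentId/"
    # Keep ':' and '/' unescaped so the DOI stays readable in the URL
    query = urlencode({"persistentId": "doi:" + doi}, safe=":/")
    return server.rstrip("/") + path + "?" + query


# Placeholder server and DOI, for illustration only
print(dataverse_dataset_url("https://demo.dataverse.org", "10.0000/EXAMPLE"))
print(dataverse_dataset_url("https://demo.dataverse.org", "10.0000/EXAMPLE", download=True))
```

Substituting the address of the hosting installation and the dataset's real DOI should give URLs that can be opened in a browser or fetched with any HTTP client.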

Further thoughts
As I was thinking about this previously, I realized that what I wanted was not really an archive but a database that would allow me to develop and annotate my data further. Unfortunately DataVerse is not that - it is basically just a storage tool. What is nice is that it provides versioning, so the curator of the dataset can upload and publish changes. I think I may have to create my own database if I want something that will let me explore the data better. But for now, the data is freely accessible for other linguists (even though my analysis isn't perfect), which is a bit of a load off my mind.

    About me

    I'm a linguist and singer-songwriter. I write about life, travel, language and technology.
