Hiram Ring

Computer-assisted syntactic reconstruction

9/16/2019

2 Comments

 
Things have been quite busy in the year or so since I last posted. My wife and I moved to Zurich with our 4-month-old, I helped organize a workshop on word order here at UZH, a proceedings volume from the last ICAAL that I co-edited was published in December, we attended a bunch of conferences (AAS in Denver, SEALS in Tokyo, ICHL in Canberra), I lost my passport on the way to ICAAL in Chiang Mai (I've since applied for another), and various other things happened.

In the meantime we have also been hard at work digitizing, transcribing, and annotating data from multiple Austroasiatic languages. Alongside this effort we have been developing semi-automated ways of comparing clauses and identifying possible correspondences for syntactic reconstruction. The field of syntactic reconstruction has been gaining traction over the past decade as a viable area for study in historical linguistics (see here, here, and here for some work), and it's exciting to be working on ways that computers can help us in this task.

One interesting observation we can make is that our methodology does actually identify crosslinguistic structural similarities. We can see this in the following plot, which compares the similarity scores our method assigns to clause pairs in two datasets (thanks to Damian Blasi for suggesting this means of assessing our method). The first dataset is our current dataset, with over 9,000 annotated clauses; across 10 languages in 5 subgroups, this results in over 23 million pairwise comparisons. The second dataset is composed of the same clauses, but with the elements in each clause randomized by language. The plotted lines show the distribution of similarity judgments across each dataset.
[Figure: distribution of similarity scores for the annotated vs. randomized datasets]
We can see that, using our method for clause comparison, the randomized dataset shows a normal distribution, which is what we expect from unstructured data. With the same method, however, the dataset of annotated clauses in Austroasiatic languages shows a non-normal distribution. This tells us both that the real language data is structured AND that our method for measuring similarity picks up on this structure, identifying a higher degree of similarity between clauses in languages that we know are related.
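
To make the randomization baseline concrete, here is a minimal sketch of the idea in Python. The similarity measure below (Jaccard overlap of annotated clause-element labels) is a stand-in chosen purely for illustration, not the measure we actually use, and the clause data are invented:

```python
import random
from itertools import combinations

# Toy clauses: each clause is a list of annotated element labels (invented data).
clauses = {
    "lang_A": [["S", "V", "O"], ["S", "V"], ["S", "V", "O", "OBL"]],
    "lang_B": [["S", "O", "V"], ["S", "V"], ["S", "OBL", "V"]],
}

def similarity(a, b):
    """Stand-in measure: Jaccard overlap of element labels (order ignored)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def pairwise_scores(data):
    """Score every cross-language clause pair."""
    scores = []
    for (_, cls1), (_, cls2) in combinations(data.items(), 2):
        scores.extend(similarity(c1, c2) for c1 in cls1 for c2 in cls2)
    return scores

def randomized(data, seed=0):
    """Shuffle the pool of elements within each language, keeping clause lengths."""
    rng = random.Random(seed)
    out = {}
    for lang, cls in data.items():
        pool = [el for c in cls for el in c]
        rng.shuffle(pool)
        out[lang], i = [], 0
        for c in cls:
            out[lang].append(pool[i:i + len(c)])
            i += len(c)
    return out

real_scores = pairwise_scores(clauses)
rand_scores = pairwise_scores(randomized(clauses))
print("real mean:", sum(real_scores) / len(real_scores))
print("randomized mean:", sum(rand_scores) / len(rand_scores))
```

On real data one would compare the full score distributions (as in the plot above) rather than just the means.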

This raises a lot of new questions and highlights the need for more testing to identify the best way of assessing similarity between clauses in a systematic and linguistically appropriate manner. Fortunately our project is not yet over!

ICAAL 7 Proceedings volume

9/12/2019

0 Comments

 
[Photo: Planting rice near Jowai, NE India]
At the end of last year the ICAAL 7 proceedings volume was published by University of Hawai'i Press as a special issue of the Journal of the South-East Asian Linguistics Society (JSEALS). The 8th ICAAL was just held in Chiang Mai, and so I think it's worth writing a bit about the 7th ICAAL proceedings, even at such a late date. I co-edited the issue with Felix Rau (University of Cologne), and I wrote more details about it in a Twitter thread when it was first released. It was my first attempt at (co-)editing an issue/volume, and it was a good experience, made more so by an excellent co-editor, generally timely responses from authors and reviewers alike, and the support/advice of Mark Alves, Paul Sidwell, and Mathias Jenny. It was such a good experience, in fact, that Felix and I have agreed to edit the proceedings from ICAAL 8.

I won't go into great detail about the papers, since the issue is open-access, and so anyone interested can follow the link above and download/read the abstracts/papers at their leisure. But I do want to highlight a few general points about the encouraging direction it shows for Austroasiatic studies. There is also an extensive backstory to the history of the International Conference on Austro-Asiatic Linguistics (ICAAL) that provides a bit more context. One take on at least part of that backstory can be found here, and more can be found here.
[Photo: Presenters and attendees of the 7th ICAAL in Kiel, Germany]
​The seventh International Conference on Austro-Asiatic Linguistics (attendees pictured above) was held in Kiel, Germany in 2017. One point that we note in the introduction to these papers is that this is only the fourth published proceedings volume since the conference's inception in 1973. Over a span of 40+ years, 7 ICAAL meetings have been held, and proceedings have been published for just over half. There are various reasons for this, but we hope that this special issue is part of a trend.

Bolstering this trend is the fact that the majority of the papers in this special issue are by relatively young linguists. In the field of Austroasiatic linguistics there are well-known and well-cited scholars such as Harry Shorto, Gerard Diffloth, Philip Jenner, Eugenie Henderson, Norman Zide, Geoffrey Benjamin, Ilia Peiros, Patricia Donegan and Michel Ferlus, but the majority of their work was done in the 60s-90s. Some of these scholars have passed on, and only a few scholars such as Niclas Burenhult, Nicole Kruspe, Paul Sidwell, Greg Anderson, Mark Alves, Nathan Badenoch and Mathias Jenny have 'carried the torch', as it were, and worked to extend and expand our knowledge of the Austroasiatic languages, especially in the last 10 years. Thanks to their efforts, however, and especially to their mentorship, there is a growing number of young scholars who are working on these languages, providing important insights and datasets for other scholars.

The focus on data and the attempt to make primary data accessible is a particularly heartening feature of this issue. The data is accessible either through online, open-access repositories or through included examples, tables, or appendices. While previous work on Austroasiatic languages included such data, the inclusion of online repositories follows a growing trend in the social sciences whereby underlying data can be assessed and results can be replicated by other scientists, or an analysis can be contradicted or refined. The main benefit of this 'open science' approach is that it shifts the focus away from the individual and how well they argue for a position, and toward what the best interpretation of the data is and whether the data supports the individual's argument. This is only possible when the data is accessible.

Fieldwork in Myanmar

7/24/2018

0 Comments

 
I just returned from fieldwork in Myanmar, where we spent several days training staff and students at two different universities (Mandalay University and Yangon University of Foreign Languages) in fieldwork and data-gathering techniques. I also spent a week with the team in a village where a Palaung language is spoken (in the mountains of Shan State), conducting fieldwork with staff and students from Mandalay University's Department of Anthropology. We were able to get lots of great recordings of stories, histories, and explanations of cultural practices in the local language, with translations in Burmese.
[Photo: Tea leaf (lahpet) salad, a favorite snack and a main export of the village where we did fieldwork.]
This will be part of an ongoing collaboration between the University of Zurich and these universities in Myanmar, in which we are helping to build capacity and share our different kinds of expertise across departments. There is still a lot of descriptive work that needs to be done on these language groups and on their connections to larger Myanmar society. I'm still learning a lot about how Myanmar culture and language work (I'll be taking a Burmese language course at UZH in the coming semester), and I'm looking forward to working more with the universities in Myanmar in the future.
[Photo: Shwedagon pagoda]
On some of our 'off' days, we went sightseeing in Yangon and in Bagan. While I'd been to the Shwedagon pagoda before, Bagan was a pretty incredible place (although possibly the hottest place I've ever been) simply because there are SO MANY temples and pagodas. Literally everywhere you look, if you have any kind of a view above the trees, you see a temple or a pagoda, or multiples of both. This time of year is not the best for visiting (so hot), and apparently you used to be able to climb more temples a few years ago, before the earthquakes made this dangerous, but it was still pretty amazing.
[Photo: Bagan]

Moving to Zurich

2/21/2018

1 Comment

 
The last post was a bit of a brain dump to make sure I didn't forget a few lessons I learned, in part because I knew I was quitting the job that involved doing ML-type things. While I was working there I of course learned a lot and (I think) acquitted myself pretty well, but language processing and machine learning are not really what I spent 4 years doing for my PhD. Python is a programming language that I picked up to make my work in grammatical description and syntax easier, and while I find ML (and programming) pretty interesting, my main interest lies in understanding how languages work through comparison, with the ultimate goal of reconstructing linguistic structures and (hopefully) prehistory.

A year and a half ago or so, I started working on a grant proposal for exactly that with some researchers at the University of Zurich. This is a relatively young department that is doing some really cool research in typology, processing, and language acquisition from a corpus-based perspective on multiple languages (both Indo-European and non-IE families/phyla). At the same time, historical linguistics is a huge focus in the department, as is modeling language change. This is super exciting because I take the perspective that language is spoken by individuals in communities who acquire language from their forebears (history) and use it as a tool for communication (processing), which gives rise to statistical tendencies that all languages share (typology). Since it is individuals using language, this is done in an idiosyncratic way, but since language is learned and guided by principles of processing, the only way to get at both the commonality and the uniqueness of language is by investigating actual language corpora (recordings, transcriptions, etc.). Of course the story of how languages change is much more complex and involves many more factors, but be that as it may, this is a great place to be.
[Photo: Lake Zurich from the hill above the university]
So, long story short, we found out last October that the grant had been funded, and the family and I started making plans to move to Zurich. More on that later, perhaps. With this project, our goal at the moment is to build a database of Austroasiatic language corpora that we can then investigate for all sorts of interesting phenomena, focusing (initially at least) on word order. By comparing word order across multiple languages of the same family, we intend to work toward reconstructing the form of the parent languages from which the present-day spoken languages diverged, and to identify language contact and interaction effects that can contribute to discussions about how word order patterns develop cross-linguistically.

I've been here only a few weeks, and our first year of the project involves a lot of data collection, so I'll be traveling quite a bit and having to learn some more languages (working on Swiss-German and Burmese right now). But even in the first few weeks we've made some progress and I'm excited to share more as the research continues. I'm definitely doing Python programming, and it looks like I'll learn some Javascript for various tools we intend to build. Maybe I'll even get to use machine learning at some point.

Neural Networks: Some lessons from text classification

12/3/2017

1 Comment

 
I have been quite lax with posting here, mainly because I have been working very hard on a difficult problem. My current job involves (among other things) trying to automate a psychological coding system for text. This is pretty similar to research in Sentiment Analysis (SA, see this brief introduction), a task in Natural Language Processing (NLP) that attempts to predict the 'positivity' or 'negativity' of sentences (i.e. classification), for example on Twitter. Companies find this useful for (among other things) getting a broad understanding of how consumers respond to their products and services.

The coding system I am trying to automate is a bit more complex, but I am using techniques similar to those used in SA. I started with SVMs and other classifiers from the scikit-learn library, and I have now moved on to using Neural Networks. These are essentially Machine Learning models that allow a computer to generalize patterns in data that correspond to particular outputs. Humans do this pretty naturally - we recognize patterns in language, for example, that allow us to parse sounds into words and words into units of meaning that, when combined, help us communicate.

A data-driven approach
  The idea with neural networks is that, given enough labeled data, the computer's statistical model can capture the patterns that correspond to the labels and predict what the labels should be for data it has never seen before. This is essentially how optical character recognition (OCR) works - enough data has been fed to the computer that it has 'learned' what an "A" character looks like in a variety of images. But the model can only make good generalizations if it has a lot of data and a good diversity of data.

It is a common issue, especially in the field of Linguistics, that the amount of data available for a particular problem is limited (e.g. descriptions of languages are often based on a 5-hour recorded corpus supplemented with elicitation, and psycholinguistic experiments are usually conducted with sample sizes of 5-20 participants, though larger samples are preferable). This is partly because of how time-consuming data collection can be - combine this with the fact that you might be dealing with multiple languages and the issue is compounded. But even within a single language, particular problems only have limited datasets, and if we want to automate a system that has typically required trained humans (consider that PhD students who describe a language usually train/read for at least a year before fieldwork, after having completed an MA), there might be even less data than usual available for training, particularly data in the correct format.

Addressing problems in the neural network
  This lack of data is a significant problem that we have been working to overcome by creating more coded data. In building ML models I have continued working with the existing data to achieve decent results on a small dataset, the hope being that once more coded data is available it can be incorporated and give a better result. Along the way I have come across several important learning points that I thought I would write here as a reference of sorts, to add to other great posts that have been helpful to me.
​
  1. Check your code for bugs.
      I am coding in Python simply because it has lots of great libraries for Machine Learning (Keras, Tensorflow, Theano), data transformation and storage (Pandas, Numpy), and language processing (NLTK, SpaCy). Sometimes there are little bugs that give your model the wrong kind of data to generalize, so it learns nothing. Garbage in, garbage out, unfortunately. Printing to the console at key points helps you to identify these bugs.
  2. Check your data for errors.
      I am not a specialist in the coding system I am trying to automate, but I am a linguist, which helps me to recognize when the input text is wrong. For the people labeling the data, this is probably the most time-consuming process - checking each other's work. With linguistic data, I have found that it is helpful at times to normalize the text, but not too much. Basic steps like lowercasing and removing punctuation can help normalize the data, as can simple spelling correction, but part-of-speech tagging and lemmatizing or stemming may not help at all. Some of this has to do with the degree to which NLP tools can accurately lemmatize or stem words. For example, WordNet is commonly used for lemmatizing, but if the POS tag is incorrect then it returns entirely the wrong word.
  3. Balance your data.
      This is probably the most important thing to do when pre-processing your data before feeding it into your model, especially when working with small datasets. Your data may be highly skewed toward one label. In my case, with one of the binary classification problems I'm working on, the ratio of zeros to ones in the data is roughly 10:1. If you are splitting your data into training and validation sets (which you should be!), it is possible that your proportions change depending on how the split is set up. With a random split, your training data could end up with a label ratio of 15:1, while your validation data could be 6:1. This makes it much harder for the model to generalize over unseen data. Just fixing the proportions in the training and validation sets (and then shuffling each set independently) improved my correlations with unseen data by about 0.20 (Pearson's r) on average. (See the sketch after this list for what this looks like in code.)
  4. Predict on completely unseen data.
      Many of the papers I have been reading (and I've been reading a lot!) report the accuracy of their model on validation data, which is usually a subset of the overall dataset (between 10% and 50%, depending on the problem). This is fine, but the validation data is typically used to refine the weights of the overall model from epoch to epoch. To my mind, 94% accuracy (which is what some papers report for their SA models) only tells you how well the model has learned from the dataset, not how it performs on real-world instances. Ideally, to test this you should keep a separate dataset out of the training/validation altogether and then try to predict its labels using your model. Looking at the correlations between the predicted labels and the actual labels will then give you a much better idea of the accuracy of the model on unseen data.
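
To illustrate points 3 and 4, here is a minimal sketch using scikit-learn and SciPy. The data are invented, a plain logistic regression stands in for the actual neural network, and scikit-learn's stratify= argument is a shortcut for the manual proportion-fixing described above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from scipy.stats import pearsonr

# Invented, highly imbalanced toy data (roughly 10:1 zeros to ones).
rng = np.random.default_rng(0)
X = rng.normal(size=(1100, 20))
y = np.array([0] * 1000 + [1] * 100)
X[y == 1] += 0.75  # give the minority class some signal

# Point 4: hold out a test set that never touches training or validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Point 3: stratify the train/validation split so both keep the same label ratio.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=2)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))

# Correlate predictions with the truly unseen labels (Pearson's r).
r, _ = pearsonr(model.predict(X_test), y_test)
print("test-set Pearson's r:", round(r, 3))
```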

Final thoughts and further links
  These are just a few of the things that I have learned in the past few months while trying to sort out my classification problems. Papers and online forums have been extremely helpful in developing my understanding of the issues involved, and I have benefitted particularly from this blog and this one (among others) and from examples of models on the Keras GitHub repository. For a good (though brief) discussion of trying to implement state-of-the-art text classification models, see this post. Ultimately, as one person noted on a forum, developing neural networks is as much an art as a science, requiring experimentation and intuition to figure out how to apply a particular model architecture to solve a particular problem.

Open Science, Data, and Linguistics

8/18/2017

0 Comments

 
One of the concerns that has occupied my mind for the past few years is the question of data accessibility in the field of Linguistics. I am happy to announce that the data that underpins my grammatical description of Pnar is now freely available as a downloadable archive in audio and text form (anonymized where requested by participants). You can find the link to the dataset at the bottom of this post, but in the meantime I'd like to explain my views surrounding data access and give a brief explanation of the tool I've used to make my linguistic data accessible.

Why accessible data?
  Those who are familiar with linguistics understand that traditional descriptions of language are often based on recorded, transcribed and translated interviews and stories by speakers of the language. Although some theoretical work may be based on a few utterances or a single example, most linguistic work is based on many actual examples from utterances that real speakers produce.

One issue here is that there is so much between- and within-speaker variation in speech that, unless the data you use is actually accessible to other linguists, one can easily question the veracity of a particular analysis. In the interest of scientific enquiry, then, it is incumbent on the analyst-linguist to make their actual data accessible to other researchers in at least some form, whether in an archive or in a database. Having the data accessible to multiple researchers may lead to disagreements about analysis (there may be more than one way of analyzing a particular linguistic structure, for example), but ultimately such disagreements are healthy because they expand our knowledge.

Research verifiability/reproducibility
  This touches on a larger issue in the world of science, that of verifiability and reproducibility of research, which has galvanized the larger scientific community towards Open Science (see this blog post for an explanation, and check out this OSF paper), and in some fields, such as Psychology, has actually resulted in a whole journal devoted to "replication studies". These kinds of studies aim to replicate the results and findings of a particular study by following the same procedure as the original researchers. When replication studies uphold a particular result, it makes it more likely that the original study's findings were not the result of a statistical anomaly or falsification of data, which is a very serious problem that can lead to erroneous claims requiring retraction. For more on this, visit RetractionWatch.com.

What this means in the case of linguistic data is that the recordings, transcriptions, and translations that underlie a grammatical description or other study, whenever possible, should be made accessible to other linguists. Data sharing can be a touchy issue simply because of a) the ethical concerns of the providers of the data, b) potential cultural taboos, and c) because of the interests of the linguist who initially made and processed the data.

With proper permissions sought and precautions taken, these concerns can be minimized or dealt with appropriately. A linguist needs to (minimally) communicate to participants how the data will be used, take the time to anonymize recordings and annotations when necessary, and create a license that constrains how the data can be used in the future. Ideally, if you are doing your research correctly, your university's Institutional Review Board will have already helped you to think through these things. There are also some excellent books, papers and chapters that deal (at least somewhat) with this subject, and there are sets of standards for social science research (with human subjects) and specifically for linguistics that researchers should be aware of.

Some reasons linguists don't share data
  The final point (C, the interests of the linguist) is really the sticking point for most people. The reality is that many linguists do not want to release data for several reasons:
  1. They haven't had time to go through it themselves to their satisfaction.
      This is often the main reason data doesn't get shared. It is common in fieldwork linguistics to collect many hours of recordings that a linguist never has time to annotate. In my case I recorded something like 11 hours of stories and conversations, but I was only able to transcribe and translate (and annotate) around 8 hours or so. The other 3 hours just didn't get processed, and this is one of my future tasks - to sit with a speaker and spend the time transcribing and translating. Consider, for example, that 5 minutes of recorded speech often takes something like one hour of time to transcribe and annotate with the help of a native speaker. This is very time-consuming and laborious work, which means that often there are recordings that remain with very little annotation and no transcription.
  2. They are worried that their analysis will be critiqued.
      Many linguists who do fieldwork are essentially apprentices, and are just starting to learn how to analyze linguistic structure on their own (i.e. PhD students). It can be extremely intimidating to know that you have lots of questions about how a language works, even after working on it for several years, and to know at the same time that people who have 20+ years of experience on multiple languages may be critiquing your data collection and analysis. I think this happens in many different disciplines, and it can be a barrier to making data public simply because of the personal fear that individuals can have.
  3. They are worried that their work will be 'stolen' or repackaged.
      The fact that this is even a concern is telling about both the field of linguistics and the way data is treated. In the field of linguistics, it is really incumbent on senior linguists to honor data 'ownership' or curation by the primary data collector, by citing data properly.
      Ownership is a bit of an issue to work out sometimes, as beginning field linguists are often paid or supported in their work by a supervisor's grant. I think the best way is to follow a 'time-spent' principle. That is, the person who has spent the most time with the data has the largest share of ownership of the final form of the dataset. This is strictly regarding the annotation, transcription, and translation of the dataset (the speakers who speak on the recordings obviously have a different kind of ownership).
      Other kinds of ownership or use ought to be negotiated very early on by interested parties - in my case, for example, my supervisor and I agreed that the data I created would be available to him for research use, and that I would share it relatively freely with others for non-commercial use. Regarding data ownership, linguists can honor this by citing the source or providing proper attribution, but it may be the case that the data cited is not readily available. It may be printed in the back of a grammatical description (or some portions of it may be), but more often it is located in a collection of notebooks on the shelf of the linguist somewhere, gathering dust.

My Data
  This brings me back to the earlier ruminations that started this post, namely that data produced by a linguist and which underpins their work ought to be accessible to other scientists and linguists. When I first submitted my PhD at NTU, I took a look at some of the options for data archiving, and I approached the university library (which keeps digital copies of all theses submitted at the university) to see if they could also store my audio and transcription data (over 1GB). About a year ago, they contacted me to let me know that they were developing such a service, something called DataVerse, and wanted to know if they could use my dataset to test it. I was happy to have them do so, and after some tweaking and time, this tool is now available for use.

DataVerse is a database/archive tool developed at Harvard University that allows researchers to store datasets that other researchers can download for use and testing. It supports the Open Science initiative by making data accessible and open. It also solves one of the problems I noted above by creating a unique URL identifier and citation for the dataset. You can check out my dataset at its DOI here and download it for research and non-commercial purposes.

Further thoughts
As I was thinking about this previously, I realized that what I wanted was not really an archive but a database that would allow me to develop and annotate my data further. Unfortunately DataVerse is not that - it is basically just a storage tool. What is nice is that it provides versioning, so the curator of the dataset can upload and publish changes. I think I may have to create my own database if I want something that will let me explore the data better. But for now, the data is freely accessible for other linguists (even though my analysis isn't perfect), which is a bit of a load off my mind.

Bibliography Management (BibDesk)

8/3/2017

0 Comments

 
In a previous post I discussed some of the benefits I discovered in using LaTeX with LyX as a front-end. Another extremely useful tool to learn how to use is a Bibliography manager. If you are like I was, and often develop a new bibliography for each paper you write, this is something you might want to consider. On the other hand, you might be used to Microsoft applications and already be familiar with bibliography managers (such as Zotero) that integrate pretty well with the MS Office family.

As I started my PhD and began writing more papers, I noticed that many of my citations were the same. Rather than copy-pasting from previous papers and then adjusting the formatting for each submission, I realized it would be much easier to have a centralized location for all the papers I wanted to cite, and have the computer deal with the formatting according to the style sheet I needed to use.

Fortunately, support for this is ‘baked-in’ to LaTeX/LyX. There are various good tools that integrate with BibTeX (as the bibliography manager in LaTeX is called), but the one that I found to be most useful for my purposes is BibDesk. Rather than explain how I do it, I’ll point you to this excellent tutorial for Mac, which describes how to set it up. In the rest of this post I’ll simply give my reasons for using BibDesk.

  1. It centralizes bibliography management.

      Using BibDesk allows me to have a single repository for all the details of every publication I will ever need to cite. I keep this single (small) text file in a cloud folder, so I will (ideally) never lose it. All of my academic writing accesses this same text file database.
      As I read new publications and decide that I want to cite them, I add them to my main bibliography database. I will never face the problem again of having to build a completely new bibliography, unless I decide to change fields completely.

  2. It forces me to be more exacting about my workflow and citation data.

      Previously, I might let one citation slip through that only had a first initial, or another minor error. Since I’m forced to fill out the same form for each citation, I also have to be more precise about the details of the reference.
      I now try to find each author’s full names (sometimes this is impossible, unfortunately), because I know some bibliography styles require full first names, etc. The same goes for full names of journals or a series - the particular style will define how things get abbreviated, if at all. And this has actually led to some interesting discoveries - by forcing myself to find the full names, I have been led to other relevant publications that were ignored by previous writers on a particular topic.

  3. It allows me to annotate and auto-file relevant documents.

      Annotations such as notes to myself (of quotable quotes, page numbers, keywords, etc..) make a particular publication much easier to find. Not only that, but if I can attach a relevant document like a webpage or a PDF, I can much more easily re-read the article. Having a searchable database is incredibly useful, particularly if the article itself is open-source and machine-readable. BibDesk can take an existing document and re-file it for you according to conventions defined by you (the user), which means that you can have a single folder housing all relevant publication documents.
      I work with quite a number of old documents that are out of copyright, and it is extremely useful to be able to download an old scan (from Google or Archive.org) and have it re-named and re-filed simply by dragging it onto the relevant bibliography entry in BibDesk. If I have an OCR’d version, then searching for keywords in the BibDesk search bar can bring up entries that I didn’t realize were even relevant.

Each of these reasons alone is enough to justify getting your act together and creating a single bibliography repository. You can likely think of other good reasons, which just means there is no excuse not to do so.

Another issue that I am thinking about is how to make my library of citations/documents available on any computer with internet access via the cloud. This would ensure at the very least that I wouldn't worry as much about losing it if my computer dies (though I'm still going to make backups regularly). A fellow academic and friend of mine has managed to integrate all of his citations and PDFs with Zotero, and make it available on his phone in an app like Google Drive, so that at conferences he can remember a publication or search for one in a conversation and pull up the reference and/or associated document to show people. This is super cool and super useful - I’ll write about it if I can figure out how to pull it off on my own, or maybe I’ll get him to write about it.

LaTeX and LyX for (thesis) writing

6/16/2017

2 Comments

 
​One of the first things I did after passing my PhD confirmation exercise (like a qualifying exam in the USA) was to research the best way to write my thesis. As a side note, I use the word 'thesis' to refer to any large written work, including a PhD, while other English speakers might use the word 'dissertation' to refer specifically to the work that a PhD student produces. In any case, the relevant information here is the term "large", since I knew I was going to be writing a lot. I now consider the tools I'm writing about here to be essential for a productive workflow, and so this post continues the theme of an earlier post on linguistic tools.

In researching how to write my thesis, I asked friends and fellow linguists who had written grammatical descriptions. Most of them had used MS Word, and told me horror stories of lost work, un-editable documents due to the sheer size of their files, difficulties formatting and printing the thing, etc.. So that was out of the question for me, at least at the time (2011; I think more recent versions of MS Office may have fixed some of these issues). But one of them mentioned a program called LaTeX (the funny capitalization is actually part of the name), and that it made typesetting and organization a breeze. And it's free! Which is pretty important to students (if not everyone).

So I checked it out, and ended up spending the next few months learning how to set it up on my computer and how to use it (I use MacTeX as the backend). I am fortunate that I have a little background in coding, because LaTeX is essentially a markup language. You write the text of what you want, formatting parts of it by using special combinations of characters and commands (or 'tags') that tell a program how to format them. Then you run a 'compiler' that outputs everything in the correct layout in a PDF. This is pretty brilliant, because it lets you (the writer) worry mostly about the content rather than the format. But learning how to fiddle with the code is rather time-consuming, so if you're not a hardcore programmer (and I still don't really consider myself one of the hardcore types) there is quite a learning curve. Worth it, but steep.

This is where a visual editor like LyX comes in. LyX is, pretty much out of the box, a simple way of interacting with your LaTeX code. It hides most of the code and offers formatting options, similar to MS Word or other word processors. Unlike them, however, you choose the general formatting parameters and let the backend handle the layout. You can also fiddle directly with the code if you need to, or add code to the front of the document for particular use cases, like a PhD cover page, interlinearized glossed text (IGT) examples, and more. Basically anything you need to add has probably been coded or figured out by someone, and if you're a troubleshooter like me you can run a Google search and find forums (and contribute to some yourself) that deal with your particular problem or at least something similar. And the assistance you get can be pretty phenomenal.

LyX does take a bit of configuration, and I might write another post that explains how I set it up for my use case(s). But for now, I’ll just say that using LaTeX/LyX was one of the best decisions I made as a PhD student. It really simplified my writing process and allowed me to do so much more. Rather than spending the final month on formatting my thesis, I was writing and making final changes all the way up to the deadline. I probably wrote more, and re-organized the structure more, in the last month than I had in the previous three. And the text file that contains my 700+ pages of analysis, examples, and appendices is only ~6 MB. Possibly the greatest benefit was that LyX kept track of all my linked example sentences, and formatted them all properly. Once I got it set up this saved me days and weeks of man-hours. The learning curve was totally worth it.

In closing, if you are seriously considering using LaTeX/LyX, there are lots of good articles about this online. Here's one, and here's a discussion on the topic, to get you started.

Analyze tone and plot: A Praat script for exploring, analyzing, and visualizing or plotting F0 (pitch) tracks

5/6/2017

1 Comment

 
   Along the lines of some of my previous posts on tools for linguistic analysis, I thought I'd post a quick update on a Praat script I've been working on. While not part of the series of posts describing my tools and workflow, this particular script is a recent tool I've developed for analyzing and exploring tonal patterns.
   As I noted in my post on Praat scripting, the Praat program gives linguists a relatively easy way to investigate the properties of speech sounds, and scripts help to automate relatively mundane tasks. But while Praat can easily measure the pitch correlates of tone, there are no comprehensive scripts that help a linguist investigate tones instrumentally in a systematic way. Let me first explain a bit.

What is 'Tone'?
   In this post, I use the word 'tone' to refer specifically to pitch that is used in a language at the word or morpheme level to indicate a difference in meaning from another word (or words) that are otherwise segmentally identical. In linguistics, this is a "suprasegmental" feature, which essentially means that it occurs in parallel with the phonetic segments of speech created by the movement of articulators (tongue, mouth, etc..) of the human vocal apparatus.
   Tone has a lot of similarity to pitch as used in music, which is why tones often seem to 'disappear' or be relatively unimportant in songs sung in languages that otherwise have tone (like Mandarin). Another important piece of information to remember is that there are quite a few different kinds of tone languages. The major tone types are 'register' and 'contour' tones - often languages will be categorized as having one or the other, but some have both types. African languages tend to have register tone: this means that pitch height (measured as frequency, often in Hz) is the primary acoustic correlate of tone in these languages. Asian languages tend to have contour tone: this means that across the tone-bearing unit (TBU) the pitch may rise or fall.
   There are also languages for which the primary acoustic correlate of tone (or some tones) is not pitch, but rather voice quality: glottalization (i.e. Burmese), jitter, shimmer, creakiness, or length. Languages for which pitch is not the overall primary acoustic correlate of tone are often called 'register' languages. Register languages are found throughout South-East Asia.

Measuring tones
   The pitch correlates of tones are relatively easy for Praat to identify, as their instrumentally observable correlate is the fundamental frequency (F0) of a sound. Praat can extract the fundamental frequency as well as other frequencies (formants) that are important for various vowels. Tone researchers often use a script to extract F0 and then apply various normalization techniques to the extracted data, plotting the results in a spreadsheet program like Excel or with a more robust statistical tool like R (see this paper or this post for more details, and James Stanford has done some good work on this subject - see his presentation on Socio-tonetics here).
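
   As a rough illustration of that general extract-normalize-plot workflow (not my Praat script, which does all of this inside Praat itself), here is a short Python sketch using the parselmouth library. The file name and interval times are invented; in practice the labeled intervals would come from a TextGrid:

```python
import numpy as np
import matplotlib.pyplot as plt
import parselmouth  # pip install praat-parselmouth

N_POINTS = 30  # resample every F0 track to this many points

def f0_track(sound, start, end, floor=75, ceiling=300):
    """Extract the F0 track for one labeled interval and length-normalize it."""
    part = sound.extract_part(from_time=start, to_time=end)
    pitch = part.to_pitch(pitch_floor=floor, pitch_ceiling=ceiling)
    f0 = pitch.selected_array['frequency']
    f0 = f0[f0 > 0]  # drop unvoiced frames
    if len(f0) < 2:
        return None
    return np.interp(np.linspace(0, 1, N_POINTS),
                     np.linspace(0, 1, len(f0)), f0)

sound = parselmouth.Sound("speaker1.wav")  # hypothetical recording
# Hypothetical (label, start, end) triples; normally read from the TextGrid.
intervals = [("T1", 0.10, 0.32), ("T2", 0.55, 0.81), ("T1", 1.02, 1.25)]

tracks = {}
for label, start, end in intervals:
    track = f0_track(sound, start, end)
    if track is not None:
        tracks.setdefault(label, []).append(track)

# One mean (normalized) track per tone category.
for label, ts in tracks.items():
    plt.plot(np.mean(ts, axis=0), label=label)
plt.xlabel("normalized time")
plt.ylabel("F0 (Hz)")
plt.legend()
plt.show()
```
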
   One issue with this approach, however, is that often pitch tracks for each tone are time-normalized along with every other tone. So for example, in image #1 below of Cantonese tones (from this paper), all pitches are plotted across the same duration. This is fine if pitch is the primary acoustic correlate of all tones, but what if duration/length is an important correlate as well? How does a researcher begin to discover whether that is the case?
[Image #1: Cantonese tones, with all pitch tracks plotted across the same duration]
Script for analyzing tone
   The problem I tried to solve was basically this: I was investigating a language (Biate) for which no research had been done on tone, and I needed a way to quickly visualize the tonal (F0) properties of a large number of items. I also wanted to easily re-adjust the visualization parameters so that I could output a permanent image. Since I couldn't find an existing tool that could do this, and since I wanted it all to be done in a single program (rather than switching between two or several), I created the Praat script that you can find here at my GitHub page under the folder 'Analyze Tone'. If you look closely, you'll notice that this script incorporates several parts (or large chunks) of other people's work, and I'm indebted to their pioneering spirit.
   There is a good bit of documentation at the page itself, so I'll leave you to get the script and explore, and will give only a brief explanation and example below. I have so far only tested this script on a Mac, so if you have a PC and would like to test it, please do so and send me feedback. Or, if you're a GitHub user, feel free to branch the repo and submit corrections.

   The basic function of the script is to automate F0 extraction, normalization, and visualization using Praat in combination with annotated TextGrid/audio file pairs. Segments that are labeled the same in TextGrids belonging to the same audio file are treated the same. You can start out by plotting all tracks, in which case all tones labeled the same will be drawn with the same color. Or you can create normalized pitch traces, in which case a single F0 track will be plotted for all tones with the same label. This script creates audio files on your hard drive as one of its steps, so make sure you have enough free space (each file is usually less than 50 KB, but if you have hundreds, they can add up).
   In order to start, simply open an audio file in Praat and annotate it with the help of a TextGrid (if you don't know how, try following this video tutorial). Create a label tier that only contains tone category labels, and add labels to boundaries that surround each tone-bearing-unit. Then save the textgrid file in the same directory as the audio file, and run the script.

   The kind of output you can get with this script is shown below. The first picture shows all the tones plotted for a single speaker, and the normalized F0 tracks for each tone are shown in the second picture. Importantly, F0 tracks are length-normalized only within each tone. You will also notice that the second image zooms in closer so that the relative differences between tones can be seen more easily.
[Image #2: all F0 tracks for a single speaker, colored by tone category]
[Image #3: normalized F0 tracks, one per tone category]
Reasons for visualization
  These kinds of visualizations are important for several reasons.
  1. Tonal realizations often differ greatly both between speakers and within a single speaker's speech. This is because each individual's vocal apparatus is different, and also because the articulatory context in which they produce each tone differs. A tonal target produced with a high front vowel [i] and after a dental [t] will be different from one produced with an open central [a] and after a velar [g], at least to an instrument. These differences may average out through normalization, but likely only if A) there are enough tokens and B) they are actually the same tone category according to a native speaker.
  2. An instrument like Praat cannot perceive the differences in the same way that a native speaker can. This is also why Praat can give the wrong measurements of fundamental frequency. F0 is only one of the component frequencies within a sound wave. If the measurement parameters are incorrect, Praat may measure a different frequency than the fundamental - this is why it is important to set the parameters according to each speaker (male fundamental frequencies are usually found between 75-300 Hz and female F0 is usually found between 100-500 Hz; mean F0 is often around 120 Hz for men and 220 Hz for women due to their different physiology).
  3. Immediate visualization means that incorrect acoustic measurements by Praat can be corrected. At the same time, mixed or incorrect plots for pitch production can be observed and the tokens can be filtered manually by the researcher. Unfortunately the labeling must be done manually, but it is relatively easy to check whether a marginal tone is actually present or at least whether pitch is one of its acoustic correlates.

   If you have labeled tones from multiple speakers, it is also relatively easy to see whether differences between tones are actually salient. In image #2 above, for example, we can see that most of the red tones (Tone 1) cluster with a higher (rising) contour, whereas the blue tones (Tone 2) cluster with a lower (falling) contour. A few blue tones have a rising F0 track, which may be outliers. In the lower image #3, we see that normalizing across 100 or so instances of each tone shows little effect from those rising blue tones, which strongly suggests they were outliers.
   We can also see in this normalized plot that there is a clear difference between Tone 1 (red) and Tone 2 (blue), but the marginal Tone 3 (green) was questionable for most speakers that I recorded, and in the plot it is virtually identical to Tone 2. If this is true for the majority of speakers, we can conclude that Tone 3 probably doesn't actually exist in Biate as a tone correlated with pitch.

   Of course all of these visualizations need qualification, and further investigation is needed by any researcher who uses this tool - your methods of data collection and analysis are crucial to ensuring that the information you gain from your investigation leads to the correct analysis. But speakers of a tone language will often clearly identify categories, and what you need is a tool to make these categories visible to those who may not perceive them.

Future work
   I'm hoping in future to expand the coverage of this tool in order to assist with visualizing other acoustic correlates of tone, such as glottalization and creakiness. But I'm not sure when I'll get around to it. If you are a Praat script writer and would like to give it a shot, please join the collaboration and make adjustments - I'd be happy to incorporate them. For now, though, enjoy plotting pitch tracks, and let me know if you have any trouble!

Using Git for Linguistics

4/24/2017

0 Comments

 
Recently, I started to explore using Git to maintain and organize my data. This includes both the primary data I work from as a linguist and the various kinds of data I produce in the form of written material (books, papers, etc..). As a field linguist, I primarily work with text that is transcribed from audio and video recordings. Early on I developed an archiving system to preserve the original files in their raw form as much as possible, but I didn't develop a similar system for maintaining my working files, and ended up with lots of duplicates of files that I had copied to various places for various purposes.

In this fifth post on Linguistic tools (the overview of which is here), I plan to describe the way I am trying to overcome this issue and become more organized by adopting a version control system. Mostly, though, I'll just try to explain what Git is and what I use it for. This is by no means a complete tutorial, since other people have done that better, but hopefully it will provide some direction for others who are interested in streamlining and organizing their workflow.

When I started learning to program in Python and was introduced to Git, I saw the benefits for coding, but began to wonder if it was possible/useful to use it for other organizational tasks in relation to other kinds of writing and data structures. There are a few blog posts about using Git for paper writing here and here and here, but they are written primarily for coders who also write papers (I'm looking at you, academic comp-sci folks!) and don't translate super well to my use case.

What is Git?
  First, what is Git? Git is a version control system, which means it tracks changes in a repository (or 'repo'), basically serving as a series of backups for your work. It also has tools that allow you to make a copy (or 'branch'), work on it separately, and then when you're ready, you can see the changes made to individual files and 'merge' them together into a new version. For a simple tutorial, see here, and here is a great visual walkthrough.

Most explanations of Git use the example of a tree or a river, which do help to understand the process, but for total newbies (and possibly to address the difficulty of finding a good metaphor) I came up with the following metaphor, and find it to be a bit more useful.

The Git Metaphor
  Imagine you have a drawing project. You have an empty piece of paper - this is your initial blank Git repository. Now you draw a picture on the piece of paper and decide it's an ok version - this is a snapshot that is your first 'commit'. You then decide you want to make some adjustments to the drawing, but you want to preserve the initial drawing, so you take out another piece of paper, lay it on top of the first, trace out the shape in pencil, and then begin to modify the edges of the drawing or the background by erasing or adding new bits. When you're happy with that you 'commit' it, and then every time you modify it, you repeat the process of first tracing it out on a piece of paper. Each of these pieces of paper stack, so if you want to go back to an earlier drawing, you can, without losing any of the other drawings.

  Now let's say your friend wants to work on the same drawing project. You can give him your latest drawing, and he can copy it and work on it in the same way. Then when he's done he can give it back to you and you can choose which bits of his version you want to incorporate into your current version. This is super useful for projects with lots of collaborators, especially if you're writing code, since code is basically a set of text files that are relatively easy to merge based on differences in lines.

[Note that this isn't exactly what the Git system does - it doesn't actually create a NEW version of the repository, like the tracing paper, but instead uses the most recently committed version as the new base/foundation for future modifications.]

The most important thing I've had to keep in mind is to write myself detailed commit messages, and to figure out a good protocol or format for these messages. This is because once you've committed something, you can go back and search through your commits (save states), but this is only easy IF YOU KNOW WHAT THE COMMIT CONTAINS!!! Using the full filename of the committed file, for example, is much better than writing out 'New version of Paper'. Giving more information, such as about what you changed in the file, is even better.

My Use Case
My use case, however (as opposed to collaborating on code with a large team), is a single repository of data in which I will occasionally share a single paper with collaborators. What I want is a bit more like a backup system (think Apple's Time Machine or the versioning system in Mac OS) but with a lot more control. Git, for example, works on the command line and requires you to provide messages for each commit you make, which also requires you, the user, to be clear about what the changes are and why they were made. This (ideally) means less clutter and assists with organization.

In thinking about my needs and doing research, I discovered a drawback of version control systems like Git. If I have a single repository but am working on multiple papers at once (like I often do), guess what happens when I save changes to document A and then check out a previous version of paper B? Yup, everything reverts to the state it was in before, and I 'lose' all the changes I just made to A. I put 'lose' in quotes here because I don't actually lose anything, as long as I had committed it properly, but I'd then have to check out the commit, copy the file, and then revert... anyway, it's annoying.

What I need
So what I REALLY need is a single Git repository with the ability to track individual folders/files separately with their own version histories (like embedding repositories). Git can do this in several ways, none of which are particularly intuitive: submodules, subtrees, subrepo, and simple embedding. The first two are designed for somewhat different use cases than mine, namely using (part of) someone else's code in your own project. They're not really meant to create sub-repositories in your local repo to track separate histories. The third, subrepo, looks quite promising but requires that you run everything from the root of your main repo, and since I have lots of organizational subfolders, typing the names every time just gets tedious. So, I use the strategy of simple embedding.

Simple Embedding
By simple embedding, I refer to navigating to a subfolder within my main repo that I want to track and running the command 'git init'. This does two things:
  1. it allows me to track changes within this folder separately from the main git repo. 
  2. it makes the changes in this folder somewhat opaque to the main repo.
This means that changes in the subfolder will propagate only at the intervals where I commit ALL to the main repo. Changes committed in the subfolder, however, will retain the history of the subfolder, before the most recent commit of the main repo. Metaphorically, making a separate repo of the subfolder allows me to return to the basement of a particular room in the structure and rummage around down there while retaining the fully built house around me.

This is probably a hack, and not the way Git was intended to be used, but it works for me. The only problem I foresee happening is making too many Git repositories, and not being able to keep track of the multiplicity. So I plan to use this sparingly. For the most part, I'm ok with committing changes across the whole repository, since one of the real benefits of using a VCS is the freedom to throw things away.

Some other great posts on version control systems include this one on the benefits of version control, this one on several different version control systems for writers, and this post on the limitations of GitHub for writers - the last of a 6-part series worth reading if you are debating whether to start on the Git/version control journey. And finally, this one is about one person's suggestions for academic writing and Git.

In any case, Git or another VCS is definitely worth checking out. I also recently discovered that Git repos are supported 'out of the box' by LyX, my go-to LaTeX editor! More on that to come... I've also come across SRC (Simple Revision Control), which focuses on tracking revisions to single files. This might actually be what I need. I'm still exploring it, and it might not be necessary since my LaTeX editor supports Git. We'll see.

    About me

    I'm a linguist and singer-songwriter. I write about life, travel, language and technology.
