Hiram Ring

Open Science, Data, and Linguistics

8/18/2017

0 Comments

 
One of the concerns that has occupied my mind for the past few years is the question of data accessibility in the field of Linguistics. I am happy to announce that the data that underpins my grammatical description of Pnar is now freely available as a downloadable archive in audio and text form (anonymized where requested by participants). You can find the link to the dataset at the bottom of this post, but in the meantime I'd like to explain my views on data access and give a brief explanation of the tool I've used to make my linguistic data accessible.

Why accessible data?
  Those who are familiar with linguistics understand that traditional descriptions of language are often based on recorded, transcribed, and translated interviews and stories told by speakers of the language. Although some theoretical work may be based on a few utterances or a single example, most linguistic work is based on many examples drawn from utterances that real speakers produce.

One issue here is that there is so much between- and within-speaker variation in speech that, unless the data underlying an analysis is accessible to other linguists, the veracity of that analysis can easily be questioned. In the interest of scientific enquiry, then, it is incumbent on the analyst-linguist to make their actual data accessible to other researchers in at least some form, whether in an archive or in a database. Having the data accessible to multiple researchers may lead to disagreements about analysis (there may be more than one way of analyzing a particular linguistic structure, for example), but ultimately such disagreements are healthy because they expand our knowledge.

Research verifiability/reproducibility
  This touches on a larger issue in the world of science: the verifiability and reproducibility of research. This issue has galvanized the larger scientific community towards Open Science (see this blog post for an explanation, and check out this OSF paper), and in some fields, such as Psychology, it has actually resulted in a whole journal devoted to "replication studies". These kinds of studies aim to replicate the results and findings of a particular study by following the same procedure as the original researchers. When replication studies uphold a particular result, it becomes more likely that the original study's findings were not the result of a statistical anomaly or falsification of data - a very serious problem that can lead to erroneous claims requiring retraction. For more on this, visit RetractionWatch.com.

What this means in the case of linguistic data is that the recordings, transcriptions, and translations that underlie a grammatical description or other study should, whenever possible, be made accessible to other linguists. Data sharing can be a touchy issue because of (a) the ethical concerns of the providers of the data, (b) potential cultural taboos, and (c) the interests of the linguist who initially made and processed the data.

With proper permissions sought and precautions taken, these concerns can be minimized or dealt with appropriately. A linguist needs to (minimally) communicate to participants how the data will be used, take the time to anonymize recordings and annotations when necessary, and create a license that constrains how the data can be used in the future. Ideally, if you are doing your research correctly, your university's Institutional Review Board will have already helped you to think through these things. There are also some excellent books, papers and chapters that deal (at least somewhat) with this subject, and there is a set of standards for social science research (with human subjects), and specifically for linguistics, that researchers should be aware of.

Some reasons linguists don't share data
  The final point (c, the interests of the linguist) is really the sticking point for most people. The reality is that many linguists do not want to release data, for several reasons:
  1. They haven't had time to go through it themselves to their satisfaction.
      This is often the main reason data doesn't get shared. It is common in fieldwork linguistics to collect many hours of recordings that a linguist never has time to annotate. In my case I recorded something like 11 hours of stories and conversations, but I was only able to transcribe and translate (and annotate) around 8 hours. The other 3 hours just didn't get processed, and this is one of my future tasks - to sit with a speaker and spend the time transcribing and translating. Consider, for example, that 5 minutes of recorded speech often takes something like an hour to transcribe and annotate with the help of a native speaker. This is very time-consuming and laborious work, which means that there are often recordings that remain with very little annotation and no transcription.
  2. They are worried that their analysis will be critiqued.
      Many linguists who do fieldwork are essentially apprentices, and are just starting to learn how to analyze linguistic structure on their own (i.e. PhD students). It can be extremely intimidating to know that you have lots of questions about how a language works, even after working on it for several years, and to know at the same time that people who have 20+ years of experience on multiple languages may be critiquing your data collection and analysis. I think this happens in many different disciplines, and it can be a barrier to making data public simply because of the personal fear that individuals can have.
  3. They are worried that their work will be 'stolen' or repackaged.
      The fact that this is even a concern is telling about both the field of linguistics and the way data is treated. In the field of linguistics, it is really incumbent on senior linguists to honor data 'ownership' or curation by the primary data collector, by citing data properly.
      Ownership is a bit of an issue to work out sometimes, as beginning field linguists are often paid or supported in their work by a supervisor's grant. I think the best way is to follow a 'time-spent' principle. That is, the person who has spent the most time with the data has the largest share of ownership of the final form of the dataset. This is strictly regarding the annotation, transcription, and translation of the dataset (the speakers who speak on the recordings obviously have a different kind of ownership).
      Other kinds of ownership or use ought to be negotiated very early on by interested parties - in my case, for example, my supervisor and I agreed that the data I created would be available to him for research use, and that I would share it relatively freely with others for non-commercial use. Regarding data ownership, linguists can honor this by citing the source or providing proper attribution, but it may be the case that the data cited is not readily available. It may be printed in the back of a grammatical description (or some portions of it may be), but more often it is located in a collection of notebooks on the shelf of the linguist somewhere, gathering dust.

My Data
  This brings me back to the earlier ruminations that started this post, namely that data produced by a linguist and which underpins their work ought to be accessible to other scientists and linguists. When I first submitted my PhD at NTU, I took a look at some of the options for data archiving, and I approached the university library (which keeps digital copies of all theses submitted at the university) to see if they could also store my audio and transcription data (over 1GB). About a year ago, they contacted me to let me know that they were developing such a service, something called DataVerse, and wanted to know if they could use my dataset to test it. I was happy to have them do so, and after some tweaking and time, this tool is now available for use.

DataVerse is a database/archive tool developed at Harvard University that allows researchers to store datasets that other researchers can download for use and testing. It supports the Open Science initiative by making data accessible and open. It also solves one of the problems I noted above by creating a unique URL identifier and citation for the dataset. You can check out my dataset at its DOI here and download it for research and non-commercial purposes.

Further thoughts
As I was thinking about this previously, I realized that what I wanted was not really an archive but a database that would allow me to develop and annotate my data further. Unfortunately DataVerse is not that - it is basically just a storage tool. What is nice is that it provides versioning, so the curator of the dataset can upload and publish changes. I think I may have to create my own database if I want something that will let me explore the data better. But for now, the data is freely accessible for other linguists (even though my analysis isn't perfect), which is a bit of a load off my mind.

Bibliography Management (BibDesk)

8/3/2017

0 Comments

 
In a previous post I discussed some of the benefits I discovered in using LaTeX with LyX as a front-end. Another extremely useful tool to learn how to use is a Bibliography manager. If you are like I was, and often develop a new bibliography for each paper you write, this is something you might want to consider. On the other hand, you might be used to Microsoft applications and already be familiar with bibliography managers (such as Zotero) that integrate pretty well with the MS Office family.

As I started my PhD and began writing more papers, I noticed that many of my citations were the same. Rather than copy-pasting from previous papers and then adjusting the formatting for each submission, I realized it would be much easier to have a centralized location for all the papers I wanted to cite, and have the computer deal with the formatting according to the style sheet I needed to use.

Fortunately, support for this is ‘baked-in’ to LaTeX/LyX. There are various good tools that integrate with BibTeX (as the bibliography manager in LaTeX is called), but the one that I found to be most useful for my purposes is BibDesk. Rather than explain how I do it, I’ll point you to this excellent tutorial for Mac, which describes how to set it up. In the rest of this post I’ll simply give my reasons for using BibDesk.

  1. It centralizes bibliography management.

      Using BibDesk allows me to have a single repository for all the details of every publication I will ever need to cite. I keep this single (small) text file in a cloud folder, so I will (ideally) never lose it. All of my academic writing accesses this same text file database.
      As I read new publications and decide that I want to cite them, I add them to my main bibliography database. I will never again face the problem of having to build a completely new bibliography, unless I decide to change fields completely.

  2. It forces me to be more exacting about my workflow and citation data.

      Previously, I might let a citation slip through that had only a first initial or some other minor error. Since I'm forced to fill out the same form for each citation, I also have to be more precise about the details of each reference.
      I now try to find each author's full name (sometimes this is impossible, unfortunately), because I know some bibliography styles require full first names, etc. The same goes for the full names of journals or series - the particular style will define how things get abbreviated, if at all. And this has actually led to some interesting discoveries - by forcing myself to find the full names, I have been led to other relevant publications that were overlooked by previous writers on a particular topic.

  3. It allows me to annotate and auto-file relevant documents.

      Annotations such as notes to myself (of quotable quotes, page numbers, keywords, etc.) make a particular publication much easier to find. Not only that, but if I can attach a relevant document like a webpage or a PDF, I can much more easily re-read the article. Having a searchable database is incredibly useful, particularly if the article itself is open-access and machine-readable. BibDesk can take an existing document and re-file it for you according to conventions defined by you (the user), which means that you can have a single folder housing all relevant publication documents.
      I work with quite a number of old documents that are out of copyright, and it is extremely useful to be able to download an old scan (from Google or Archive.org) and have it re-named and re-filed simply by dragging it onto the relevant bibliography entry in BibDesk. If I have an OCR’d version, then searching for keywords in the BibDesk search bar can bring up entries that I didn’t realize were even relevant.

Each of these reasons alone is worth the effort of getting your act together and creating a single bibliography repository. You can likely think of other good reasons, which just means there is no excuse not to do so.

Another issue that I am thinking about is how to make my library of citations/documents available on any computer with internet access via the cloud. This would ensure at the very least that I wouldn't worry as much about losing it if my computer dies (though I'm still going to make backups regularly). A fellow academic and friend of mine has managed to integrate all of his citations and PDFs with Zotero, and make it available on his phone in an app like Google Drive, so that at conferences he can remember a publication or search for one in a conversation and pull up the reference and/or associated document to show people. This is super cool and super useful - I’ll write about it if I can figure out how to pull it off on my own, or maybe I’ll get him to write about it.

LaTeX and LyX for (thesis) writing

6/16/2017

2 Comments

 
One of the first things I did after passing my PhD confirmation exercise (like a qualifying exam in the USA) was to research the best way to write my thesis. As a side note, I use the word 'thesis' to refer to any large written work, including a PhD, while other English speakers might use the word 'dissertation' to refer specifically to the work that a PhD student produces. In any case, the relevant word here is "large", since I knew I was going to be writing a lot. I now consider the tools I'm writing about here to be essential for a productive workflow, and so this post continues the theme of an earlier post on linguistic tools.

In researching how to write my thesis, I asked friends and fellow linguists who had written grammatical descriptions. Most of them had used MS Word, and told me horror stories of lost work, un-editable documents due to the sheer size of their files, difficulties formatting and printing the thing, and so on. So that was out of the question for me, at least at the time (2011; I think more recent versions of MS Office may have fixed some of these issues). But one of them mentioned a program called LaTeX (the funny capitalization is actually part of the name), and said that it made typesetting and organization a breeze. And it's free, which is pretty important to students (if not everyone).

So I checked it out, and ended up spending the next few months learning how to set it up on my computer and how to use it (I use MacTeX as the backend). I am fortunate that I have a little background in coding, because LaTeX is essentially a markup language. You write the text of what you want, formatting parts of it by using special combinations of characters and commands (or 'tags') that tell a program how to format them. Then you run a 'compiler' that outputs everything in the correct layout in a PDF. This is pretty brilliant, because it lets you (the writer) worry mostly about the content rather than the format. But learning how to fiddle with the code is rather time-consuming, so if you're not a hardcore programmer (and I still don't really consider myself one of the hardcore types) there is quite a learning curve. Worth it, but steep.

This is where a visual editor like LyX comes in. LyX is, pretty much out of the box, a simple way of interacting with your LaTeX code. It hides most of the code and offers formatting options, similar to MS Word or other word processors. Unlike them, however, you choose the general formatting parameters and let the backend handle the layout. You can also fiddle directly with the code if you need to, or add code to the front of the document for particular use cases, like a PhD cover page, interlinearized glossed text (IGT) examples, and more. Basically anything you need to add has probably been coded or figured out by someone, and if you're a troubleshooter like me you can run a Google search and find forums (and contribute to some yourself) that deal with your particular problem or at least something similar. And the assistance you get can be pretty phenomenal.

LyX does take a bit of configuration, and I might write another post that explains how I set it up for my use case(s). But for now, I’ll just say that using LaTeX/LyX was one of the best decisions I made as a PhD student. It really simplified my writing process and allowed me to do so much more. Rather than spending the final month on formatting my thesis, I was writing and making final changes all the way up to the deadline. I probably wrote more, and re-organized the structure more, in the last month than I had in the previous three. And the text file that contains my 700+ pages of analysis, examples, and appendices is only ~6 MB. Possibly the greatest benefit was that LyX kept track of all my linked example sentences, and formatted them all properly. Once I got it set up this saved me days and weeks of man-hours. The learning curve was totally worth it.

In closing, if you are seriously considering using LaTeX/LyX, there are lots of good articles about this online. Here's one, and here's a discussion on the topic, to get you started.

Using Git for Linguistics

4/24/2017

0 Comments

 
Recently, I started to explore using Git to maintain and organize my data. This includes both the primary data I work from as a linguist and the various kinds of data I produce in the form of written material (books, papers, etc.). As a field linguist, I primarily work with text that is transcribed from audio and video recordings. Early on I developed an archiving system to preserve the original files in their raw form as much as possible, but I didn't develop a similar system for maintaining my working files, and I ended up with lots of duplicates of files that I had copied to various places for various purposes.

In this fifth post on Linguistic tools (the overview of which is here), I plan to describe the way I am trying to overcome this issue and become more organized by adopting a version control system. Mostly, though, I'll just try to explain what Git is and what I use it for. This is by no means a complete tutorial, since other people have done that better, but hopefully it will provide some direction for others who are interested in streamlining and organizing their workflow.

When I started learning to program in Python and was introduced to Git, I saw the benefits for coding, but I began to wonder whether it was possible/useful to use it for other organizational tasks relating to other kinds of writing and data structures. There are a few blog posts about using Git for paper writing here and here and here, but they are written primarily for coders who also write papers (I'm looking at you, academic comp-sci folks!) and don't translate super well to my use case.

What is Git?
  First, what is Git? Git is a version control system, which means it tracks changes in a repository (or 'repo'), basically serving as a series of backups for your work. It also has tools that let you make a copy (or 'branch'), work on it separately, and then, when you're ready, review the changes made to individual files and 'merge' them back together into a new version. For a simple tutorial, see here, and here is a great visual walkthrough.

Most explanations of Git use the example of a tree or a river, which does help in understanding the process, but for total newbies (and possibly to address the difficulty of finding a good metaphor) I came up with the following one, which I find a bit more useful.

The Git Metaphor
  Imagine you have a drawing project. You have an empty piece of paper - this is your initial blank Git repository. Now you draw a picture on the piece of paper and decide it's an ok version - this is a snapshot that is your first 'commit'. You then decide you want to make some adjustments to the drawing, but you want to preserve the initial drawing, so you take out another piece of paper, lay it on top of the first, trace out the shape in pencil, and then begin to modify the edges of the drawing or the background by erasing or adding new bits. When you're happy with that you 'commit' it, and then every time you modify it, you repeat the process of first tracing it out on a piece of paper. Each of these pieces of paper stack, so if you want to go back to an earlier drawing, you can, without losing any of the other drawings.

  Now let's say your friend wants to work on the same drawing project. You can give him your latest drawing, and he can copy it and work on it in the same way. Then when he's done he can give it back to you, and you can choose which bits of his version you want to incorporate into your current version. This is super useful for projects with lots of collaborators, especially if you're writing code, since code files are basically text files and are relatively easy to merge based on differences in lines.

[Note that this isn't exactly what the Git system does - it doesn't actually create a NEW version of the repository, like the tracing paper, but instead uses the most recently committed version as the new base/foundation for future modifications.]

The most important thing I've had to keep in mind is to write myself detailed commit messages, and to figure out a good protocol or format for these messages. This is because once you've committed something, you can go back and search through your commits (save states), but this is only easy IF YOU KNOW WHAT THE COMMIT CONTAINS!!! Using the full filename of the committed file, for example, is much better than writing out 'New version of Paper'. Giving more information, such as about what you changed in the file, is even better.

My Use Case
My use case, however (as opposed to collaborating on code with a large team), is a single repository of data from which I will occasionally share a single paper with collaborators. What I want is a bit more like a backup system (think Apple's Time Machine or the versioning system in Mac OS) but with a lot more control. Git, for example, works on the command line and requires you to provide a message for each commit you make, which also requires you, the user, to be clear about what the changes are and why they were made. This (ideally) means less clutter and assists with organization.

In thinking about my needs and doing research, I discovered a drawback of version control systems like Git. If I have a single repository but am working on multiple papers at once (as I often do), guess what happens when I save changes to document A and then check out a previous version of paper B? Yup, everything reverts to the state it was in before, and I 'lose' all the changes I just made to A. I put 'lose' in quotes here because I don't actually lose anything, as long as I had committed it properly, but I'd then have to check out the commit, copy the file, and then revert... anyway, it's annoying.

What I need
So what I REALLY need is a single Git repository with the ability to track individual folders/files separately with their own version histories (like embedding repositories). Git can do this in several ways, none of which are particularly intuitive: submodules, subtrees, subrepo, and simple embedding. The first two are designed for somewhat different use cases than mine, namely using (part of) someone else's code in your own project. They're not really meant to create sub-repositories in your local repo to track separate histories. The third, subrepo, looks quite promising but requires that you run everything from the root of your main repo, and since I have lots of organizational subfolders, typing the names every time just gets tedious. So, I use the strategy of simple embedding.

Simple Embedding
By simple embedding, I refer to navigating to a subfolder within my main repo that I want to track and running the command 'git init'. This does two things:
  1. it allows me to track changes within this folder separately from the main git repo. 
  2. it makes the changes in this folder somewhat opaque to the main repo.
This means that changes in the subfolder only propagate to the main repo at the points where I commit ALL of it to the main repo. Commits made within the subfolder, however, retain the subfolder's own history, including states from before the main repo's most recent commit. Metaphorically, making a separate repo of the subfolder allows me to return to the basement of a particular room in the structure and rummage around down there while retaining the fully built house around me.

This is probably a hack, and not the way Git was intended to be used, but it works for me. The only problem I foresee is creating too many Git repositories and then not being able to keep track of them all. So I plan to use this sparingly. For the most part, I'm ok with committing changes across the whole repository, since one of the real benefits of using a VCS is the freedom to throw things away.
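
For keeping track of that multiplicity, something like the following Python sketch would do - it just walks the main repo and lists every subfolder that has its own '.git'. The path at the bottom is a placeholder, not my actual folder layout:

```python
# Sketch: list every folder under the main repo that has its own .git,
# so embedded repositories don't get forgotten. Paths are placeholders.
import os

def find_embedded_repos(main_repo):
    """Yield subfolders of main_repo that contain their own .git directory."""
    main_repo = os.path.abspath(main_repo)
    for dirpath, dirnames, _filenames in os.walk(main_repo):
        if ".git" in dirnames:
            dirnames.remove(".git")          # never descend into .git itself
            if os.path.abspath(dirpath) != main_repo:
                yield dirpath                # an embedded (nested) repository

if __name__ == "__main__":
    # Placeholder path: point this at your own main repository.
    for repo in find_embedded_repos(os.path.expanduser("~/linguistics-data")):
        print(repo)
```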

Some other great posts on version control systems include this one on the benefits of version control, this one on several different version control systems for writers, and this post on the limitations of GitHub for writers - the last of a 6-part series worth reading if you are debating whether to start on the Git/version control journey. And finally, this one is about one person's suggestions for academic writing and Git.

In any case, Git or another VCS is definitely worth checking out. I also recently discovered that Git repos are supported 'out of the box' by LyX, my go-to LaTeX editor! More on that to come... I've also come across SRC (Simple Revision Control), which focuses on tracking revisions to single files. This might actually be what I need. I'm still exploring it, and it might not be necessary since my LaTeX editor supports Git. We'll see.

Grammatical Analysis: the Transcriber-Toolbox workflow

4/12/2017

0 Comments

 
This is the fourth installment in a series of blog posts where I discuss some of the tools I use for linguistic analysis. In a previous post I described the tools I use for processing and converting video and audio. This is partly important for using the audio with Praat for acoustic analysis, but also because ultimately I want to transcribe the recordings and analyze the grammatical structure (using the transcriptions) while remaining linked to the recordings. Ideally, such a link would allow me to play back the recording from within the program that I use to analyze the grammatical structure of the language, so I'm not constantly having to open the sound files and scrub to the correct part of the WAV.

Fortunately, this IS possible, and it currently involves using two larger programs, Transcriber and Toolbox, along with a small tool/script that I wrote for conversion (you can read more details about the script HERE). The original workflow, which my supervisor Alec Coupe taught me, depended on another script developed by Andrew Margetts, which was very useful but required internet access - possibly problematic if one was doing fieldwork. Then a few years ago the site went down and conversion became a multi-step process involving emails. So I wrote my own script. But I'm getting ahead of myself.

The workflow:
The first step in this process is getting your audio file transcribed. It is possible to do this in Transcriber (thus the name), and the newly repackaged version for Mac works pretty well (as does 1.5.1 on Windows or in a Windows VirtualBox). You can also use another program to play back the audio (e.g. Audacity, which can slow down the audio for you and let you boost the signal, etc.) while transcribing in a text editor. Then you copy/paste the text into Transcriber for time-alignment.

I use Transcriber for time-alignment because it's simple, it allows for extremely short time-stamped windows, and it produces text files (basically) as output. There are other tools such as SayMore (Windows only) that also allow you to transcribe and directly output files for use with Toolbox, but in using it I found that SayMore was a bit of a memory hog and wouldn't let me create time-stamped sections shorter than half a second, which could be problematic for short interjections. And it kept freezing, so that it took a really long time to transcribe a short example.

So with Transcriber I simply open the audio file that I want to time-align, copy the block of text that I've transcribed, and then listen to the audio file, follow along in the text, and insert timestamps ('enter' or 'return' key) at the correct points. Once this is done for the whole file (time-consuming, but not as time-consuming as the transcription) I save the file (a '.trs' format).

Once you have the 'TRS' file, you can use this converter, which you download to your computer. The converter runs as either a Python script or a Windows executable that you run in the folder containing your 'TRS' file. When you run it for the first time it asks you for the field names in your Toolbox settings and creates a configuration file in the folder to store them. If it doesn't find the settings file the next time, it will ask again. Once it has the settings, it outputs a 'TXT' file for each 'TRS' file in the folder.

Once you've got the 'TXT' file, you can open it up to view it. If you already work with Toolbox, you can see that it is already (mostly) in the correct format. If you don't work with Toolbox, try installing it and go through the tutorials with a new project. Then open the Toolbox-created text file and compare it to the one created by my script from the Transcriber file. You should be able to copy/paste the content of the newly-created text file into your Toolbox project text file with no issues. Now when you open your Toolbox project file in Toolbox, you should be able to see the timestamps and play the sound file (and portions of it) from within Toolbox, provided you put the audio file in the correct location.
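
To give a sense of what that Toolbox-ready text looks like, here is a stripped-down sketch of the conversion idea. This is not the trs2txt converter itself, and the field names are just the ELAN-style codes mentioned below - your own Toolbox settings may use different ones:

```python
# Sketch only: write Toolbox-style backslash-coded records from already
# time-aligned segments. Field names (\ref, \ELANParticipant, \ELANBegin,
# \ELANEnd, \tx) are examples; real projects configure their own codes.
segments = [
    # (start_seconds, end_seconds, speaker, transcription) -- dummy data
    (0.00, 2.35, "SPK1", "first transcribed utterance"),
    (2.35, 4.10, "SPK2", "second transcribed utterance"),
]

def to_toolbox(segments, text_id="example_text"):
    records = []
    for i, (start, end, speaker, tx) in enumerate(segments, start=1):
        records.append("\n".join([
            f"\\ref {text_id}.{i:03d}",
            f"\\ELANParticipant {speaker}",
            f"\\ELANBegin {start:.3f}",
            f"\\ELANEnd {end:.3f}",
            f"\\tx {tx}",
            "",                      # blank line separates Toolbox records
        ]))
    return "\n".join(records)

with open("example_text.txt", "w", encoding="utf-8") as f:
    f.write(to_toolbox(segments))
```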

Why Toolbox?
This may seem a bit cumbersome, raising the question "Why use Toolbox at all?"
  • "Because Toolbox files are text files"
    This is the short answer. Text files are easy to store (archivable) and extract information from. They are easy to work with in Python when I want to search through the corpus (see the sketch after this list) or convert them into LaTeX for display as interlinearized glossed text (IGT) for other linguists.
  • Toolbox is fast
    This program is quite small and not resource-heavy, making it extremely useful for small laptops. I have also used the much heavier Fieldworks Language Explorer (FLEx), but it has some drawbacks.
  • It doesn't make assumptions
    Toolbox treats clitics, affixes, and all other sorts of word types and word formations according to how the user defines them, rather than according to a pre-selected set of (theoretical) assumptions. Other programs such as FLEx seem to make unwarranted assumptions about a language and therefore actually get in the way of exploring how the language works. FLEx is great for wordlists and shared projects, but I find that Toolbox preserves the underlying format better and allows me more manual control. If I decide to change the way I represent things in Toolbox, I do have to manually go back and interlinearize things again, but that is arguably beneficial, as it forces me to get to know the language data better.
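
As an illustration of the first point above, here is a rough sketch of the kind of corpus search I mean. It assumes records separated by blank lines, a \ref marker, and a \tx field for the transcription line - all of which depend on your particular Toolbox settings:

```python
# Sketch: search a Toolbox corpus (plain text, backslash-coded fields) for a
# word and print the record marker plus the matching transcription line.
# Assumes records are separated by blank lines and transcriptions sit in \tx
# fields -- adjust the field codes to match your own Toolbox settings.
def search_toolbox(path, word):
    with open(path, encoding="utf-8") as f:
        records = f.read().split("\n\n")
    for record in records:
        ref, hits = None, []
        for line in record.splitlines():
            if line.startswith("\\ref "):
                ref = line[5:].strip()
            elif line.startswith("\\tx ") and word in line:
                hits.append(line[4:].strip())
        for hit in hits:
            print(f"{ref}\t{hit}")

search_toolbox("example_text.txt", "utterance")   # placeholder file and word
```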

A final note (ELAN)
As I end this post, I should note that many linguists use ELAN for transcription and annotation of video/audio. In fact, the field names suggested by the 'trs2txt' script are ELANBegin and ELANEnd for the start and end codes, respectively, and ELANParticipant for the participant code, which should facilitate importing into ELAN. Unfortunately, I haven't found ELAN to otherwise be particularly useful for my purposes, primarily because it doesn't have the dictionary-linking functions that Toolbox has, at least the last time I looked at it. It does have a Toolbox import function, so I have seen people import a Toolbox file, re-link their video, and output an HTML web page with embedded video for display on a website (thus the code labels in the script). Maybe you could also transcribe in ELAN, export to Toolbox, do your analysis, and import back into ELAN. For now, though, I think I'll stick with the Transcriber-Toolbox workflow.

Praat and scripts: an introduction

3/26/2017

2 Comments

 
In this third post about linguistic tools, I'll be discussing software that I use for acoustic analysis. Praat is one of the premier acoustic analysis tools available for computers. While there are probably commercial software products out there that are more powerful and have more bells and whistles, Praat offers some of the best ways to visualize and manipulate sound while being free and cross-platform. It's not completely intuitive, but it makes it quite easy to explore the sound space of a recording, especially recorded speech. I ran a workshop on the basics of how to use it, with online materials that you can practice with if you want to learn more. There are also other great tutorials online that you should search for.

One of the best features of Praat is the ability to segment sounds using TextGrids, which are basically text files that identify sections of a sound file using timestamps. The benefit of this is that once you have properly annotated a sound file you can use scripts to automate analyses, which saves a lot of time that would otherwise be spent taking individual measurements. When I first started my PhD I spent a good amount of time learning to write Praat scripts, which turned out to be a continuation of the programming I learned when I was younger (Basic, QBasic) and a worthy introduction to programming languages like Python.
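
As a small taste of what automating measurements can look like, here is a sketch - in Python rather than Praat's own scripting language, and assuming a long-format, UTF-8 TextGrid - that pulls the duration of every labelled interval out of a TextGrid:

```python
# Sketch: collect (label, duration) pairs from a long-format TextGrid.
# Assumes UTF-8 encoding; adjust if your TextGrids are saved differently.
import re
from pathlib import Path

def interval_durations(textgrid_path):
    """Return (label, duration_in_seconds) for every labelled interval.

    A long-format TextGrid stores each labelled interval as three lines:
        xmin = 0.1234
        xmax = 0.5678
        text = "a"
    """
    text = Path(textgrid_path).read_text(encoding="utf-8")
    pattern = re.compile(
        r'xmin = ([\d.]+)\s*\n\s*'
        r'xmax = ([\d.]+)\s*\n\s*'
        r'text = "([^"]*)"'
    )
    rows = []
    for xmin, xmax, label in pattern.findall(text):
        if label.strip():                      # skip unlabelled stretches
            rows.append((label, float(xmax) - float(xmin)))
    return rows

if __name__ == "__main__":
    # Placeholder filename: point this at one of your own TextGrids.
    for label, dur in interval_durations("word_list.TextGrid"):
        print(f"{label}\t{dur * 1000:.1f} ms")
```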

Since this has turned out to be a post that discusses Praat scripting, I'm going to introduce/attach some of the scripts I wrote/use for acoustic analysis, and link to some of the many other places you can find scripts for your particular use case. In my case these scripts are mainly in service of documentation and description of endangered and unwritten languages, but maybe others will find them useful as well.

Automatically measuring sounds:
dur_f0_f1_f2_f3_intensity.praat

This script ("dur_f0_f1_f2_f3_intensity.praat") is one that I modified (originally from this script, though more recently I based it on this script) to give automatic measurements of segmented sounds in a TextGrid. It is an updated version of the "msr&check…" file that I made available along with the workshop I linked to above. At the time, I had recorded several wordlists in Pnar, and I spent countless hours segmenting the sounds in each word. My thinking was that even if my segmentation wasn't precise, the sheer number of sounds and their tabulation would allow me to run valid quantitative analyses. As it worked out, this was mostly the case, and I was able to target the outliers for closer examination. I also got better at recognizing Pnar sounds from all the time I spent with the words. I have now updated this script to work nicely with the following script, which plots vowels for you in the Praat picture window and can produce print-publication-friendly images.

Vowel plot for formants:
draw_formants_plot_std_dev.praat

Another script that I wrote/modified from other bits ("draw_formants_plot_std_dev.praat") takes a comma-delimited (CSV) spreadsheet with formant values and plots them (in the standard vowel chart format) as a Praat drawing, with an oval marking their standard deviation. I wrote this primarily to produce a clearer image than the one produced by JPlotFormants for my PhD thesis. Thanks also to the Praat User Group for their help with getting the script right.
  I recently modified this script to work nicely with the automatic measurement script above. This means that you can segment all your words using TextGrids, run the script above to produce a CSV, and then just run this script to plot the characters from that CSV. I implemented a 'Sequential' option for the plot so you can plot one vowel at a time, which means that you can leave all the segmented consonants (and VOT annotations) in the CSV file for later analysis. Or you can remove them - up to you. Just keep in mind that if you do have consonants in the CSV, it WILL try to plot them on the chart unless you choose the Sequential option.
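
If you would rather plot from that CSV in Python than in the Praat picture window, roughly the same idea can be sketched with matplotlib. The column names below are placeholders - match them to whatever your measurement script actually writes out:

```python
# Sketch: plot vowel means with standard-deviation bars from a CSV of formant
# measurements. Column names ("vowel", "F1", "F2") are placeholders.
import csv
from collections import defaultdict
from statistics import mean, stdev

import matplotlib.pyplot as plt

by_vowel = defaultdict(lambda: ([], []))          # vowel -> (F1 list, F2 list)
with open("formants.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        by_vowel[row["vowel"]][0].append(float(row["F1"]))
        by_vowel[row["vowel"]][1].append(float(row["F2"]))

for vowel, (f1s, f2s) in by_vowel.items():
    if len(f1s) < 2:
        continue                                  # need >= 2 tokens for a std dev
    plt.errorbar(mean(f2s), mean(f1s), xerr=stdev(f2s), yerr=stdev(f1s),
                 fmt="o", capsize=3)
    plt.annotate(vowel, (mean(f2s), mean(f1s)))

plt.gca().invert_xaxis()    # reversed F2 axis: front vowels on the left
plt.gca().invert_yaxis()    # reversed F1 axis: high (close) vowels at the top
plt.xlabel("F2 (Hz)")
plt.ylabel("F1 (Hz)")
plt.savefig("vowel_plot.png", dpi=300)
```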

Tone Analysis:
tone_analysis.praat

The third script linked here ("tone_analysis.praat") is one I recently wrote to take continuous measurements of tones without normalization. This is more for exploring tonal systems on a per-speaker basis, allowing the investigator to identify whether length is potentially a factor in the characteristics of a particular tone. I am planning to modify it to allow for percentage-based analysis (and thus normalization) of tones, which could be used by the investigator to create clearer plots once they identify the characteristics of the individual tones. But I haven't gotten around to it yet. I'll write another blog post when I do.

As a final note, these scripts are really just the tip of the iceberg when it comes to the kind of analysis you can do in Praat. For more on Praat scripting, check out this great tutorial, Will Styler's excellent blog, the scripts he uses/maintains, these resources at UW and these from UCLA. You can also follow along with Bartlomiej Plichta as he leads you through some scripting lessons in his videos, which are very useful.

Recording and Processing linguistic data

3/13/2017

0 Comments

 
When you discuss doing language documentation and description, one of the first things to know is that you have to collect language data. The primary source of language data is people who speak the language you're interested in, which raises the question of how you record the data. There are some great books and papers on doing linguistic fieldwork of a documentary nature (more than what I've linked to here), but this post is focused more on the tools you use to process your data once it is recorded, as a continuation of my 'Linguistic Tools' post. I also plan to write a longer post on recording audio/video in the field, but for now I'll assume that you've recorded it already. I'll just briefly say that I like using a digital SLR like the Canon Rebel along with a unidirectional mic, in conjunction with a digital audio recorder like the Zoom H4N (ideally with a lapel mic of some kind).

Once you have your data recorded, the next step is to copy it to your computer for processing. Often the digital recordings will be rather large and cumbersome, and you may want to split them into smaller files, depending on how many stories/interactions you recorded. I find post-processing important because it means you can focus on the interaction during the recording itself, and then during processing make notes on all the files, their content, and other metadata that will help later when you're not in the field and can't remember all the details.

In this processing stage you also want to do two very important things:
  1. back up/archive your files with metadata
  2. copy the files to a separate working directory for processing
The first step is extremely important for preserving the raw files that you recorded. I would also recommend backing up on an external drive or two in case your laptop crashes. In my case I have redundant backups, but you might also want to store them securely in 'the cloud' online. The second step listed above is important for ensuring that you don't overwrite your archived data. You should always work on a copy of your data, not on the data itself. Once you've created new versions (split/joined clips, stripped audio, etc.) you might then want to back up these versions with metadata as well.
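
As a sketch of what those two steps can look like in practice (the folder names and metadata fields here are just placeholders), a few lines of Python can copy each raw recording into an archive folder, log a checksum and some notes in a metadata file, and set aside a separate working copy:

```python
# Sketch: archive raw recordings with a checksum and a metadata row, then
# work only on copies. Folder names and metadata fields are placeholders.
import csv
import hashlib
import shutil
from pathlib import Path

RAW = Path("card_dump")            # files straight off the recorder
ARCHIVE = Path("archive/2017-03")  # never edited after this step
WORKING = Path("working")          # processing happens here, on copies

ARCHIVE.mkdir(parents=True, exist_ok=True)
WORKING.mkdir(parents=True, exist_ok=True)

with open(ARCHIVE / "metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "sha1", "speaker", "content_notes"])
    for wav in sorted(RAW.glob("*.WAV")):
        digest = hashlib.sha1(wav.read_bytes()).hexdigest()
        shutil.copy2(wav, ARCHIVE / wav.name)   # archival copy, with timestamps
        shutil.copy2(wav, WORKING / wav.name)   # working copy to split/convert
        writer.writerow([wav.name, digest, "", ""])   # fill notes in by hand
```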

I use two programs for converting video: Media Converter and MPEG Streamclip. You could use just MPEG Streamclip (which has a Windows version), but on a Mac I find that Media Converter is much simpler/easier for reducing the size of a file, stripping out the audio, or other purposes. MPEG Streamclip is great, though, for combining multiple clips or splitting one clip into several. In each conversion you want to ensure that the video/audio quality is not compromised, depending on what you want to use it for. In my case I am mostly doing acoustic analysis, so I'm more interested in preserving the audio at CD quality (16-bit, 44.1 kHz), which is the standard for acoustic analysis and archiving. In any case, since I've backed up the raw files, I can always copy from them if I mess up my working files and need to restore the quality.
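
If you prefer the command line, the same audio-stripping step can also be done with ffmpeg (a tool not discussed in this post). A minimal sketch, assuming ffmpeg is installed and using placeholder filenames:

```python
# Sketch: strip CD-quality audio (16-bit PCM, 44.1 kHz) out of a working copy
# of a video file using ffmpeg. Assumes ffmpeg is on the PATH; filenames are
# placeholders.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "working/story_01.MOV",   # input video (a working copy, not the archive)
        "-vn",                          # drop the video stream
        "-acodec", "pcm_s16le",         # 16-bit PCM
        "-ar", "44100",                 # 44.1 kHz sample rate
        "working/story_01.wav",
    ],
    check=True,
)
```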

To process/convert and work with audio I use Audacity - this is primarily for processing audio, not for acoustic analysis. Audacity supports a large range of encodings and formats, and you can select portions of the sound file to do basic processing like boosting the signal, removing noise, etc. These are generally not the best things to do to an audio signal, but they can be useful. In my case, this is particularly handy when I'm playing the audio back and need to hear what someone said in the background during a conversation, or when I need to do other kinds of manipulation.

I can't stress enough the importance of backing up your data and copying your data files to a new (staging) folder. This ensures that you can always rewind the clock and reset, while being free to explore the data in your working folder with confidence. This should become such an important part of your workflow that it is second nature. We will all make mistakes at some point, but understanding the importance of backing up, and of creating metadata for your backups, will help to mitigate potentially catastrophic events. Happy converting!

Linguistic Tools (Mac/PC)

3/5/2017

1 Comment

 
When I started my PhD program in Linguistics (language documentation and description), I had some experience with linguistic analysis, but not to the degree I would need in order to complete my PhD. I had tuned my ear to be able to hear the sounds of the IPA, and had practiced transcribing and learning a range of languages, but I had never analyzed an unwritten language completely by myself. During the course of my PhD I learned much more about how to analyze languages 'from the ground up', so to speak.

Along the way, I discovered that there were some excellent tools that made me much more effective and efficient at the task of documenting and describing an unwritten language. I was fortunate that I already had a good foundation in recording and processing audio from my experience recording, mixing, and releasing my music, so the fact that the audio data I recorded would form the basis of my analysis didn't faze me. However, there was another whole set of tools that would allow me to investigate the details of the language I planned to work on.

Each of these programs is open source or free, though some are developed for Windows and others for MacOS, which might be a problem for some people. Since I grew up with DOS and Windows but later switched to a Mac, I'm comfortable with both systems. The Apple/Mac laptop was my first choice for travel and portability combined with power, thanks to its build quality. I say 'was' since some of Apple's recent design choices mean I might be switching back to Windows on my next laptop. But for now I run an old version of Windows on my Mac via VirtualBox, or bundle Windows software in a Wine port so I can run it as a native app in MacOS.

I'll plan to describe each of these tools in more detail in future posts, but for now here's

A list of the tools I currently use for my linguistic work:
  • Media Converter and MPEG Streamclip: for converting video [READ MORE]
  • Audacity: for importing/processing sound files
  • Praat: for analysis of speech sounds (phonetics/phonology)
  • Transcriber: for time-alignment of text/sound
  • Toolbox: for analysis of grammatical/lexical structure
  • Git: for version control
  • LaTeX and LyX: for writing
  • BibDesk: for managing references (or Zotero)
  • Python: for text manipulation/investigation

Tools other linguists use, but that I don't use much:
  • ELAN: Many linguists use this for transcription of audio/video, but I have found it difficult to work with for grammatical analysis. I'd really like to learn more about it, but I haven't managed to attend a workshop yet - though if you live near Melbourne, Australia and are free on the 29th of March (2017), maybe you should check this one out.
  • FLEx (Fieldworks Language Explorer): I find this useful for dictionary searches and such, but the interlinearization options are not as flexible as Toolbox.

Transcriber 1.5.2 released for El Capitan!

3/1/2017

0 Comments

 
Just a quick blog post to mention that one of the tools I use in language documentation and description, Transcriber, has been newly repackaged for use with OS X El Capitan! This is a big deal because previous versions (from 2013) failed to work, and then the program was supposedly 'updated' (and still didn't work), so I've been using the 2005 Windows version in a VirtualBox VM. But I just tested the new release (new as of 4 hours ago) and it works great on my Mac (I just have to update the settings to default to UTF-8 for character encoding) and also with my trs2txt converter for Toolbox! Happy transcribing!

