The last post was a bit of a brain dump to make sure I didn't forget a few lessons I learned, in part because I knew I was quitting the job that involved doing ML type things. While I was working there of course I learned a lot and (I think) acquitted myself pretty well, but language processing and machine learning are not really what I spent 4 years doing for my PhD. Python is a programming language that I picked up to make my work in grammatical description and syntax easier, and while I find ML (and programming) pretty interesting, my main interest lies in understanding how languages work through comparison, with the ultimate goal of reconstructing linguistic structures and (hopefully) prehistory.
A year and a half ago or so I started working on a grant proposal for that exact thing with some researchers at the University of Zurich. This is a relatively young department that is doing some really cool research in typology, processing, and language acquisition from a corpus-based perspective on multiple languages (both Indo-European and non-IE families/phyla). At the same time historical linguistics is a huge focus in the department, as is modeling language change. This is super exciting because I take the perspective that language is spoken by individuals in communities who acquire language from their forbears (history), use it as a tool for communication (processing), which gives rise to statistical tendencies that all languages share (typology). Since it is individuals using language, this is done in an idiosyncratic way, but since language is learned and guided by principles of processing, the only way to get at both the commonality and the uniqueness of language is by investigating actual language corpora (recordings, transcriptions, etc). Of course the story of how languages change is much more complex and involves many more factors, but be that as it may, this is a great place to be.
Picture: Lake Zurich from the hill above the university
So, long story short, we found out last October that the grant had been funded, and the family and I started making plans to move to Zurich. More on that later, perhaps. With this project, our goal at the moment is to build a database of Austroasiatic language corpora that we can then investigate for all sorts of interesting phenomena, but focusing (initially at least) on word order. By comparing word order in multiple languages of the same family we intend to make an effort toward reconstructing the form of the parent languages from which the present-day spoken languages diverged, and also to identify language contact and interaction effects to contribute to discussions about the development of word order patterns cross-linguistically.
When you discuss doing language documentation and description, one of the first things to know is that you have to collect language data. The primary source of language data is people who speak the language you're interested in, which then begs the question of how you record the data. There are some great books and papers on doing linguistic fieldwork of a documentary nature (more than what I've linked to here), but this post is focused more on the tools you use to process your data once it is recorded, as a continuation of my 'Linguistic Tools' post. I'll also plan to write a longer post on recording audio/video in the field, but for now I'll assume that you've recorded it already. I'll just briefly say that I like using a digital SLR like the Canon Rebel along with a unidirectional mic, in conjunction with a digital audio recorder like the Zoom H4N (ideally with a lapel mic of some kind).
Once you have your data recorded, the next step is to copy it to your computer for processing. Often the digital recordings will be rather large and cumbersome, and you may want to split them into smaller files, depending on how many stories/interactions you recorded. I find post-processing is important because it means you can focus on the interaction during the recording and then during processing you make notes of all the files, their content, and other metadata that will help later when you're not in the field and can't remember all the details.
In this processing stage you also want to do two very important things:
I use two programs for converting video: Media Converter and MPEG Streamclip. You could use just MPEG Streamclip (which has a Windows version), but on a Mac I find that Media Converter is much simpler/easier for reducing the size of the file, stripping out the audio, or other purposes. MPEG Streamclip is great, though, for combining multiple clips or splitting one clip into several. In each conversion you want to ensure that the video/audio quality is not compromised, depending on what you want to use it for. In my case I am mostly doing acoustic analysis, so I'm more interested in preserving the audio at CD quality (16 bit, 44.1 khz) which is the standard for acoustic analysis and archiving. In any case, since I've backed up the raw files, I can always copy from them if I mess up my working files and need to restore the quality.
To process/convert and work with audio I use Audacity - this is primarily for processing audio, not for acoustic analysis. Audacity supports a large range of encodings and formats, and you can select portions of the sound file to do basic processing like boosting the signal, removing noise, etc. These are generally not the best things to do to an audio signal, but they can be useful. In my case, this is particularly for when I'm playing the audio back and need to hear what someone said in the background during a conversation, or do other kinds of manipulations.
I can't stress enough the importance of backing up data and copying your data files to a new (staging) folder. This really ensures that you can always rewind the clock and reset, while being confident in exploring the data itself in your working folder. This should become an important part of your workflow so that it is second nature. In some cases we will make mistakes, but understanding the importance of backing up and creating metadata for your backups will help to mitigate perhaps catastrophic events. Happy converting!
I'm a linguist and singer-songwriter. I write about life, travel, language and technology.