When you discuss doing language documentation and description, one of the first things to know is that you have to collect language data. The primary source of language data is people who speak the language you're interested in, which then raises the question of how you record the data. There are some great books and papers on doing linguistic fieldwork of a documentary nature (more than what I've linked to here), but this post is focused more on the tools you use to process your data once it is recorded, as a continuation of my 'Linguistic Tools' post. I also plan to write a longer post on recording audio/video in the field, but for now I'll assume that you've recorded it already. I'll just briefly say that I like using a digital SLR like the Canon Rebel along with a unidirectional mic, in conjunction with a digital audio recorder like the Zoom H4N (ideally with a lapel mic of some kind).
Once you have your data recorded, the next step is to copy it to your computer for processing. Often the digital recordings will be rather large and cumbersome, and you may want to split them into smaller files, depending on how many stories/interactions you recorded. I find post-processing is important because it means you can focus on the interaction during the recording and then during processing you make notes of all the files, their content, and other metadata that will help later when you're not in the field and can't remember all the details.
In this processing stage you also want to do two very important things:
I use two programs for converting video: Media Converter and MPEG Streamclip. You could use just MPEG Streamclip (which has a Windows version), but on a Mac I find that Media Converter is much simpler/easier for reducing the size of the file, stripping out the audio, or other purposes. MPEG Streamclip is great, though, for combining multiple clips or splitting one clip into several. In each conversion you want to ensure that the video/audio quality is not compromised, depending on what you want to use it for. In my case I am mostly doing acoustic analysis, so I'm more interested in preserving the audio at CD quality (16-bit, 44.1 kHz), which is the standard for acoustic analysis and archiving. In any case, since I've backed up the raw files, I can always copy from them if I mess up my working files and need to restore the quality.
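If you want to double-check that a converted file really is 16-bit, 44.1 kHz, a few lines of Python can do it. This is only a minimal sketch using Python's built-in `wave` module, and it assumes the file is an uncompressed PCM WAV; the function names `describe_wav` and `is_cd_quality` are my own for illustration and are not part of any of the programs mentioned above.

```python
import wave


def describe_wav(path):
    """Return (channels, bit_depth, sample_rate_hz) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        # getsampwidth() reports bytes per sample, so multiply by 8 for bits.
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def is_cd_quality(path):
    """True if the file is 16-bit audio sampled at 44.1 kHz."""
    _, bit_depth, sample_rate = describe_wav(path)
    return bit_depth == 16 and sample_rate == 44100
```

Running `is_cd_quality` over the working files after a batch conversion is a quick way to catch a file that was accidentally downsampled.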
To process/convert and work with audio I use Audacity - this is primarily for processing audio, not for acoustic analysis. Audacity supports a large range of encodings and formats, and you can select portions of the sound file to do basic processing like boosting the signal, removing noise, etc. These are generally not the best things to do to an audio signal, but they can be useful. In my case, this is particularly for when I'm playing the audio back and need to hear what someone said in the background during a conversation, or do other kinds of manipulations.
I can't stress enough the importance of backing up data and copying your data files to a new (staging) folder. This really ensures that you can always rewind the clock and reset, while being confident in exploring the data itself in your working folder. This should become an important part of your workflow so that it is second nature. In some cases we will make mistakes, but understanding the importance of backing up and creating metadata for your backups will help to mitigate perhaps catastrophic events. Happy converting!
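For anyone who would rather script the copy-to-staging step than do it by hand, here is one way it might look in Python. It is a sketch under my own assumptions, not a description of any particular tool: the names `stage_copies` and `sha256_of` are made up for this example, and the staging folder is assumed to sit outside the raw folder. Each file is copied and then verified with a SHA-256 checksum, and the returned dictionary can be saved as simple metadata for the backup.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_copies(raw_dir, staging_dir):
    """Copy every file under raw_dir into staging_dir and verify each copy.

    Returns a dict mapping each relative file path to its checksum.
    The raw files are only read, never modified.
    """
    raw_dir, staging_dir = Path(raw_dir), Path(staging_dir)
    manifest = {}
    for source in sorted(raw_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(raw_dir)
        target = staging_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also keeps file timestamps
        checksum = sha256_of(source)
        if sha256_of(target) != checksum:
            raise IOError(f"Copy of {relative} does not match the original")
        manifest[relative.as_posix()] = checksum
    return manifest
```

Comparing a saved manifest against fresh checksums later on is also a cheap way to confirm that the raw backup has not been altered.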
When I started my PhD program in Linguistics (language documentation and description), I had some experience with linguistic analysis, but not to the degree that I had to learn in order to complete my PhD. I had tuned my ear to be able to hear the sounds of the IPA, and had practice transcribing and learning a range of languages, but I had never analyzed an unwritten language completely by myself. During the course of my PhD I learned much more about how to analyze languages 'from the ground up', so to speak.
Along the way, I discovered that there were some excellent tools that made me much more effective and efficient at the task of documenting and describing an unwritten language. I was fortunate that I already had a good foundation in recording and processing audio from my experiences recording, mixing, and releasing my music, so the fact that the audio data I recorded would form the basis of my analysis didn't faze me. However, there was a whole other set of tools that would allow me to investigate the details of the language I planned to work on.
Each of these programs is open source or free, though some are developed for Windows and others are developed for macOS, which might be a problem for some people. Since I grew up with DOS and Windows but then later switched to a Mac, I'm comfortable with both systems. The Apple/Mac laptop build quality was my first choice for travel and portability combined with power. I say 'was' since some of Apple's recent design choices mean I might be switching back to Windows on my next laptop. But for now I run an old Windows version on my Mac via VirtualBox or bundle Windows software in a Wine port so I can run it as a native app in macOS.
I plan to describe each of these tools in more detail in future posts, but for now here's a list of the tools I currently use for my linguistic work:
Tools other linguists use, but that I don't use much:
Just a quick blog post to mention that one of the tools I use in language documentation and description, Transcriber, is newly repackaged for use with OS X El Capitan! This is a big deal because previous versions (from 2013) failed to work, then the program was supposedly 'updated' (and didn't work), so I've been using the 2005 Windows version in a virtual box. But I just tested the new release (new as of 4 hours ago) and it works great on my Mac (just have to update the settings to default to UTF-8 for character encoding) and also with my trs2txt converter for Toolbox! Happy transcribing!
This is my first update in a long while, as I notice that this blog hasn’t been updated since 2015. I blame life and the ease of posting short updates on Instagram, Twitter, and Facebook. Since my last post I started a new job, started dating an amazing lady and got married. No raised eyebrows please! It HAS been over a year and a half; things happen.
What this probably means is that updates are not super-likely to come daily or weekly. It’s hard to change habits, particularly when writing on a blog is not at the top of my list (being replaced with research, family, etc.). However, I am working on a few processes to write a bit more frequently, since it’s good practice and I have a lot of thoughts rattling around my brain.
In particular, I enjoy sharing what I know, whether it’s about products I find useful (like the Hanvon that I reviewed in a series of blog posts [Part 1, Part 2, Part 3]) or about processes. The processes I have been developing in the past year are related to data management and workflow, some of which I was taught, some of which I taught myself, and some of which I learned from the internet.
My initial work as a PhD student in linguistics involved Language Documentation and Description, which I hope to continue to do. Now I’m doing more Python programming and data analysis, which I find requires a somewhat different skill set, yet one informs the other. In what I hope will be a series of blog posts, I’ll try to unpack these things, in hopes that they will be useful for other people traveling similar roads.
Well, now that I've submitted my thesis I have been happily doing things other than writing. Among them: moving house (I'm staying with a friend for a couple months), planning adventures, hanging out with friends, and applying for jobs. It's been good to catch up with people and not have something hanging over my head any longer.
Last week I also tidied up my desk and tied up some loose ends, one of which was a Praat script. As I have mentioned before, online communities are a fabulous place to learn about various things, and one that is particularly active with regard to Praat can be found here on Yahoo. Praat is a free program to visualize and manipulate sound files (particularly for linguistic purposes), and scripts that automate various features can be extremely useful. I tried to write one that would plot a series of vowels with an ellipse encompassing each vowel space and a single character for the vowel in the center of each ellipse, but I was having trouble. José was able to rewrite the script for a cleaner and better result, for which I am extremely grateful.
For other linguists interested in a script that plots a series of vowels (F1, F2) from a CSV file, check out the link above or the file below.
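To give a rough idea of the first step such a script has to do, here is a small Python sketch (not José's Praat script, and not a replacement for it) that reads formant measurements from a CSV file and works out the mean F1 and F2 for each vowel, which is where the vowel's label would sit on the plot. The column names `vowel`, `F1`, and `F2` and the function name `vowel_centers` are assumptions for this example; your own CSV may be laid out differently.

```python
import csv
from collections import defaultdict
from statistics import mean


def vowel_centers(csv_path):
    """Return {vowel: (mean_F1, mean_F2)} from a CSV of formant measurements.

    Assumes a header row with the columns 'vowel', 'F1', and 'F2',
    with formant values in Hz.
    """
    measurements = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            point = (float(row["F1"]), float(row["F2"]))
            measurements[row["vowel"]].append(point)
    return {
        vowel: (mean(f1 for f1, _ in points), mean(f2 for _, f2 in points))
        for vowel, points in measurements.items()
    }
```

When plotting the result, vowel charts conventionally put F2 on the horizontal axis and F1 on the vertical axis, both reversed, so that the layout roughly mirrors the position of the tongue.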
Travel plans are coming together nicely, so I'll be updating a bit more often in the coming weeks, unless I get too busy having fun!
I finally submitted my PhD thesis last week. It has been a crazy final month. The reason I had that deadline was that here in Singapore (at least at NTU) all PhD students are given 4 years of funding in which to research and write their thesis. If your four years end before your thesis is completed, you are required to pay tuition for the remaining semesters that you continue writing.
Additionally, part of the requirement for the linguistics division (I think also for some of the other divisions in Humanities) is that you have to take 6 courses, which I did in my first year. So I only started doing research on Pnar 3 years ago, which is not much time in which to write a grammatical description of a language. Despite that fact, I managed to write a grammatical description of Pnar that totals about 550 pages of description and 100 pages of texts and appendices. So I'm pretty proud of that - though the description still needs a lot of work to clarify analyses and fix mistakes.
I used LyX/LaTeX to write and format the thesis, and I thought it might be useful to other students at NTU or those who use LyX in general if I included some of the files that I created for formatting purposes.
So here is the LyX file that you could use to write your thesis: NTU_PhD_Thesis_template.lyx
And this is the PDF that it can generate: NTU_PhD_Thesis_template.pdf
Of course this doesn't include my actual PhD, but that's partly because it's under examination. The next step in the process is to wait for comments from the examiners and then work furiously to incorporate them into the document, before I can resubmit and hopefully set a date for defense and confirmation of the actual degree. At least at the moment I'm feeling much more free!
At the Himalayan Languages Symposium last week I gave a talk about deixis. This grammatical feature is essentially 'pointing', and words or morphemes in language can point to various things, so grammarians often talk about person or distance-based deixis, social deixis, and temporal deixis.
Distance-based deixis is often encoded in words called 'demonstratives', social deixis in 'honorifics' like "sir", "ma'am", etc., and temporal deixis is encoded in tense markers. Deixis is actually more complex, though, as deictic morphemes can really point to any point in the communication space, as illustrated in the diagram on the right from Gerner (2009).
Since deixis is such a large topic, my 20-minute talk focused on the way distance-based deixis is encoded in Pnar and in related languages through demonstratives, specifically words that identify the location of nouns in space, relative to a deictic center. Most languages have at least a 2-way contrast (like English "this" and "that"), and rarely more than three. I began to be interested in this feature since in Pnar there is a 5-way contrast in demonstratives and some of the forms resemble similar words in neighboring Tibeto-Burman languages (a completely different language family). Just to illustrate, below on the left are the spatial deictic morphemes in Pnar (the black circle in the middle represents the 'deictic center', which in this case is the person who is speaking), and on the right are the words in some examples of noun phrases in Pnar. You will notice that demonstratives in Pnar are a combination of deictic markers with gender proclitics that identify the noun that the demonstratives are pointing to.
At this point there are a lot of other things I could discuss, but the post is getting a bit long. So I think I'll pause here and my next post will be about the features of demonstratives in neighboring languages. At least now you have a better idea of what deixis is, and how languages can differ significantly in terms of what they can encode in a spatial deictic system.
Gerner, Matthias. 2009. Deictic features of demonstratives: a typological survey with special reference to the Miao group. The Canadian Journal of Linguistics/La revue canadienne de linguistique 54:43–90.
I mentioned that Dr. Anvita Abbi gave a great talk at the Himalayan Languages Symposium on her work on Great Andaman in the Indian Ocean. Here's a map just to show you where that is. [Image credit: Barefoot Holidays]
It's a pretty remote area. In fact, the Nicobar Islands to the south are completely closed to outsiders. When you consider that the speakers of Great Andamanese are down to a single location and the community is switching to Hindi and English as a means of communication, the closed nature of the Nicobar Islands seems somewhat justified. Great Andamanese is actually 10 languages, of which 4 were documented by Dr. Abbi and are spoken by only a handful of speakers. I'll let you check out more about that on this site.
One of the reasons it was fascinating to hear about is the highly developed gender system, which is based on a conceptualization of the world in relation to the human body. Generally, the kind of gender in languages that people are familiar with is that found in Romance languages, where nouns are marked as masculine or feminine, and verbs agree with nouns so that you know which noun is 'controlling' the action (it's more complicated than that of course, but this is just to illustrate a point).
However, gender is simply a noun class system, and nouns can have as many classes as a language (or speakers of a language) find(s) useful. So German has three noun classes (masculine, feminine, neuter), and Bantu languages have a ton (help me out Bantu language experts), and other languages have noun classes based on living things, non-living things, plants, humans, tools, certain kinds of animals, etc.
What is interesting about Great Andamanese is that the same class markers are used on both nouns and verbs in a highly productive way (meaning that they seem to apply in all sorts of ways to both verbs and nouns). These noun class markers identify actions (such as going and coming) as related to one of 7 or so body part prefixes (which also classify nouns), depending on whether the action is conceptualized as relating to the mouth (being ingested, digested, etc., i.e. thinking or being beautiful) or as moving in a certain manner (feet), and there are conceptualizations related to all sorts of body parts. Unfortunately I don't have all my notes with me, as I just flew to the US for my brother's wedding, but it's really interesting to think of how this language connects (or doesn't connect) to languages in Southeast Asia and Africa. Read up more on this fascinating system here, and check out Dr. Abbi's new grammar of Great Andamanese, recently published by Brill.
Yesterday afternoon I gave a talk at the Himalayan Languages Symposium, which was held this year at NTU. It's the 20th meeting, and has generally focused on languages of the Himalayan region, which is a pretty broad area when you consider that the Himalayan range stretches from Pakistan to Burma. That's a heck of a lot of languages.
It was a really great conference, thanks to clear papers and engagement on a variety of topics. Phonetics and phonology of individual languages, historical reconstruction, ancient Tibetan, theoretical implications of marking patterns, field reports, typological surveys, Nepali Sign Language, child language acquisition, and sociolinguistic studies were only some of the areas covered in the talks. One of the most interesting to me was a report by Anvita Abbi on the languages of Great Andaman, an island in the Andaman-Nicobar chain. These languages are an isolated group that remains unclassified and is in danger of extinction. I'll have to write a separate blog post to explain my fascination.
My talk was on deictic demonstratives in Pnar and the neighboring languages of northeast India. Look for a follow-up post in the next couple days that explains a bit more. For now, I'll just say that it was a great conference and it's back to the thesis in the coming week.
Image Credit: ICIMOD
Last week I had a Eureka!* moment. I love these moments - when you've been trying to figure out a problem (could be big, could be small) and it is frustrating you to no end, and then finally you break through and find the solution! It's pretty amazing.
This Eureka! moment had to do with the linguistic examples I wrote about earlier. They weren't formatting properly, and because of this some of the examples were splitting across pages. Pretty early on in my attempts, I posted on a forum devoted to LyX/LaTeX/TeX, the typesetting program I use. Forums are pretty nifty ways to aggregate knowledge, and I've learned a ton about LaTeX through this particular forum. If you have a specialized industry or tool and you haven't found a forum where people can help each other out, find one quick or make one yourself. It is totally worth it.
Unfortunately, with this particular issue no one was able to help. So I kept troubleshooting, trial and error. Eventually one of the things I tried worked! So satisfying. I imagine this is what I'll feel once I finally submit my PhD thesis... though people tell me a grammatical description is never complete, even if it's over 1,000 pages.
*As I remember, and according to Wikipedia, "Eureka" comes from the Ancient Greek word εὕρηκα heúrēka, meaning "I have found (it)" and is attributed to Archimedes, who discovered how the volume of objects could be measured by water displacement.
I'm a linguist and singer-songwriter. I write about life, travel, language and technology.