The last post was a bit of a brain dump to make sure I didn't forget a few lessons I learned, in part because I knew I was quitting the job that involved doing ML-type things. While I was working there I of course learned a lot and (I think) acquitted myself pretty well, but language processing and machine learning are not really what I spent 4 years doing for my PhD. Python is a programming language that I picked up to make my work in grammatical description and syntax easier, and while I find ML (and programming) pretty interesting, my main interest lies in understanding how languages work through comparison, with the ultimate goal of reconstructing linguistic structures and (hopefully) prehistory.
A year and a half ago or so I started working on a grant proposal for that exact thing with some researchers at the University of Zurich. This is a relatively young department that is doing some really cool research in typology, processing, and language acquisition from a corpus-based perspective on multiple languages (both Indo-European and non-IE families/phyla). At the same time historical linguistics is a huge focus in the department, as is modeling language change. This is super exciting because I take the perspective that language is spoken by individuals in communities who acquire it from their forebears (history) and use it as a tool for communication (processing), and that this use gives rise to statistical tendencies that all languages share (typology). Since it is individuals using language, each does so in an idiosyncratic way, but since language is learned and guided by principles of processing, the only way to get at both the commonality and the uniqueness of language is by investigating actual language corpora (recordings, transcriptions, etc.). Of course the story of how languages change is much more complex and involves many more factors, but be that as it may, this is a great place to be.
Picture: Lake Zurich from the hill above the university
So, long story short, we found out last October that the grant had been funded, and the family and I started making plans to move to Zurich. More on that later, perhaps. With this project, our goal at the moment is to build a database of Austroasiatic language corpora that we can then investigate for all sorts of interesting phenomena, but focusing (initially at least) on word order. By comparing word order in multiple languages of the same family we intend to make an effort toward reconstructing the form of the parent languages from which the present-day spoken languages diverged, and also to identify language contact and interaction effects to contribute to discussions about the development of word order patterns cross-linguistically.
Just a quick blog post to mention that one of the tools I use in language documentation and description, Transcriber, is newly repackaged for use with OS X El Capitan! This is a big deal because previous versions (from 2013) failed to work, then the program was supposedly 'updated' (and didn't work), so I've been using the 2005 Windows version in a virtual box. But I just tested the new release (new as of 4 hours ago) and it works great on my Mac (just have to update the settings to default to UTF-8 for character encoding) and also with my trs2txt converter for Toolbox! Happy transcribing!
At the Himalayan Languages Symposium last week I gave a talk about deixis. This grammatical feature is essentially 'pointing', and words or morphemes in language can point to various things, so grammarians often talk about person or distance-based deixis, social deixis, and temporal deixis.
Distance-based deixis is often encoded in words called 'demonstratives', social deixis in 'honorifics' like "sir", "ma'am", etc., and temporal deixis in tense markers. Deixis is actually more complex, though, as deictic morphemes can really point to any point in the communication space, as illustrated in the diagram on the right from Gerner (2009).
Since deixis is such a large topic, my 20-minute talk focused on the way distance-based deixis is encoded in Pnar and in related languages through demonstratives, specifically words that identify the location of nouns in space, relative to a deictic center. Most languages have at least a 2-way contrast (like English "this" and "that"), and rarely more than three. I became interested in this feature because Pnar has a 5-way contrast in demonstratives and some of the forms resemble similar words in neighboring Tibeto-Burman languages (a completely different language family). Just to illustrate, below on the left are the spatial deictic morphemes in Pnar (the black circle in the middle represents the 'deictic center', which in this case is the person who is speaking), and on the right are the words in some examples of noun phrases in Pnar. You will notice that demonstratives in Pnar are a combination of deictic markers with gender proclitics that identify the noun that the demonstratives are pointing to.
At this point there are a lot of other things I could discuss, but the post is getting a bit long. So I think I'll pause here and my next post will be about the features of demonstratives in neighboring languages. At least now you have a better idea of what deixis is, and how languages can differ significantly in terms of what they can encode in a spatial deictic system.
Gerner, Matthias. 2009. Deictic features of demonstratives: a typological survey with special reference to the Miao group. The Canadian Journal of Linguistics/La revue canadienne de linguistique 54:43–90.
Yesterday afternoon I gave a talk at the Himalayan Languages Symposium, which was held this year at NTU. It's the 20th meeting, and has generally focused on languages of the Himalayan region, which is a pretty broad area when you consider that the Himalayan range stretches from Pakistan to Burma. That's a heck of a lot of languages.
It was a really great conference, thanks to clear papers and engagement on a variety of topics. Phonetics and phonology of individual languages, historical reconstruction, ancient Tibetan, theoretical implications of marking patterns, field reports, typological surveys, Nepali Sign Language, child language acquisition, and sociolinguistic studies were only some of the areas covered in the talks. One of the most interesting to me was a report by Anvita Abbi on the languages of Great Andaman, an island in the Andaman-Nicobar chain. These languages are an isolated group that remains unclassified and is in danger of extinction. I'll have to write a separate blog post to explain my fascination.
My talk was on deictic demonstratives in Pnar and the neighboring languages of northeast India. Look for a follow-up post in the next couple days that explains a bit more. For now, I'll just say that it was a great conference and it's back to the thesis in the coming week.
Image Credit: ICIMOD
I realize that some of my posts haven't been as clear as they could be. Specifically, I talked a lot about interlinearized texts, but what does that actually mean? Well, the thing about language is that when you are talking about specific aspects of language, it's helpful if the reader actually knows what you're talking about. Thus, examples are useful. When you're discussing an unwritten language, this has to be taken to a whole new level.
When I'm discussing examples in Pnar, I need four levels of representation, as in the example below. On the left the numbered lines represent the local orthography (line 1), the phonetic/phonemic representation using IPA (2), the word-for-word translation or English gloss (3), and the free translation that actually tells you the English meaning (4).
So on the left we have the four levels of representation, but you notice that the items on each line don't quite match up. This can be confusing, particularly if you're dealing with long examples. Interlinearization allows each element to correspond to one in the following line.
One way linguists do this is by creating tables, which have to be individually edited for each example. This is what you have to do in MS Word, unfortunately. Another way is to use a typesetting program called LaTeX, which is how I produced the nicely formatted example on the right. Another convention is to have the local writing system be italicized and non-interlinearized.
Notice that the glosses on the third line are not always exact translation equivalents; sometimes they are grammatical abbreviations for function words. Here, 'ALL' is an abbreviation for 'allative', which is a traditional term for a marker on nouns that indicates that the noun is a 'goal', or what another noun is moving towards.
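For anyone who prefers to see the idea in code, here is a minimal Python sketch of what interlinearization does: it pads each word so that it sits directly above its gloss. This is only an illustration, not how LaTeX or any linguistics tool actually does it. The function name `interlinearize` and the placeholder tokens are my own inventions for this example (they are not real Pnar data), and I'm assuming each line has already been split into the same number of words.

```python
def interlinearize(lines, free_translation=None, gap=2):
    """Align parallel lists of tokens column by column.

    `lines` is a list of token lists (e.g. orthography, IPA, gloss).
    Hypothetical helper for illustration; assumes every line has the
    same number of tokens.
    """
    n = len(lines[0])
    if any(len(line) != n for line in lines):
        raise ValueError("every line must have the same number of tokens")

    # Each column is as wide as its longest token.
    # Note: len() counts code points, so IPA with combining
    # diacritics may not line up perfectly in every font.
    widths = [max(len(line[i]) for line in lines) for i in range(n)]

    out = []
    for line in lines:
        cells = [tok.ljust(w) for tok, w in zip(line, widths)]
        out.append((" " * gap).join(cells).rstrip())

    # The free translation is not aligned; it goes on its own line.
    if free_translation is not None:
        out.append(f"'{free_translation}'")
    return "\n".join(out)


if __name__ == "__main__":
    # Placeholder tokens only, NOT real Pnar words.
    orthography = ["word1", "w2", "wordthree"]
    gloss = ["1SG", "go", "ALL=market"]
    print(interlinearize([orthography, gloss], "I go to the market."))
```

The design choice here is simply to treat each example as a small table, one column per word, which is the same thing you would otherwise build by hand in a word processor.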
Hopefully that clears things up a bit. To read more about interlinearized linguistic examples, this Wikipedia page should help.
I recently had the opportunity to attempt to condense my PhD thesis topic into 3 minutes as part of the 3-minute thesis challenge here at NTU. I made it into the final where I competed with other speakers for top prizes. I didn't win, but I had a lot of fun and enjoyed learning about some of the other great research being done here at NTU. Fortunately they took video of the presentations, so you can watch on YouTube (or below).
I'm a linguist and singer-songwriter. I write about life, travel, language and technology.