Corpus

2014-04-02

For the past week or so I’ve been working on getting my corpus of transcribed and translated texts into shape. The word “corpus” comes from the Latin word meaning “body”, and in this case refers to a “body” of text that I’ve built over the past few years. This is the recorded data that my analysis of Pnar is based on. I’ve recorded over 12 hours of conversations, stories, and interviews, but only 8 hours and 6 minutes have been transcribed.

I say ‘only’, but that’s around 75,000 words, which has been time-consuming work over the past couple of years. Just to give an idea, transcribing 5 minutes of recordings took me about one hour when I first began, including time for analysis with the assistance of a native speaker. This corpus is just beginning to scratch the surface of describing the language. Fortunately, the recordings are quite diverse in terms of subject matter and speakers, so the corpus gives a good initial picture of the language.

When I started out learning Pnar, I began with recordings of speakers describing what was happening in The Pear Film, a short silent film developed by Wallace Chafe. Following the suggestion of my advisor Alec Coupe, I had a speaker watch it and tell others (who couldn’t see the film) what was going on. This gave me a very clear context in which to analyze language data.

As I have worked on more recorded texts since that initial analysis, my understanding of the language has grown. I’m now finding that some of my original thoughts need adjustment, and that some of the original glosses and translations should change. So now I have to go back through and revise the whole set of 75,000 words. It’s time-consuming, but fortunately not as bad as it used to be.