One of the concerns that has occupied my mind for that past few years is the question of data accessibility in the field of Linguistics. I am happy to announce that the data that underpins my grammatical description of Pnar is now freely available as a downloadable archive in audio and text form (anonymized where requested by participants). You can find the link to the dataset at the bottom of this post, but in the meantime I'd like to explain my views surrounding data access and give a brief explanation of the tool I've used to make my linguistic data accessible.
Why accessible data?
Those who are familiar with linguistics understand that traditional descriptions of language are often based on recorded, transcribed and translated interviews and stories by speakers of the language. Although some theoretical work may be based on a few utterances or a single example, most linguistic work is based on many actual examples from utterances that real speakers produce.
One issue here is that there is such between- and within-speaker variation in speech that unless the data you use is actually accessible to other linguists, one can easily question the veracity of a particular analysis. In the interest of scientific enquiry, then, it is incumbent on the analyst-linguist to make their actual data accessible to other researchers in at least some form, whether in an archive or in a database. Having the data accessible to multiple researchers may lead to disagreements about analysis (there may be more than one way of analyzing a particular linguistic structure, for example), but ultimately such disagreements are healthy because they expand our knowledge.
This touches on a larger issue in the world of science, that of verifiability and reproducibility of research, which has galvanized the larger scientific community towards Open Science (see this blog post for an explanation, and check out this OSF paper), and in some fields such as Psychology, has actually resulted in a whole journal devoted to "replication studies". These kind of studies are aimed at trying to replicate results and findings of a particular study by following the same procedure as the original researchers. When replication studies uphold a particular result, it makes it more likely that the original study's findings were not the result of a statistical anomaly or falsification of data, which is a very serious problem that can lead to erroneous claims requiring retraction. For more on this visit RetractionWatch.com.
What this means in the case of linguistic data is that the recordings, transcriptions, and translations that underlie a grammatical description or other study, whenever possible, should be made accessible to other linguists. Data sharing can be a touchy issue simply because of a) the ethical concerns of the providers of the data, b) potential cultural taboos, and c) because of the interests of the linguist who initially made and processed the data.
With proper permissions sought and precautions taken, these concerns can be minimized or dealt with appropriately. A linguist needs to (minimally) communicate to participants about how the data will be used, take the time to anonymize recordings and annotations when necessary, and create a license that constrains how the data can be used in the future. Ideally, if you are doing your research correctly, your university's Institutional Review Board will have already helped you to think through these things. There are also some excellent books, papers and chapters that deal (at least somewhat) with this subject, and there are a set of standards for social science research (with human subjects) and specifically for linguistics that researchers should be aware of.
Some reasons linguists don't share data
The final point (C, the interests of the linguist) is really the sticking point for most people. The reality is that many linguists do not want to release data for several reasons:
This brings me back to the earlier ruminations that started this post, namely that data produced by a linguist and which underpins their work ought to be accessible to other scientists and linguists. When I first submitted my PhD at NTU, I took a look at some of the options for data archiving, and I approached the university library (which keeps digital copies of all theses submitted at the university) to see if they could also store my audio and transcription data (over 1GB). About a year ago, they contacted me to let me know that they were developing such a service, something called DataVerse, and wanted to know if they could use my dataset to test it. I was happy to have them do so, and after some tweaking and time, this tool is now available for use.
DataVerse is a database/archive tool developed at Harvard University that allows researchers to store datasets that other researchers can download for use and testing. It supports the Open Science initiative by making data accessible and open. It also solves one of the problems I noted above by creating a unique url identifier and citation for the dataset. You can check out my dataset at its DOI here and download it for research and non-commercial purposes.
As I was thinking about this previously, I realized that what I wanted was not really an archive but a database that would allow me to develop and annotate my data further. Unfortunately DataVerse is not that - it is basically just a storage tool. What is nice is that it provides versioning, so the curator of the dataset can upload and publish changes. I think I may have to create my own database if I want something that will let me explore the data better. But for now, the data is freely accessible for other linguists (even though my analysis isn't perfect), which is a bit of a load off my mind.
This is my first update in a long while, as I notice that this blog hasn’t been updated since 2015. I blame life and the ease of posting short updates on Instagram, Twitter, and Facebook. Since my last post I started a new job, started dating an amazing lady and got married. No raised eyebrows please! It HAS been over a year and a half; things happen.
What this probably means is that updates are not super-likely to come daily or weekly. It’s hard to change habits, particularly when writing on a blog is not on the top of my list (being replaced with research, family, etc..). However, I am working on a few processes to write a bit more frequently, since it’s good practice and I have a lot of thoughts rattling around my brain.
In particular, I enjoy sharing what I know, whether it’s about products I find useful (like the Hanvon that I reviewed in a series of blog posts [Part 1, Part 2, Part 3]) or about processes. The processes I have been developing in the past year are related to data management and workflow, some of which I was taught, some of which I taught myself, and some of which I learned from the internet.
My initial work as a PhD student in linguistics involved Language Documentation and Description, which I hope to continue to do. Now I’m doing more Python programming and data analysis, which I find requires a somewhat different skill set, yet one informs the other. In what I hope will be a series of blog posts, I’ll try to unpack these things, in hopes that they will be useful for other people traveling similar roads.
I'm a linguist and singer-songwriter. I write about life, travel, language and technology.