One of the concerns that has occupied my mind for that past few years is the question of data accessibility in the field of Linguistics. I am happy to announce that the data that underpins my grammatical description of Pnar is now freely available as a downloadable archive in audio and text form (anonymized where requested by participants). You can find the link to the dataset at the bottom of this post, but in the meantime I'd like to explain my views surrounding data access and give a brief explanation of the tool I've used to make my linguistic data accessible.
Why accessible data?
Those who are familiar with linguistics understand that traditional descriptions of language are often based on recorded, transcribed and translated interviews and stories by speakers of the language. Although some theoretical work may be based on a few utterances or a single example, most linguistic work is based on many actual examples from utterances that real speakers produce.
One issue here is that there is such between- and within-speaker variation in speech that unless the data you use is actually accessible to other linguists, one can easily question the veracity of a particular analysis. In the interest of scientific enquiry, then, it is incumbent on the analyst-linguist to make their actual data accessible to other researchers in at least some form, whether in an archive or in a database. Having the data accessible to multiple researchers may lead to disagreements about analysis (there may be more than one way of analyzing a particular linguistic structure, for example), but ultimately such disagreements are healthy because they expand our knowledge.
This touches on a larger issue in the world of science, that of verifiability and reproducibility of research, which has galvanized the larger scientific community towards Open Science (see this blog post for an explanation, and check out this OSF paper), and in some fields such as Psychology, has actually resulted in a whole journal devoted to "replication studies". These kind of studies are aimed at trying to replicate results and findings of a particular study by following the same procedure as the original researchers. When replication studies uphold a particular result, it makes it more likely that the original study's findings were not the result of a statistical anomaly or falsification of data, which is a very serious problem that can lead to erroneous claims requiring retraction. For more on this visit RetractionWatch.com.
What this means in the case of linguistic data is that the recordings, transcriptions, and translations that underlie a grammatical description or other study, whenever possible, should be made accessible to other linguists. Data sharing can be a touchy issue simply because of a) the ethical concerns of the providers of the data, b) potential cultural taboos, and c) because of the interests of the linguist who initially made and processed the data.
With proper permissions sought and precautions taken, these concerns can be minimized or dealt with appropriately. A linguist needs to (minimally) communicate to participants about how the data will be used, take the time to anonymize recordings and annotations when necessary, and create a license that constrains how the data can be used in the future. Ideally, if you are doing your research correctly, your university's Institutional Review Board will have already helped you to think through these things. There are also some excellent books, papers and chapters that deal (at least somewhat) with this subject, and there are a set of standards for social science research (with human subjects) and specifically for linguistics that researchers should be aware of.
Some reasons linguists don't share data
The final point (C, the interests of the linguist) is really the sticking point for most people. The reality is that many linguists do not want to release data for several reasons:
This brings me back to the earlier ruminations that started this post, namely that data produced by a linguist and which underpins their work ought to be accessible to other scientists and linguists. When I first submitted my PhD at NTU, I took a look at some of the options for data archiving, and I approached the university library (which keeps digital copies of all theses submitted at the university) to see if they could also store my audio and transcription data (over 1GB). About a year ago, they contacted me to let me know that they were developing such a service, something called DataVerse, and wanted to know if they could use my dataset to test it. I was happy to have them do so, and after some tweaking and time, this tool is now available for use.
DataVerse is a database/archive tool developed at Harvard University that allows researchers to store datasets that other researchers can download for use and testing. It supports the Open Science initiative by making data accessible and open. It also solves one of the problems I noted above by creating a unique url identifier and citation for the dataset. You can check out my dataset at its DOI here and download it for research and non-commercial purposes.
As I was thinking about this previously, I realized that what I wanted was not really an archive but a database that would allow me to develop and annotate my data further. Unfortunately DataVerse is not that - it is basically just a storage tool. What is nice is that it provides versioning, so the curator of the dataset can upload and publish changes. I think I may have to create my own database if I want something that will let me explore the data better. But for now, the data is freely accessible for other linguists (even though my analysis isn't perfect), which is a bit of a load off my mind.
In a previous post I discussed some of the benefits I discovered in using LaTeX with LyX as a front-end. Another extremely useful tool to learn how to use is a Bibliography manager. If you are like I was, and often develop a new bibliography for each paper you write, this is something you might want to consider. On the other hand, you might be used to Microsoft applications and already be familiar with bibliography managers (such as Zotero) that integrate pretty well with the MS Office family.
As I started my PhD and began writing more papers, I noticed that many of my citations were the same. Rather than copy-pasting from previous papers and then adjusting the formatting for each submission, I realized it would be much easier to have a centralized location for all the papers I wanted to cite, and have the computer deal with the formatting according to the style sheet I needed to use.
Fortunately, support for this is ‘baked-in’ to LaTeX/LyX. There are various good tools that integrate with BibTeX (as the bibliography manager in LaTeX is called), but the one that I found to be most useful for my purposes is BibDesk. Rather than explain how I do it, I’ll point you to this excellent tutorial for Mac, which describes how to set it up. In the rest of this post I’ll simply give my reasons for using BibDesk.
Each of these reasons alone are worth getting your act together and creating a single bibliography repository. You can likely think of other good reasons, which just means there is no excuse to not do so.
Another issue that I am thinking about is how to make my library of citations/documents available on any computer with internet access via the cloud. This would ensure at the very least that I wouldn't worry as much about losing it if my computer dies (though I'm still going to make backups regularly). A fellow academic and friend of mine has managed to integrate all of his citations and PDFs with Zotero, and make it available on his phone in an app like Google Drive, so that at conferences he can remember a publication or search for one in a conversation and pull up the reference and/or associated document to show people. This is super cool and super useful - I’ll write about it if I can figure out how to pull it off on my own, or maybe I’ll get him to write about it.
I'm a linguist and singer-songwriter. I write about life, travel, language and technology.