Backing Up with Git and the Cloud
My next post was intended to be about streamlining a workflow by using various LaTeX authoring tools and a bibliography manager, but last week my Mac died. It is an older computer admittedly (from 2011), but it just had the graphics/motherboard replaced a year and a half ago due to a recall and I had high hopes that it would last at least until the end of the year. Unfortunately that didn’t happen - while at a conference over the weekend it gave up the ghost, leaving me with just a USB drive that had my presentation, and the last-minute changes to a Praat script that I had managed to upload to GitHub the night before.
Fortunately I have been saving up towards a new computer, since I knew this would probably happen. It’s had a good run, but when I got back to Singapore, replacing the mother/logic board turned out not to be worth it - I’d be better off putting the money I would spend fixing it towards a new computer.
What I find remarkable, however, is that I literally lost nothing except some emails that are stored on the email server anyhow, and maybe some intermediate changes to the Praat script. This is partly because I saw that the computer was starting to fail and backed up all my files with Time Machine on an external hard drive, but partly also because my workflow lends itself to backing up.
As per my previous post regarding Git, it is important to have a workflow that encourages you to snapshot changes. This is what the Git “commit” command does - fail to integrate it into your Git workflow at your peril. Another way to do ensure that your files are backed up is to use online cloud storage like Google Drive and Dropbox. Most likely, if you’re an academic you probably use both these services, but maybe your backup system is a bit ad-hoc. In my case, the death of my computer has suggested to me that I need to use multiple cloud storage services.
So here’s what I am currently doing using a temporary user account on my wife’s computer:
The main reason this works for me is that I generally don’t operate with Dropbox and Google Drive on - I only start these services manually when I want to sync with the online server. In some of my online research, it seems that there MAY be issues with running a Git repo or editing files when cloud storage is syncing, so I get around this potential issue by simply doing my offline edits, and then syncing cloud storage manually.
So as part of my workflow:
As I go, I am also discovering more files that need to be in a cloud server to avoid data loss. This is mitigated to some extent by regular Time Machine backups, but I’m beginning to wonder if I also need to invest in expanded cloud storage. The only other drawback of this process is that since I’m backing up in multiple cloud services, the storage takes up twice as much space on my hard drive as it otherwise would. I’m still thinking about how to deal with that issue…
Using Git for Linguistics
Recently, I started to explore using Git to maintain and organize my data. This includes both the primary data I work from as a linguist and the various kinds of data I produce in the form of written material (books, papers, etc..). As a field linguist, I primarily work with text that is transcribed from audio and video recordings. Early on I developed an archiving system to preserve the original files in their raw form as much as possible, but I didn't develop a similar system for maintaining my working files, and ended up with lots of duplicates of files that I had copied various places for various purposes.
In this fifth post on Linguistic tools (the overview of which is here), I plan to describe the way I am trying to overcome this issue and become more organized by adopting a version control system. Mostly, though, I'll just try to explain what Git is and what I use it for. This is by no means a complete tutorial, since other people have done that better, but hopefully it will provide some direction for others who are interested in streamlining and organizing their workflow.
When I started learning to program in Python and was introduced to Git, I saw the benefits for coding, but began to wonder if it was possible/useful to use it for other organizational tasks in relation to other kinds of writing and data structures. There are a few blog posts about using git for paper writing here and here and here, but they are written primarily for coders who also write papers (I'm looking at you, academic comp-sci folks!) and don't translate super well for my use case.
What is Git?
First, what is Git? Git is a version control system, which means it tracks changes in a repository (or 'repo'), basically serving as a series of backups for your work. It also has tools that allow you to make a copy (or 'branch'), work on it separately, and then when you're ready, you can see the changes made to individual files and 'merge' them together into a new version. For a simple tutorial, see here, and here is a great visual walkthrough.
Most explanations of Git use the example of a tree or a river, which do help to understand the process, but for total newbies (and possibly to address the difficulty of finding a good metaphor) I came up with the following metaphor, and find it to be a bit more useful.
The Git Metaphor
Imagine you have a drawing project. You have an empty piece of paper - this is your initial blank Git repository. Now you draw a picture on the piece of paper and decide it's an ok version - this is a snapshot that is your first 'commit'. You then decide you want to make some adjustments to the drawing, but you want to preserve the initial drawing, so you take out another piece of paper, lay it on top of the first, trace out the shape in pencil, and then begin to modify the edges of the drawing or the background by erasing or adding new bits. When you're happy with that you 'commit' it, and then every time you modify it, you repeat the process of first tracing it out on a piece of paper. Each of these pieces of paper stack, so if you want to go back to an earlier drawing, you can, without losing any of the other drawings.
Now let's say your friend wants to work on the same drawing project. You can give him your latest drawing, he can copy it and work on it in the same way. Then when he's done he can give it back to you and you can choose which bits of his version you want to incorporate into your current version. This is super useful for projects with lots of collaborators, especially if you're writing code, which are basically text files and relatively easy to merge based on differences in lines.
[Note that this isn't exactly what the Git system does - it doesn't actually create a NEW version of the repository, like the tracing paper, but instead uses the most recently committed version as the new base/foundation for future modifications.]
The most important thing I've had to keep in mind is to write myself detailed commit messages, and to figure out a good protocol or format for these messages. This is because once you've committed something, you can go back and search through your commits (save states), but this is only easy IF YOU KNOW WHAT THE COMMIT CONTAINS!!! Using the full filename of the committed file, for example, is much better than writing out 'New version of Paper'. Giving more information, such as about what you changed in the file, is even better.
My Use Case
My use case, however (as opposed to collaborating on code with a large team), is for a single repository of data in which I will occasionally share a single paper with collaborators. What I want is a bit more like a backup system (think Apple's Time Machine or the versioning system in Mac OS) but with a lot more control. Git, for example, works on the command line and requires you to provide messages for each commit you make, which also requires you, the user, to be clear about what changes are and why they were made. This (ideally) means less clutter and assists with organization.
In thinking about my needs and doing research, I discovered a drawback of version control systems like Git. If I have a single repository but am working on multiple papers at once (like I often do), guess what happens when I save changes on document A and then checkout a previous version of paper B? Yup, everything reverts to the state it was before, and I 'lose' all the changes I just made to A. I put 'lose' in quotes here because I don't actually lose anything, as long as I had committed it properly, but I'd then have to checkout the commit, copy the file, and then revert... anyway it's annoying.
What I need
So what I REALLY need is a single Git repository with the ability to track individual folders/files separately with their own version histories (like embedding repositories). Git can do this in several ways, none of which are particularly intuitive: submodules, subtrees, subrepo, and simple embedding. The first two are designed for somewhat different use cases than mine, namely using (part of) someone else's code in your own project. They're not really meant to create sub-repositories in your local repo to track separate histories. The third, subrepo, looks quite promising but requires that you run everything from the root of your main repo, and since I have lots of organizational subfolders, typing the names every time just gets tedious. So, I use the strategy of simple embedding.
By simple embedding, I refer to navigating to a subfolder within my main repo that I want to track and running the command 'git init'. This does two things:
This is probably a hack, and not the way Git was intended to be used, but it works for me. The only problem I foresee happening is making too many Git repositories, and not being able to keep track of the multiplicity. So I plan to use this sparingly. For the most part, I'm ok with committing changes across the whole repository, since one of the real benefits of using a VCS is the freedom to throw things away.
Some other great posts on version control systems include this one on the benefits of version control, this one on several different version control systems for writers, this post on the limitations of Github for writers - the last of a 6-part series worth reading if you are debating whether to start on the Git/version control journey. And finally, this one is about one person's suggestions for academic writing and Git.
In any case, Git or another VCS is definitely worth checking out. I also recently discovered that Git repos are supported 'out of the box' by LyX, my go-to LaTeX editor! More on that to come... I also recently discovered SRC (Simple Revision Control), which focuses on tracking revisions on single files. This might actually be what I need. I'm still exploring this, and it might not be necessary since my LaTeX editor supports Git. We'll see.
I'm a linguist and singer-songwriter. I write about life, travel, language and technology.