Git sparse checkout

2026-03-24

I sometimes find myself wanting to work on a project involving data or code from GitHub. Many such repositories are small and easy to work with, containing mostly text files. Occasionally, however, the datasets are large and unwieldy, with model files or other binary data. In some cases just accessing the data can fail outright: when cloning one (10GB+) repository to my local workstation, my computer froze and restarted each time I tried.

While this can be due to several factors, one major culprit seems to be the size of individual files stored in the repo and the manner in which they are stored. GitHub recommends keeping large files in Git LFS (Large File Storage), but plenty of repositories don't use it.

In cases like these, if you don’t need the large files, it is useful to know how to get just the data you want from an online repository so that you can work with it locally. While you can download individual files (the process I recommend for my students when they are working on an individual corpus), there is also a way to get multiple files in the correct structure without downloading or cloning the whole repository. This involves a process called “sparse checkout”, which lets you check out individual folders or files, and the related “cone” mode, which restricts the checkout to whole folders and their contents. I illustrate this with the taggedPBC repository, showing how you can get just the scripts and data files (5MB) without downloading all the corpora (7GB).

Outline of the repository

The taggedPBC is a dataset containing a subset of parallel verses from the New Testament in over 1,940 languages of the world that have been tagged for parts of speech (word classes). Each individual corpus is stored as a CoNLLU-formatted text file so that it can be updated with appropriate annotations. The files range in size from roughly 100KB to 10MB, and the total size of the combined corpora is roughly 7GB.

The data in the repository are accompanied by Python scripts that access the corpora to derive various measures and output results, as well as documentation that outlines these processes. The scripts and derived datasheets (formatted as XLSX files) are contained in separate folders and total roughly 5MB in size.

Since all of the analyses are conducted using the derived datasheets, the corpora themselves are only needed when annotations are updated or new measures are derived. For other kinds of analyses the underlying corpora are not used, which means that in many cases you won’t need to store them on your local workstation.

Setting up a sparse checkout

Git is one of the most useful version control systems available for keeping code and other data organized. In our case, the folder structure of the taggedPBC matters if we want to run the scripts, since the scripts use modules (other scripts) stored in subfolders. We could download each of the files/folders individually, but we can also get Git to handle this for us. If you use Git and GitHub regularly, I assume you have some familiarity with the command line or terminal on your computer. The following bash commands set up a local Git repository that references the online repository without downloading any file content initially. This is essentially the “bones” of the repo.

# clone without checkout, so as not to download anything
$ git clone --filter=blob:none --no-checkout https://github.com/lingdoc/taggedPBC
# navigate to the directory
$ cd taggedPBC
# set git to sparse checkout mode
$ git config core.sparseCheckout true
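As an aside, newer versions of Git (2.25 and later) bundle this setup into the sparse-checkout subcommand itself, so the config step can also be done with a single command run inside the cloned repo:

```shell
# equivalent to `git config core.sparseCheckout true` on Git 2.25+
$ git sparse-checkout init
```

Either route leaves the repo in the same state; the subcommand just sets the config option for you.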

At this point we have a repo in sparse checkout mode with nothing checked out, and now we need to point it to the files that we actually want to download. There are a couple of options for doing this. If you know that all the files you want (and only the files you want) are located in specific folders, you can set your sparse repo to “cone” mode and pass it a list of folders:

# set repo to cone mode and add directories (`scripts` and `recipes` are folders without corpora,
# and are relatively small in size)
$ git sparse-checkout set --cone scripts recipes
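You can confirm which folders are currently included at any time; in cone mode, `git sparse-checkout list` prints the folders you have added:

```shell
# show which folders the sparse checkout currently includes
$ git sparse-checkout list
```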

If you want to make sure to get all the data you need, copying complete directories in this way is safer. But if you want to exclude certain files in subfolders (for example, some folders might contain large binary files that you don’t need), then “cone” mode is not the best option. Instead, I would recommend making a list that only includes the files you want. Ideally, these would be systematically organized by the repo owner (say, all markdown files, spreadsheet files, etc.). This allows you to find them and add them to your sparse checkout list, like so:

# non-cone mode allows you to add wildcard patterns for specific files, e.g. all markdown files
$ echo "scripts/**/*.md" >> .git/info/sparse-checkout
# add all excel spreadsheet files to sparse checkout
$ echo "scripts/**/*.xlsx" >> .git/info/sparse-checkout
# add all python scripts to sparse checkout
$ echo "scripts/**/*.py" >> .git/info/sparse-checkout
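You can review what’s on the list at any point by reading the file back (assuming, as here, that you have only written patterns to it and haven’t used cone mode, which manages this file itself):

```shell
# review the patterns currently in the sparse checkout list
$ cat .git/info/sparse-checkout
```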

Getting the data

When you are ready to download the files that you want, simply run the following command to get the ones you’ve specified.

# use the repo's default branch name here ("main" for taggedPBC; older repos may use "master")
$ git checkout main

That’s pretty much it - if you find you’re missing some files, you can add them to your sparse checkout list and check out again. I have found that this is a much better way to get data from repositories, particularly when you don’t need or want the complete repository, or lack the space/bandwidth to store or retrieve it.
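For cone-mode checkouts, there is also a dedicated command for growing the list after the fact: `git sparse-checkout add` appends folders to the existing cone and updates the working tree, without re-listing everything you passed to `set`. (The folder name `data` below is just a placeholder; substitute whatever folder you are missing.)

```shell
# in cone mode, add another folder to the checkout without starting over
$ git sparse-checkout add data
```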