
Data on GitHub: The easy way to make your data available


Update (6 January 2012): See this post for information on the source_GitHubData function that makes downloading data from GitHub easier.


Update (15 June 2012): See this post for instructions on how to download GitHub-based data into R if you are getting an error about an SSL certificate problem.


GitHub is designed for collaborating on coding projects. Nonetheless, it is also a potentially great resource for researchers to make their data publicly available. Specifically, you can use it to:
  • store data in the cloud for future use (for free),
  • track changes,
  • make data publicly available for replication,
  • create a website to nicely present key information about the data,
and uniquely:
  • benefit from error checking by the research community.
This is an example of a data set that I’ve put up on GitHub.

How?

Taking advantage of these things through GitHub is pretty easy. In this post I’m going to give a brief overview of how to set up a GitHub data repository.
Note: I’ll assume that you have already set up your GitHub account. If you haven’t done this, see the instructions here (for setup on the command line), here (for the Mac GUI program), or here (for the Windows GUI program).

Store Data in the Cloud

Data basically consists of two parts: the data and description files that explain what the data means and how we obtained it. Both of these things can be simple text files, easily hosted on GitHub:
  1. Create a new repository on GitHub by clicking on the New Repository button on your GitHub home page. A repository is just a collection of files.
    • Have GitHub create a README.md file.
  2. Clone your repository to your computer.
    • If you are using GUI GitHub, on your repository’s GitHub main page simply click the Clone to Mac or Clone to Windows buttons (depending on your operating system).
    • If you are using command-line git:
      • First copy the repository’s URL. This is located on the repository’s GitHub home page near the top (it is slightly different from the page URL).
      • In the command line just use the git clone [URL] command. To clone the example data repository I use for this post type:
        $ git clone https://github.com/christophergandrud/Disproportionality_Data.git
      • Of course you can choose which directory on your computer to put the repository in with the cd command before running git clone.
  3. Fill the repository with your data and description file.
    • Use the README.md file as the place to describe your data, e.g. where you got it from, what project you used it for, and any notes. This file will be the first file people see when they visit your repository.
      • To format the README.md file use Markdown syntax.
    • Create a Data folder in the repository and save your data in it using some text-based format; I prefer .csv (see the short R sketch after this list). You can upload other types of files to GitHub, but if you save your data in a text-based format others can directly suggest changes and you can more easily track changes.
  4. Commit your changes and push them to GitHub.
    • In GUI GitHub click on your data repository, write a short commit summary, then click Commit & Sync.
    • In command-line git, first change your directory to the data repository with cd. Then add your changes with $ git add . (the full stop stages every changed file in the directory). This adds your changed files to the "staging area", from which you can commit them. If you want to see which files were changed, type git status -s.
      • Then commit the changes with:
      $ git commit -m 'a comment describing the changes'
      • Then push the committed changes to GitHub with:
      $ git push origin master
  5. Create a cover site with GitHub Pages. This creates a nice face for the data repository. To create the page:
    • Click the Admin button next to your repository’s name on its GitHub main page.
    • Under "GitHub Pages" click Automatic Page Generator. Then choose a layout you like, add a tracking ID if you want one, and publish the page.
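As mentioned in step 3, here is a minimal R sketch of saving a data frame as a plain-text CSV file in the repository’s Data folder. The data frame, its values, and the file name are purely illustrative placeholders.

    # A small, purely illustrative data frame (placeholder values)
    my.data <- data.frame(country = c("A", "B"),
                          year = c(2010, 2010),
                          value = c(1.2, 3.4))

    # Create the Data folder if it does not already exist
    dir.create("Data", showWarnings = FALSE)

    # Save the data frame as a plain-text CSV file
    # row.names = FALSE keeps R's row numbers out of the file
    write.csv(my.data, file = "Data/my_data.csv", row.names = FALSE)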


Track Changes

GitHub will now track every change you make to all files in the data repository each time you commit the changes. The GitHub website and GUI program have a nice interface for seeing these changes.
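If you prefer working from the command line, the same history can be inspected with standard git commands, for example:

    $ git log --stat     # list each commit and the files it changed
    $ git diff HEAD~1    # show line-by-line changes since the previous commit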

Replication Website

Once you set up the page described in Step 5, other researchers can easily download the whole data repository as either a .tar.gz or .zip archive. They can also go through your main page to the GitHub repository.
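If you would rather script that download, a minimal R sketch is below. The archive URL is an assumption based on GitHub’s usual .zip link pattern, so copy the exact link from the repository’s download button if it differs.

    # Archive URL pattern assumed; copy the exact .zip link from the repository page if it differs
    zip.url <- "https://github.com/christophergandrud/Disproportionality_Data/archive/master.zip"

    # Download the archive and unpack it in the working directory
    # (some systems need method = "curl" or "wget" for https downloads)
    download.file(zip.url, destfile = "Disproportionality_Data.zip", method = "curl")
    unzip("Disproportionality_Data.zip")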
Specific data files can be directly downloaded into R with the RCurl package (and textConnection from the base package). To download my example data into R just type:
    # Load RCurl to fetch files over https
    library(RCurl)

    # The file's raw GitHub URL (not the regular repository page URL)
    url <- "https://raw.github.com/christophergandrud/Disproportionality_Data/master/Disproportionality.csv"

    # Download the file as text, then read it into a data frame
    disproportionality.data <- getURL(url)
    disproportionality.data <- read.csv(textConnection(disproportionality.data))
Note: make sure you copy the file’s raw GitHub URL.
You can use this to load GitHub-based data directly into your Sweave or knitr file for replication.
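For example, a knitr .Rnw document could load the data in a chunk like the following (a sketch; the chunk name and options are illustrative):

    <<load-data, message=FALSE>>=
    library(RCurl)
    url <- "https://raw.github.com/christophergandrud/Disproportionality_Data/master/Disproportionality.csv"
    disproportionality.data <- read.csv(textConnection(getURL(url)))
    @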

Improve your data through community error checking

GitHub has really made open source coding projects much easier. Anybody can view a project’s entire code and suggest improvements. This is done with a pull request. If the owner of the project’s repository likes the changes, they can accept the request.
Researchers can use this same function to suggest changes to a data set. If other researchers notice an error in a data set they can suggest a change with a pull request. The owner of the data set can then decide whether or not to accept the change.
Hosting data on GitHub and using pull requests allows data to benefit from the kind of community-led error checking that has been common on wikis and open source coding projects for a while.
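From the suggesting researcher’s side, the mechanics look roughly like this (a sketch; the fork URL, file name, and commit message are placeholders, and the pull request itself is opened on the GitHub website):

    # Clone your fork of the data repository (placeholder URL)
    $ git clone https://github.com/your-username/Disproportionality_Data.git
    $ cd Disproportionality_Data

    # Correct the error in the CSV file, then commit and push the fix to your fork
    $ git add Disproportionality.csv
    $ git commit -m 'fix mislabelled observation'
    $ git push origin master

    # Finally, open a pull request on GitHub so the data set's owner can review the change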


Git Resources

  • Pro Git: a free book on how to use command line git.
  • Git Reference: another good reference for command line git.
  • github:help: GitHub’s reference pages.

Comments

SteveMcE said…
Have you considered how you might include the capacity for citation of this data, given you've created the data file and are interested in its reuse - and deserve appropriate credit for doing so. Perhaps you could consider the possibility for a persistent identifier such as DOI, Handle - or even one of the underlying identifiers within Git?
Unknown said…
That's a great comment. I probably should have addressed this in the original post.

So far I have assumed that the data would be cited in much the same way as other electronically available data.

If the data is used in a published work you could just cite the author, date published, URL, and date accessed, or something like that depending on your citation style of choice.

If it's used for something on a website, you could just link to the main page.

Hopefully the article I wrote using this data will be published. In which case, people should cite/link to that article.

These citation procedures were pretty much what I did with the data I used to create the data set on the GitHub page and in the working paper.

A DOI or Git identifier might definitely be something to look into though. I've put it on the list of future things to blog about.
Tom Roche said…
"Data basically consists of two parts, the data and description files that explain what the data means and how we obtained it. Both of these things can be simple text files" :-) Lucky you, working in a field where data can be text. I use github for code, but must upload data in a binary format (netCDF) because that's what's used (in atmospheric modeling) and it's already too big in the binary format - text would just blow out my repo size.
Unknown said…
Tom, thanks for pointing this out.

I really envy atmospheric science for having access to data sets that are so large they can't easily be stored in binary format.
Sam Joseph said…
I just wanted to say great blog post. I had been wondering if there was a "github for data" and as you point out, it could be github :-) I wonder how that fits in with all the public data sets http://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-public github has a 1GB storage limit, which is fine for many purposes of course ... I guess the other concern might be github disappearing in the future - might be nice if there was some way to replicate across a few different storage services ...
Unknown said…
Sam

I think your point about a lack of integration with other public data set repositories is probably one of the bigger weaknesses of the GitHub for data storage approach right now.

But it is definitely not an insurmountable problem since fundamentally the data is just a hosted CSV file, the link to which could easily be cross-posted in multiple places.

Re GitHub not being around in the future: this would be inconvenient from an access point of view (e.g. broken URLs in citations). However, it wouldn't be a big problem for either the data itself or its version history (all of the changes made to it). These are all recorded by Git, which is an open standard separate from GitHub, and though new, I think it is reasonable to assume Git will be around for a long time.

If GitHub shut down, you could easily push the entire data set, all ancillary files, and version history to another GitHub-like service (e.g. Bitbucket) or host it yourself.
