23 February 2014

Programmatically download political science data with the psData package

A lot of progress has been made on improving political scientists’ ability to access data ‘programmatically’, e.g. data can be downloaded with R source code. Packages such as WDI for the World Bank's World Development Indicators and dvn for many data sets stored on the Dataverse Network make it much easier for political scientists to use this data as part of a highly integrated and reproducible workflow.

There are nonetheless still many commonly used political science data sets that aren’t easily accessible to researchers. Recently, I’ve been using the Database of Political Institutions (DPI), Polity IV democracy indicators, and Reinhart and Rogoff’s (2010) financial crisis occurrence data. All three of these data sets are freely available for download online. However, getting them, cleaning them up, and merging them together is kind of a pain. This is especially true for the Reinhart and Rogoff data, which is in 4 Excel files with over 70 individual sheets, one for each country’s data.

Also, I’ve been using variables that are combinations and/or transformations of indicators in regularly updated data sets, but which themselves aren’t regularly updated. In particular, Bueno de Mesquita et al. (2003) devised two variables that they called the ‘winset’ and the ‘selectorate’. These are basically specific combinations of data in DPI and Polity IV. However, the winset and selectorate variables haven’t been updated alongside the yearly updates of DPI and Polity IV.

There are two big problems here:

  1. A lot of time is wasted by political scientists (and their RAs) downloading, cleaning, and transforming these data sets for their own research.

  2. There are many opportunities while doing this work to introduce errors. Imagine the errors that might be introduced and go unnoticed if a copy-and-paste approach is used to merge the 70 Reinhart and Rogoff Excel sheets.
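The alternative to copy-and-paste is to stack the per-country tables in code. Here is a minimal sketch in base R, assuming each sheet has already been read into a data frame. The country labels, column names, and values below are invented placeholders, not actual Reinhart and Rogoff data, and this is not necessarily how psData does it internally.

```r
# Hypothetical stand-ins for per-country sheets (all values are invented)
sheets <- list(
  CountryA = data.frame(year = c(2000, 2001), crisis = c(0, 1)),
  CountryB = data.frame(year = c(2000, 2001), crisis = c(0, 0))
)

# Tag each sheet with the name of the country it came from
tagged <- Map(function(df, name) cbind(country = name, df),
              sheets, names(sheets))

# Stack all tagged sheets into one long table
combined <- do.call(rbind, unname(tagged))

combined
```

Because every sheet goes through the same function, a mistake shows up in all countries at once instead of hiding in one hand-pasted block.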

As a solution, I’ve been working on a new R package called psData. This package includes functions that automate the gathering, cleaning, and creation of common political science data and variables. So far (February 2014) it gathers DPI, Polity IV, and Reinhart and Rogoff data, and creates the winset and selectorate variables. Hopefully the package will save political scientists a lot of time and reduce the number of data management errors.

There certainly could be errors in the way psData gathers data. However, once spotted, errors can easily be reported on the package’s Issues page. Once an error is fixed, the correction reaches all users via a package update.

Types of functions

There are two basic types of functions in psData: Getters and Variable Builders. Getter functions automate the gathering and cleaning of particular data sets so that they can easily be merged with other data. They do not transform the underlying data. Variable Builders use Getters to gather data and then transform it into new variables suggested by the political science literature.
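To make the distinction concrete, here is a toy sketch of the pattern. ToyGet and ToyDemocracyCreator are hypothetical names that are not part of psData, the rows are invented, and the polity2 >= 6 cut-off is just one common convention used as an example.

```r
# Hypothetical stand-in for a Getter: returns a small, cleaned data frame.
# A real Getter such as PolityGet() would download and clean the actual data.
ToyGet <- function() {
  data.frame(
    iso2c   = c("AA", "AA", "BB"),   # placeholder country codes
    year    = c(2000, 2001, 2000),
    polity2 = c(-6, 7, 10),          # invented values
    stringsAsFactors = FALSE
  )
}

# Hypothetical Variable Builder: calls the Getter, then derives a new variable
# without altering the columns the Getter returned.
ToyDemocracyCreator <- function(threshold = 6) {
  dat <- ToyGet()
  dat$democracy <- as.integer(dat$polity2 >= threshold)
  dat
}

ToyDemocracyCreator()
```

Keeping the two steps separate means the derived variable can be rebuilt whenever the underlying data set is updated.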


To download only the polity2 variable from Polity IV:

# Load package
library(psData)

# Download polity2 variable
PolityData <- PolityGet(vars = "polity2")

# Show data
head(PolityData)

##   iso2c     country year polity2
## 1    AF Afghanistan 1800      -6
## 2    AF Afghanistan 1801      -6
## 3    AF Afghanistan 1802      -6
## 4    AF Afghanistan 1803      -6
## 5    AF Afghanistan 1804      -6
## 6    AF Afghanistan 1805      -6

Note that the iso2c variable is the ISO two-letter country code. This standardised country identifier makes it easy to merge the Polity IV data with another data set. Another country ID type can be selected with the OutCountryID argument. See the package documentation for details.
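As a rough sketch of such a merge, the snippet below joins two toy data frames on iso2c and year with base R's merge(). Both data frames, the placeholder country codes, and the growth variable are invented for illustration. In practice the first would come from PolityGet().

```r
# Toy data shaped like the PolityGet() output (placeholder codes, invented values)
polity <- data.frame(
  iso2c   = c("AA", "AA", "BB"),
  year    = c(2000, 2001, 2000),
  polity2 = c(-6, -6, 8),
  stringsAsFactors = FALSE
)

# A second, hypothetical data set that uses the same country ID and year
other <- data.frame(
  iso2c  = c("AA", "BB", "BB"),
  year   = c(2000, 2000, 2001),
  growth = c(1.5, 2.0, 2.5),
  stringsAsFactors = FALSE
)

# Keep every Polity row; unmatched country-years get NA for growth
merged <- merge(polity, other, by = c("iso2c", "year"), all.x = TRUE)

merged
```

Using all.x = TRUE keeps country-years that are missing from the second data set, so gaps stay visible rather than silently dropping rows.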

To create winset (W) and selectorate (ModS) data use the following code:

# Create winset and selectorate variables
WinData <- WinsetCreator()

# Show data
head(WinData)


##    iso2c     country year    W ModS
## 1     AF Afghanistan 1975 0.25    0
## 2     AF Afghanistan 1976 0.25    0
## 3     AF Afghanistan 1977 0.25    0
## 15    AF Afghanistan 1989 0.50    0
## 16    AF Afghanistan 1990 0.50    0
## 17    AF Afghanistan 1991 0.50    0


psData should be on CRAN soon, but while it is in the development stage you can install it with the devtools package:

devtools::install_github('christophergandrud/psData')


Please feel free to suggest other data set downloading and variable creating functions. To do this just leave a note on the package’s Issues page or make a pull request with a new function added.


Vincent said...

Nowadays I just use the quality of governance data set. It has most of those you mention already cleaned up.

Christopher Gandrud said...

Vincent, great idea. I've added the quality of governance indicators to the suggestions list.

Correct me if I'm wrong, but it looks like they just merge the data in a big file, but don't update variables that aren't updated in the original data sets?

dartthrowingchimp said...

Fantastic! Thanks so much for doing this.

Fr. said...

I've written similar code to easily manipulate cross-sectional / time series data in R, such as the Quality of Government dataset. Here's the rank amateur code, from a year back:


Would you be interested if I take a look at your package and submit the useful bits of my draft as additional functions?