
Programmatically download political science data with the psData package

A lot of progress has been made on improving political scientists’ ability to access data ‘programmatically’, e.g. data can be downloaded with R source code. Packages such as WDI for the World Bank's World Development Indicators and dvn for many data sets stored on the Dataverse Network make it much easier for political scientists to use this data as part of a highly integrated and reproducible workflow.

There are nonetheless still many commonly used political science data sets that aren’t easily accessible to researchers. Recently, I’ve been using the Database of Political Institutions (DPI), Polity IV democracy indicators, and Reinhart and Rogoff’s (2010) financial crisis occurrence data. All three of these data sets are freely available for download online. However, getting them, cleaning them up, and merging them together is kind of a pain. This is especially true for the Reinhart and Rogoff data, which is spread across four Excel files with over 70 individual sheets, one for each country’s data.

Also, I’ve been using variables that are combinations and/or transformations of indicators in regularly updated data sets, but which themselves aren’t regularly updated. In particular, Bueno de Mesquita et al. (2003) devised two variables that they called the ‘winset’ and the ‘selectorate’. These are basically specific combinations of data in DPI and Polity IV. However, the winset and selectorate variables haven’t been updated alongside the yearly updates of DPI and Polity IV.

There are two big problems here:

  1. A lot of time is wasted by political scientists (and their RAs) downloading, cleaning, and transforming these data sets for their own research.

  2. There are many opportunities while doing this work to introduce errors. Imagine the errors that might be introduced and go unnoticed if a copy-and-paste approach is used to merge the 70 Reinhart and Rogoff Excel sheets.

As a solution, I’ve been working on a new R package called psData. This package includes functions that automate the gathering, cleaning, and creation of common political science data and variables. So far (February 2014) it gathers DPI, Polity IV, and Reinhart and Rogoff data, and creates the winset and selectorate variables. Hopefully the package will save political scientists a lot of time and reduce the number of data management errors.

There certainly could be errors in the way psData gathers data. However, once spotted, these errors can easily be reported on the package’s Issues Page. Once an error is fixed, the correction reaches all users via a package update.

Types of functions

There are two basic types of functions in psData: Getters and Variable Builders. Getter functions automate the gathering and cleaning of particular data sets so that they can easily be merged with other data. They do not transform the underlying data. Variable Builders use Getters to gather data and then transform it into new variables suggested by the political science literature.
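The division of labour between the two function types can be sketched with toy code. This is only an illustration: toy_get and toy_builder are hypothetical names, the data frame is made up, and the 'combined' variable is an arbitrary transformation, not how psData or the winset and selectorate variables are actually implemented.

```r
# Hypothetical getter: returns cleaned data without transforming any values.
# A real getter would download and clean a file; here the "download" is
# replaced by a small in-memory data frame with made-up values.
toy_get <- function(vars = c("score_a", "score_b")) {
  raw <- data.frame(
    iso2c = c("AA", "AA", "BB"),
    year = c(2000, 2001, 2000),
    score_a = c(1, 2, 3),
    score_b = c(0, 1, 1),
    stringsAsFactors = FALSE
  )
  # Always keep the ID columns, plus only the requested variables
  raw[, c("iso2c", "year", vars), drop = FALSE]
}

# Hypothetical variable builder: calls the getter, then derives a new variable
toy_builder <- function() {
  dat <- toy_get(vars = c("score_a", "score_b"))
  # Arbitrary illustrative transformation, not a formula from the literature
  dat$combined <- (dat$score_a + dat$score_b) / 4
  dat[, c("iso2c", "year", "combined")]
}

built <- toy_builder()
print(built)
```

Because the builder only ever touches data through the getter, a fix to the getter's cleaning step automatically carries through to every variable built on top of it.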

Examples

To download only the polity2 variable from Polity IV:

# Load package
library(psData)

# Download polity2 variable
PolityData <- PolityGet(vars = "polity2")

# Show data
head(PolityData)


##   iso2c     country year polity2
## 1    AF Afghanistan 1800      -6
## 2    AF Afghanistan 1801      -6
## 3    AF Afghanistan 1802      -6
## 4    AF Afghanistan 1803      -6
## 5    AF Afghanistan 1804      -6
## 6    AF Afghanistan 1805      -6

Note that the iso2c variable contains ISO two-letter country codes. This standardised country identifier can be used to easily merge the Polity IV data with another data set. Another country ID type can be selected with the OutCountryID argument. See the package documentation for details.
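A minimal sketch of such a merge in base R. Both data frames here are toy stand-ins with made-up values (polity_toy mimics the shape of the PolityGet output, other_toy and its gdp_growth column are hypothetical), so nothing is downloaded. Since the data is country-year, the merge keys are both iso2c and year.

```r
# Toy stand-in shaped like the output of PolityGet(vars = "polity2");
# the values are made up for illustration
polity_toy <- data.frame(
  iso2c = c("AF", "AF", "AL", "AL"),
  year = c(2000, 2001, 2000, 2001),
  polity2 = c(-7, -7, 5, 5),
  stringsAsFactors = FALSE
)

# Hypothetical second data set that uses the same identifiers
other_toy <- data.frame(
  iso2c = c("AF", "AL", "AL"),
  year = c(2000, 2000, 2001),
  gdp_growth = c(1.2, 6.7, 7.9),
  stringsAsFactors = FALSE
)

# Merge on country and year; all.x = TRUE keeps every Polity row,
# filling gdp_growth with NA where other_toy has no match
merged <- merge(polity_toy, other_toy, by = c("iso2c", "year"), all.x = TRUE)
print(merged)
```

Keeping unmatched rows with all.x = TRUE makes gaps in the second data set visible as NA rather than silently dropping country-years.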

To create winset (W) and selectorate (ModS) data use the following code:

WinData <- WinsetCreator()

head(WinData)


##    iso2c     country year    W ModS
## 1     AF Afghanistan 1975 0.25    0
## 2     AF Afghanistan 1976 0.25    0
## 3     AF Afghanistan 1977 0.25    0
## 15    AF Afghanistan 1989 0.50    0
## 16    AF Afghanistan 1990 0.50    0
## 17    AF Afghanistan 1991 0.50    0

Install

psData should be on CRAN soon, but while it is in the development stage you can install it with the devtools package:

devtools::install_github('christophergandrud/psData')

Suggestions

Please feel free to suggest other data set downloading and variable creating functions. To do this just leave a note on the package’s Issues page or make a pull request with a new function added.

Comments

Vincent said…
Nowadays I just use the quality of governance data set. It has most of those you mention already cleaned up.
Vincent, great idea. I've added the quality of governance indicators to the suggestions list.

Correct me if I'm wrong, but it looks like they just merge the data in a big file, but don't update variables that aren't updated in the original data sets?
Fantastic! Thanks so much for doing this.
Fr. said…
I've written similar code to easily manipulate cross-sectional / time series data in R, such as the Quality of Government dataset. Here's the rank amateur code, from a year back:

https://github.com/briatte/qogdata

Would you be interested if I take a look at your package and submit the useful bits of my draft as additional functions?
