30 March 2014

Numbering Subway Exits

In a bit of an aside from what I usually work on, I've put together a small website with a simple purpose: advocating for subway station exits to be numbered. These are really handy for finding your way around and are common in East Asia. But I've never seen them in Western countries.

If you're interested, check out the site:

23 February 2014

Programmatically download political science data with the psData package

A lot of progress has been made on improving political scientists’ ability to access data ‘programmatically’, e.g. data can be downloaded with R source code. Packages such as WDI for the World Bank’s World Development Indicators and dvn for many data sets stored on the Dataverse Network make it much easier for political scientists to use this data as part of a highly integrated and reproducible workflow.

There are nonetheless still many commonly used political science data sets that aren’t easily accessible to researchers. Recently, I’ve been using the Database of Political Institutions (DPI), Polity IV democracy indicators, and Reinhart and Rogoff’s (2010) financial crisis occurrence data. All three of these data sets are freely available for download online. However, getting them, cleaning them up, and merging them together is kind of a pain. This is especially true for the Reinhart and Rogoff data, which is in 4 Excel files with over 70 individual sheets, one for each country’s data.

Also, I’ve been using variables that are combinations and/or transformations of indicators in regularly updated data sets, but which themselves aren’t regularly updated. In particular, Bueno de Mesquita et al. (2003) devised two variables that they called the ‘winset’ and the ‘selectorate’. These are basically specific combinations of data in DPI and Polity IV. However, the winset and selectorate variables haven’t been updated alongside the yearly updates of DPI and Polity IV.

There are two big problems here:

  1. A lot of time is wasted by political scientists (and their RAs) downloading, cleaning, and transforming these data sets for their own research.

  2. There are many opportunities while doing this work to introduce errors. Imagine the errors that might be introduced and go unnoticed if a copy-and-paste approach is used to merge the 70 Reinhart and Rogoff Excel sheets.

As a solution, I’ve been working on a new R package called psData. This package includes functions that automate the gathering, cleaning, and creation of commonly used political science data and variables. So far (February 2014) it gathers the DPI, Polity IV, and Reinhart and Rogoff data, and creates the winset and selectorate variables. Hopefully the package will save political scientists a lot of time and reduce the number of data management errors.

There could certainly be errors in the way psData gathers data. However, once spotted, errors can easily be reported on the package’s Issues page. Once fixed, the correction will reach all users via a package update.

Types of functions

There are two basic types of functions in psData: Getters and Variable Builders. Getter functions automate the gathering and cleaning of particular data sets so that they can easily be merged with other data. They do not transform the underlying data. Variable Builders use Getters to gather data and then transform it into new variables suggested by the political science literature.

Examples

To download only the polity2 variable from Polity IV:

# Load package
library(psData)

# Download polity2 variable
PolityData <- PolityGet(vars = "polity2")

# Show data
head(PolityData)


##   iso2c     country year polity2
## 1    AF Afghanistan 1800      -6
## 2    AF Afghanistan 1801      -6
## 3    AF Afghanistan 1802      -6
## 4    AF Afghanistan 1803      -6
## 5    AF Afghanistan 1804      -6
## 6    AF Afghanistan 1805      -6

Note that the iso2c variable is the ISO two-letter country code, used here as the country ID. This standardised country identifier makes it easy to merge the Polity IV data with other data sets. A different country ID type can be selected with the OutCountryID argument. See the package documentation for details.
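
For example, here is a minimal sketch of such a merge. The second data set and its values are made up purely for illustration; in practice you would use your own data with matching iso2c and year columns:

# A hypothetical second data set with the same country and year identifiers
OtherData <- data.frame(iso2c = "AF", year = 1800:1805,
                        gdp = c(1.0, 1.1, 1.2, 1.1, 1.3, 1.4))

# Merge with the Polity IV data on the shared identifiers
CombinedData <- merge(PolityData, OtherData, by = c("iso2c", "year"))
head(CombinedData)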

To create winset (W) and selectorate (ModS) data use the following code:

WinData <- WinsetCreator()

head(WinData)


##    iso2c     country year    W ModS
## 1     AF Afghanistan 1975 0.25    0
## 2     AF Afghanistan 1976 0.25    0
## 3     AF Afghanistan 1977 0.25    0
## 15    AF Afghanistan 1989 0.50    0
## 16    AF Afghanistan 1990 0.50    0
## 17    AF Afghanistan 1991 0.50    0

Install

psData should be on CRAN soon, but while it is in the development stage you can install it with the devtools package:

# Install the development version from GitHub
devtools::install_github('psData', 'christophergandrud')

Suggestions

Please feel free to suggest other data set downloading and variable building functions. To do this, just leave a note on the package’s Issues page or make a pull request with a new function added.

4 January 2014

How I Accidentally Wrote a Paper on Supervisory Transparency in the European Union and Why You Should Too

Research is an unpredictable thing. You head in one direction, but end up going another. Here is a recent example:

A co-author and I had an idea for a paper. It's a long story, but basically we wanted to compare banks in the US to those in the EU. This was a situation where our desire to explore a theory was egged on by what we believed was available data. In the US it's easy to gather data on banks because the regulators have a nice website where they release the filings banks send them. The data is in a really good format for statistical analysis. US done. We thought our next move would be to quickly gather similar data for EU banks and we would be on our way.

First we contacted the UK's Financial Conduct Authority. Surprisingly, they told us that not only did they not release this data, but it was actually illegal for them to do so. Pretty frustrating. Answers to one question stymied by a lack of data. Argh. I guess we'll just keep looking to see what kind of sample of European countries we can come up with and hope it doesn't lead to ridiculous sampling bias.

We eventually found that only 11 EU countries (out of 28) release any such data and it is in a hodgepodge of different formats making it very difficult to compare across countries. This is remarkable when compared to the US and kind of astounded me given my open data priors. We then looked at EU level initiatives to increase supervisory transparency. The European Banking Authority has made a number of attempts to increase transparency. For example, it asks Member States to submit some basic aggregate level data about their banking systems. It makes this data available on its website.

Countries have a lot of reporting flexibility. They can even choose to label specific items as non-material, non-applicable, or even confidential. Remarkably, a fair number of countries don't even do this. They just report nothing, as we can see here:

Number of Member States without missing aggregate banking data reported to the EBA


Though almost all Member States reported data during the height of the crisis (2009), this was an aberration. In fact, a lot of countries just don't report anything.

What, we asked, are the implications of this secrecy? We decided to do more research to find out. We published the first outcome of this project here. Have a look if you're interested, but that isn't the point of this blog post. The point is that we set off to research one thing, but ended up stumbling upon another problem that was worthy of investigation.

This post's title is clearly a bit facetious. But the general point is serious. We need to be flexible enough and curious enough to be able to answer the questions we didn't even know to ask before we go looking.

6 December 2013

Three Quick and Simple Data Cleaning Helper Functions (December 2013)

As I go about cleaning and merging data sets with R I often end up creating and using simple functions over and over. When this happens, I stick them in the DataCombine package. This makes it easier for me to remember how to do an operation and others can possibly benefit from simplified and (hopefully) more intuitive code.

I've talked about some of the commands in DataCombine in previous posts. In this post I'll give examples for a few more that I've added over the past couple of months. Note: these examples are based on DataCombine version 0.1.11.

Here is a brief run down of the functions covered in this post:

  • FindReplace: a function to replace multiple patterns found in a character string column of a data frame.

  • MoveFront: moves variables to the front of a data frame. This can be useful if you have a data frame with many variables and want a particular variable or variables listed first.

  • rmExcept: removes all objects from a work space except those specified by the user.

FindReplace

Recently I needed to replace many patterns in a column of strings. Here is a short example. Imagine we have a data frame like this:

ABData <- data.frame(a = c("London, UK", "Oxford, UK", "Berlin, DE",
                           "Hamburg, DE", "Oslo, NO"),
                     b = c(8, 0.1, 3, 2, 1))

Ok, now I want to replace the UK and DE parts of the strings with England and Germany. So I create a data frame with two columns. The first records the pattern and the second records what I want to replace the pattern with:

Replaces <- data.frame(from = c("UK", "DE"), to = c("England", "Germany"))

Now I can just use FindReplace to make the replacements all at once:

library(DataCombine)

ABNewDF <- FindReplace(data = ABData, Var = "a", replaceData = Replaces,
                       from = "from", to = "to", exact = FALSE)

# Show changes
ABNewDF
##                  a   b
## 1  London, England 8.0
## 2  Oxford, England 0.1
## 3  Berlin, Germany 3.0
## 4 Hamburg, Germany 2.0
## 5         Oslo, NO 1.0

If you set exact = TRUE, then FindReplace will only replace exact pattern matches. You can also set vector = TRUE to return only a vector of the column you replaced (the Var column), rather than the whole data frame.
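
For example, here is a quick sketch of the vector option, reusing the objects created above:

# Return just the replaced column as a vector, rather than the whole data frame
ABNewVector <- FindReplace(data = ABData, Var = "a", replaceData = Replaces,
                           from = "from", to = "to", exact = FALSE,
                           vector = TRUE)
ABNewVector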

MoveFront

On occasion I've wanted to move a few variables to the front of a data frame. The MoveFront function makes this pretty simple. It has only two arguments: data and Var. data is the data frame and Var is a character vector naming the columns I want to move to the front of the data frame, in the order that I want them. Here is an example:

# Create dummy data
A <- B <- C <- 1:50
OldOrder <- data.frame(A, B, C)

names(OldOrder)
## [1] "A" "B" "C"
# Move B and A to the front
NewOrder2 <- MoveFront(OldOrder, c("B", "A"))
names(NewOrder2)
## [1] "B" "A" "C"

rmExcept

Finally, sometimes I want to clean up my work space and only keep specific objects. I want to remove everything else. This is straightforward with rmExcept. For example:

# Create objects
A <- 1
B <- 2
C <- 3

# Remove all objects except for A
rmExcept("A")
## Removed the following objects:
## ABData, ABNewDF, B, C, NewOrder2, OldOrder, Replaces
# Show workspace
ls()
## [1] "A"

You can set the environment you want to clean up with the envir argument. By default it is your global environment.
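
For example, here is a small sketch of cleaning up a separate environment (the environment and object names here are arbitrary, made up just for illustration):

# Create a separate environment containing a few objects
TempEnv <- new.env()
assign("x", 1, envir = TempEnv)
assign("y", 2, envir = TempEnv)

# Remove everything in TempEnv except x
rmExcept("x", envir = TempEnv)
ls(envir = TempEnv)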

2 September 2013

Showing results from Cox Proportional Hazard Models in R with simPH

Update 2 February 2014: A new version of simPH (Version 1.0) will soon be available for download from CRAN. It allows you to plot using points, ribbons, and (new) lines. See the updated package description paper for examples. Note that the ribbons argument will no longer work as in the examples below. Please use type = 'ribbons' (or 'points' or 'lines').

Effectively showing estimates and uncertainty from Cox Proportional Hazard (PH) models, especially for interactive and non-linear effects, can be challenging with currently available software. So, researchers often simply display a results table. These are pretty useless for Cox PH models. It is difficult to decipher a simple linear variable’s estimated effect and basically impossible to understand time interactions, interactions between variables, and nonlinear effects without the reader further calculating quantities of interest for a variety of fitted values.

So, I’ve been putting together the simPH R package to hopefully make it easier to show results from Cox PH models. The package uses plots of post-estimation simulations (the same idea behind the plotting facilities in the Zelig package) to show Cox PH estimates and the uncertainty surrounding them.

Here I want to give an overview of how to use simPH. First the general process, then a few examples. Have a look at this paper for details about the math behind the graphs.

General Steps

There are three steps to using simPH:

  1. Estimate a Cox PH model in the usual way with the coxph command in the survival package.

  2. Simulate quantities of interest--hazard ratios, first differences, marginal effects, relative hazards, or hazard rates--with the appropriate simPH simulation command.

  3. Plot the simulations with the simGG method.

A Few Examples

Here are some basic examples that illustrate the process and key syntax. Before getting started you can install simPH in the normal R way from CRAN.

Linear effects

Let’s look at a simple example with a linear non-interactive effect. The data set I’m using is from Carpenter (2002). It is included with simPH. See the package documentation for details.

First, let’s estimate a Cox PH model where the event of interest is a drug receiving FDA approval.

# Load packages
library(survival)
library(simPH)

# Load Carpenter (2002) data
data("CarpenterFdaData")

# Estimate survival model
M1 <- coxph(Surv(acttime, censor) ~ prevgenx + lethal +
            deathrt1 + acutediz + hosp01 + hhosleng +
            mandiz01 + femdiz01 + peddiz01 + orphdum +
            vandavg3 + wpnoavg3 + condavg3 + orderent +
            stafcder, data = CarpenterFdaData)

Now say we want to examine the effect that the number of FDA staff reviewing a drug application has on it being accepted. This variable is called stafcder in the model we just estimated. To do this, let’s use simPH to simulate hazard ratios for units \(j\) and \(l\), i.e. \(\frac{h_{j}(t)}{h_{l}(t)}\), using simPH’s coxsimLinear command, because we estimated the effect of the number of staff as linear. In the following code we use the Xj argument to set the \(j\) values. We could also set the \(l\) values with Xl, but since we don’t, coxsimLinear assumes all \(x_{l}\) are 0.

Notice the argument spin = TRUE. This finds the shortest 95% probability interval of the simulations using the SPIn method. SPIn can be especially useful for showing simulated quantities of interest generated from Cox PH models, because they can often be crowded close to a lower boundary (0 in the case of hazard rates). We should be more interested in the area with the highest probability, i.e. the most simulated values, rather than an arbitrary central interval. That being said, if spin = FALSE, then we simply find the central 95% interval of the simulations.

# Simulate and plot Hazard Ratios for stafcder variable
Sim1 <- coxsimLinear(M1, b = "stafcder", 
                    qi = "Hazard Ratio", ci = 0.95,
                    Xj = seq(1237, 1600, by = 2), spin = TRUE)
# Plot
simGG(Sim1)

Notice in the plot that each simulation is plotted as an individual point. These are all of the simulations in the shortest 95% probability interval. Each point has a bit of transparency (they are 90% transparent by default). So the plot is visually weighted; the darker areas of the graph have a higher concentration of the simulations. This gives us a very clear picture of the simulation distribution, i.e. the estimated effect and our uncertainty about it.

If you don’t want to plot every point, you can instead use ribbons showing the constricted 95% interval and the middle 50% of this interval. To do this, simply use the ribbons = TRUE argument.

simGG(Sim1, ribbons = TRUE, alpha = 0.5)


Notice the alpha = 0.5 argument. This sets the transparency of the widest ribbon to 50%.

The syntax we’ve used here is very similar to what we use when we are working with nonlinear effects estimated with polynomials and splines. Post-estimation simulations can be run with the coxsimPoly and coxsimSpline commands. See the simPH documentation for more examples.
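
As a rough sketch of what the polynomial workflow might look like (purely illustrative: the squared term for hhosleng and the Xj values are chosen for demonstration, not for substantive reasons):

# Re-estimate the model with a squared term for hhosleng (illustrative only)
M1Poly <- coxph(Surv(acttime, censor) ~ prevgenx + lethal + deathrt1 +
                  acutediz + hosp01 + hhosleng + I(hhosleng^2) + mandiz01 +
                  femdiz01 + peddiz01 + orphdum + vandavg3 + wpnoavg3 +
                  condavg3 + orderent + stafcder,
                data = CarpenterFdaData)

# Simulate and plot relative hazards across the observed range of hhosleng
StayRange <- range(CarpenterFdaData$hhosleng, na.rm = TRUE)
SimPoly <- coxsimPoly(M1Poly, b = "hhosleng", pow = 2, qi = "Relative Hazard",
                      Xj = seq(StayRange[1], StayRange[2], length.out = 20))
simGG(SimPoly)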

Interactions

The syntax for two-way interactions simulated with the coxsimInteract command is a little different. Using the same data, let's look at how to show results for interactions. In the following model we interact two variables, lethal and prevgenx. We can think of these variables as X1 and X2, respectively. For interactions it can be useful to examine the marginal effect. To find the marginal effect of a one unit increase in the lethal variable given various values of prevgenx, let's use the following code:

# Estimate the model
M2 <- coxph(Surv(acttime, censor) ~ lethal*prevgenx, data = CarpenterFdaData)
 
# Simulate Marginal Effect of lethal for multiple values of prevgenx
Sim2 <- coxsimInteract(M2, b1 = "lethal", b2 = "prevgenx",
                       qi = "Marginal Effect",
                       X2 = seq(2, 115, by = 2), nsim = 1000)

# Plot the results
simGG(Sim2, ribbons = TRUE, alpha = 0.5, xlab = "\nprevgenx",
      ylab = "Marginal effect of lethal\n")


The order of the X1 and X2 variables in the interaction matters. The marginal effect is always calculated for the X1 variable over a range of X2 values.
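
For instance, to examine instead the marginal effect of prevgenx across values of lethal, you would swap the two variables (a sketch; if lethal is a dummy variable, only the values 0 and 1 make sense for X2):

# Marginal effect of prevgenx at the two values of lethal (illustrative)
Sim3 <- coxsimInteract(M2, b1 = "prevgenx", b2 = "lethal",
                       qi = "Marginal Effect", X2 = c(0, 1), nsim = 1000)
simGG(Sim3)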

Notice also that we set the x and y axis labels with the xlab and ylab arguments.

Time-varying Effects

Finally, let’s look at how to use the coxsimtvc command to show results from effects that we estimate to vary over time. Here we are going to use another data set that is also included with simPH. The event of interest in the following model is when deliberation is stopped on a European Union directive (the model is from Licht (2011)). We will simulate relative hazards for the effect that the number of backlogged items (backlog) has on deliberation time. We will estimate the effect as a log-time interaction.

# Load Golub & Steunenberg (2007) data. The data is included with simPH.
data("GolubEUPData")

# Create natural log-time interactions
Golubtvc <- function(x){
  tvc(data = GolubEUPData, b = x, tvar = "end", tfun = "log")
}
GolubEUPData$Lcoop <- Golubtvc("coop")
GolubEUPData$Lqmv <- Golubtvc("qmv")
GolubEUPData$Lbacklog <- Golubtvc("backlog")
GolubEUPData$Lcodec <- Golubtvc("codec")
GolubEUPData$Lqmvpostsea <- Golubtvc("qmvpostsea")
GolubEUPData$Lthatcher <- Golubtvc("thatcher")

# Estimate model
M1 <- coxph(Surv(begin, end, event) ~ qmv + qmvpostsea + qmvpostteu +
              coop + codec + eu9 + eu10 + eu12 + eu15 + thatcher + 
              agenda + backlog + Lqmv + Lqmvpostsea + Lcoop + Lcodec +
              Lthatcher + Lbacklog,
            data = GolubEUPData, ties = "efron")

Much of the first half of the code is dedicated to creating the log-time interactions with the tvc command.

Now we simply create the simulations for a range of values of backlog and plot them. Note in the following code that we give coxsimtvc the name of the base backlog variable with the b argument and the name of its log-time interaction term Lbacklog with the btvc argument. We also need to tell coxsimtvc that it is a log-time interaction with tfun = "log".

# Create simtvc object for relative hazard
Sim2 <- coxsimtvc(obj = M1, b = "backlog", btvc = "Lbacklog",
                  qi = "Relative Hazard", Xj = seq(40, 200, 40),
                  tfun = "log", from = 1200, to = 2000, by = 10,
                  nsim = 500)

# Create relative hazard plot
simGG(Sim2, xlab = "\nTime in Days", ribbons = TRUE,
        leg.name = "Backlogged \n Items", alpha = 0.2)


simGG plots are ggplot2 plots

Finally, because almost every plot created by simGG is a ggplot2 plot, you can use almost the full range of customisation options that ggplot2 allows. See the ggplot2 documentation for many examples.
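
For example, something like the following should work (a sketch that reuses the first simulation object from above and adds a ggplot2 theme and title):

# Customise a simGG plot with standard ggplot2 layers
library(ggplot2)
simGG(Sim1, ribbons = TRUE, alpha = 0.5) +
  theme_bw() +
  ggtitle("Simulated hazard ratios for FDA staff")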

23 August 2013

GitHub renders CSV in the browser, becomes even better for social data set creation

I've written in a number of places about how GitHub can be a great place to store data. Unlike basically all other web data storage sites (many of which I really like, such as Dataverse and FigShare), GitHub enables deep social data set development and fits nicely into a reproducible research workflow with R.

One negative, though, especially compared to FigShare, was that there was no easy way to view CSV or TSV data files in the browser. Unless you downloaded the data and opened it in Excel or an R viewer or whatever, you had to look at the raw data file in the browser. It's basically impossible to make sense of a data set of any size like this.

However, as of today (at least), GitHub renders CSV and TSV data sets in the browser as you would expect. Take a look at their blog post for the details.

16 July 2013

Getting Started with Reproducible Research: A chapter from my new book

This is an abridged excerpt from Chapter 2 of my new book Reproducible Research with R and RStudio. It's published by Chapman & Hall/CRC Press.

You can purchase it on Amazon. "Search inside this book" includes a complete table of contents.

Researchers often start thinking about making their work reproducible near the end of the research process when they write up their results or maybe even later when a journal requires their data and code be made available for publication. Or maybe even later when another researcher asks if they can use the data from a published article to reproduce the findings. By then there may be numerous versions of the data set and records of the analyses stored across multiple folders on the researcher’s computers. It can be difficult and time consuming to sift through these files to create an accurate account of how the results were reached. Waiting until near the end of the research process to start thinking about reproducibility can lead to incomplete documentation that does not give an accurate account of how findings were made. Focusing on reproducibility from the beginning of the process and continuing to follow a few simple guidelines throughout your research can help you avoid these problems. Remember "reproducibility is not an afterthought–it is something that must be built into the project from the beginning" (Donoho, 2010, 386).

This chapter first gives you a brief overview of the reproducible research process: a workflow for reproducible research. Then it covers some of the key guidelines that can help make your research more reproducible.

The Big Picture: A Workflow for Reproducible Research

The three basic stages of a typical computational empirical research project are:

  • data gathering,
  • data analysis,
  • results presentation.

Instead of starting to use the individual tools of reproducible research as soon as you learn them, I recommend briefly stepping back and considering how the stages of reproducible research tie together overall. This will make your workflow more coherent from the beginning and save you a lot of backtracking later on. The figure below illustrates the workflow. Notice that most of the arrows connecting the workflow’s parts point in both directions, indicating that you should always be thinking about how to make it easier to go backwards through your research, i.e. reproduce it, as well as forwards.


Example Workflow & A Selection of Commands to Tie it Together


Around the edges of the figure are some of the commands that make it easier to go forwards and backwards through the process. These commands tie your research together. For example, you can use API-based R packages to gather data from the internet. You can use R’s merge command to combine data gathered from different sources into one data set. The getURL command from R’s RCurl package and the read.table command can be used to bring this data set into your statistical analyses. The knitr package then ties your analyses into your presentation documents. This includes the code you used, the figures you created, and, with the help of tools such as the xtable package, tables of results. You can even tie multiple presentation documents together. For example, you can access the same figure for use in a LaTeX article and a Markdown-created website with the includegraphics and ![]() commands, respectively. This helps you maintain a consistent presentation of results across multiple document types.
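
As a small sketch of what one of these ties might look like in practice (the URL below is just a placeholder for wherever your data set is actually stored):

# Download a CSV data set from a (placeholder) URL and bring it into R
library(RCurl)
DataUrl <- "https://example.com/path/to/main_data.csv"  # placeholder
RawData <- getURL(DataUrl)
MainData <- read.table(text = RawData, sep = ",", header = TRUE)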

Practical Tips for Reproducible Research

Before learning the details of the reproducible research workflow with R and RStudio, it’s useful to cover a few broad tips that will help you organize your research process and put these skills in perspective. The tips are:

  1. Document everything!
  2. Everything is a (text) file.
  3. All files should be human readable.
  4. Explicitly tie your files together.
  5. Have a plan to organize, store, and make your files available.

Using these tips will help make your computational research really reproducible.

Document everything!

In order to reproduce your research others must be able to know what you did. You have to tell them what you did by documenting as much of your research process as possible. Ideally, you should tell your readers how you gathered your data, analyzed it, and presented the results. Documenting everything is the key to reproducible research and lies behind all of the other tips here.

Everything is a (text) file

Your documentation is stored in files that include data, analysis code, the write up of results, and explanations of these files (e.g. data set codebooks, session info files, and so on). Ideally, you should use the simplest file format possible to store this information. Usually the simplest file format is the humble, but versatile, text file.

Text files are extremely nimble. They can hold your data in, for example, comma-separated values (.csv) format. They can contain your analysis code in .R files. And they can be the basis for your presentations as markup documents like .tex or .md, for LaTeX and Markdown files, respectively. All of these files can be opened by any program that can read text files.

One reason reproducible research is best stored in text files is that this helps future-proof your research. Other file formats, like those used by Microsoft Word (.docx) or Excel (.xlsx), change regularly and may not be compatible with future versions of these programs. Text files, on the other hand, can be opened by a very wide range of currently existing programs and, more likely than not, future ones as well. Even if future researchers do not have R or a LaTeX distribution, they will still be able to open your text files and, aided by frequent comments (see below), be able to understand how you conducted your research (Bowers, 2011, 3).

Text files are also very easy to search and manipulate with a wide range of programs–such as R and RStudio–that can find and replace text characters as well as merge and separate files. Finally, text files are easy to version and changes can be tracked using programs such as Git.

All files should be human readable

Treat all of your research files as if someone who has not worked on the project will, in the future, try to understand them. Computer code is a way of communicating with the computer. It is 'machine readable' in that the computer is able to use it to understand what you want to do. However, there is a very good chance that other people (or you six months in the future) will not understand what you were telling the computer. So, you need to make all of your files 'human readable'. To make them human readable, you should comment your code with the goal of communicating its design and purpose (Wilson et al., 2012). With this in mind, it is a good idea to comment frequently (Bowers, 2011, 3) and format your code using a style guide (Nagler, 1995). For especially important pieces of code you should use literate programming, where the source code and the presentation text describing its design and purpose appear in the same document (Knuth, 1992). Doing this will make it very clear to others how you accomplished a piece of research.

Explicitly tie your files together

If everything is just a text file then research projects can be thought of as individual text files that have a relationship with one another. They are tied together. A data file is used as input for an analysis file. The results of an analysis are shown and discussed in a markup file that is used to create a PDF document. Researchers often do not explicitly document the relationships between files that they used in their research. For example, the results of an analysis–a table or figure–may be copied and pasted into a presentation document. It can be very difficult for future researchers to trace the table or figure back to a particular statistical model and a particular data set without clear documentation. Therefore, it is important to make the links between your files explicit.

Tie commands are the most dynamic way to explicitly link your files together (see the figure above for examples). These commands instruct the computer program you are using to use information from another file.
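
For example, in an R analysis file a couple of simple ties might look like this (the file names are placeholders for your own project files):

# Run the data gathering script, then load the data set it creates
source("gather_data.R")                # placeholder script name
MainData <- read.csv("main_data.csv")  # placeholder data file name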

Have a plan to organize, store, & make your files available

Finally, in order for independent researchers to reproduce your work they need to be able to access the files that instruct them how to do this. Files also need to be organized so that independent researchers can figure out how they fit together. So, from the beginning of your research process you should have a plan for organizing your files and a way to make them accessible.

One rule of thumb for organizing your research in files is to limit the amount of content any one file has. Files that contain many different operations can be very difficult to navigate, even if they have detailed comments. For example, it would be very difficult to find any particular operation in a file that contained the code used to gather the data, run all of the statistical models, and create the results figures and tables. If you have a hard time finding things in a file you created, think of the difficulties independent researchers will have!

Because we have so many ways to link files together, there is really no need to lump many different operations into one file. So, we can make our files modular. One source code file should be used to complete one or just a few tasks. Breaking your operations into discrete parts will also make it easier for you and others to find errors (Nagler, 1995, 490).