## 6 December 2013

### Three Quick and Simple Data Cleaning Helper Functions (December 2013)

As I go about cleaning and merging data sets with R I often end up creating and using simple functions over and over. When this happens, I stick them in the DataCombine package. This makes it easier for me to remember how to do an operation and others can possibly benefit from simplified and (hopefully) more intuitive code.

I've talked about some of the commands in DataCombine in previous posts. In this post I'll give examples for a few more that I've added over the past couple of months. Note: these examples are based on DataCombine version 0.1.11.

Here is a brief run down of the functions covered in this post:

• FindReplace: a function to replace multiple patterns found in a character string column of a data frame.

• MoveFront: moves variables to the front of a data frame. This can be useful if you have a data frame with many variables and want to move a variable or variables to the front.

• rmExcept: removes all objects from a work space except those specified by the user.

# FindReplace

Recently I needed to replace many patterns in a column of strings. Here is a short example. Imagine we have a data frame like this:

ABData <- data.frame(a = c("London, UK", "Oxford, UK", "Berlin, DE",
                           "Hamburg, DE", "Oslo, NO"),
                     b = c(8, 0.1, 3, 2, 1))


Ok, now I want to replace the UK and DE parts of the strings with England and Germany. So I create a data frame with two columns. The first records the pattern and the second records what I want to replace the pattern with:

Replaces <- data.frame(from = c("UK", "DE"), to = c("England", "Germany"))


Now I can just use FindReplace to make the replacements all at once:

library(DataCombine)

ABNewDF <- FindReplace(data = ABData, Var = "a", replaceData = Replaces,
                       from = "from", to = "to", exact = FALSE)

# Show changes
ABNewDF

##                  a   b
## 1  London, England 8.0
## 2  Oxford, England 0.1
## 3  Berlin, Germany 3.0
## 4 Hamburg, Germany 2.0
## 5         Oslo, NO 1.0


If you set exact = TRUE then FindReplace will only replace exact pattern matches. Also, you can set vector = TRUE to return only a vector of the column you replaced (the Var column), rather than the whole data frame.
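For example, here is a quick sketch (reusing the ABData and Replaces objects from above) that returns just the replaced column as a vector instead of the whole data frame:

ABVector <- FindReplace(data = ABData, Var = "a", replaceData = Replaces,
                        from = "from", to = "to", exact = FALSE,
                        vector = TRUE)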

# MoveFront

On occasion I've wanted to move a few variables to the front of a data frame. The MoveFront function makes this pretty simple. It only has two arguments: data and Var. data is the data frame and Var is a character vector naming the columns I want to move to the front of the data frame, in the order that I want them. Here is an example:

# Create dummy data
A <- B <- C <- 1:50
OldOrder <- data.frame(A, B, C)

names(OldOrder)

## [1] "A" "B" "C"

# Move B and A to the front
NewOrder2 <- MoveFront(OldOrder, c("B", "A"))
names(NewOrder2)

## [1] "B" "A" "C"


# rmExcept

Finally, sometimes I want to clean up my work space and only keep specific objects. I want to remove everything else. This is straightforward with rmExcept. For example:

# Create objects
A <- 1
B <- 2
C <- 3

# Remove all objects except for A
rmExcept("A")

## Removed the following objects:
## ABData, ABNewDF, B, C, NewOrder2, OldOrder, Replaces

# Show workspace
ls()

## [1] "A"


You can set the environment you want to clean up with the envir argument. By default it is your global environment.
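For instance, here is a minimal sketch that explicitly points rmExcept at the global environment (which is also the default):

rmExcept("A", envir = globalenv())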

## 2 September 2013

### Showing results from Cox Proportional Hazard Models in R with simPH

Effectively showing estimates and uncertainty from Cox Proportional Hazard (PH) models, especially for interactive and non-linear effects, can be challenging with currently available software. So researchers often simply display a results table. These are pretty useless for Cox PH models. It is difficult to decipher even a simple linear variable’s estimated effect from a table, and basically impossible to understand time interactions, interactions between variables, and nonlinear effects without the reader further calculating quantities of interest for a variety of fitted values.

So, I’ve been putting together the simPH R package to hopefully make it easier to show results from Cox PH models. The package uses plots of post-estimation simulations (the same idea behind the plotting facilities in the Zelig package) to show Cox PH estimates and the uncertainty surrounding them.

Here I want to give an overview of how to use simPH. First the general process, then a few examples. Have a look at this paper for details about the math behind the graphs.

### General Steps

There are three steps to use simPH:

1. Estimate a Cox PH model in the usual way with the coxph command in the survival package.

2. Simulate quantities of interest--hazard ratios, first differences, marginal effects, relative hazards, or hazard rates--with the appropriate simPH simulation command.

3. Plot the simulations with the simGG method.

### A Few Examples

Here are some basic examples that illustrate the process and key syntax. Before getting started you can install simPH in the normal R way from CRAN.
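In case it is useful, the standard CRAN installation is simply:

install.packages("simPH")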

#### Linear effects

Let’s look at a simple example with a linear non-interactive effect. The data set I’m using is from Carpenter (2002). It is included with simPH. See the package documentation for details.

First, let’s estimate a Cox PH model where the event of interest is a drug receiving FDA approval.

# Load packages
library(survival)
library(simPH)

# Load Carpenter (2002) data
data("CarpenterFdaData")

# Estimate survival model
M1 <- coxph(Surv(acttime, censor) ~ prevgenx + lethal +
            deathrt1 + acutediz + hosp01 + hhosleng +
            mandiz01 + femdiz01 + peddiz01 + orphdum +
            vandavg3 + wpnoavg3 + condavg3 + orderent +
            stafcder, data = CarpenterFdaData)



Now say we want to examine the effect that the number of FDA staff reviewing a drug application has on its chance of being accepted. This variable is called stafcder in the model we just estimated. To do this, let’s use simPH to simulate hazard ratios for units $$j$$ and $$l$$, i.e. $$\frac{h_{j}(t)}{h_{l}(t)}$$, with simPH’s coxsimLinear command, because we estimated the effect of the number of staff as linear. In the following code we use the Xj argument to set the $$j$$ values. We could also use Xl to set the $$x_{l}$$ values, but since we don’t, coxsimLinear assumes they are all 0.

Notice the argument spin = TRUE. This finds the shortest 95% probability interval of the simulations using the SPIn method. SPIn can be especially useful for showing simulated quantities of interest generated from Cox PH models, because they can often be crowded close to a lower boundary (0 in the case of hazard rates). We should be more interested in the area with the highest probability–the most simulated values–rather than an arbitrary central interval. That being said, if spin = FALSE, then we simply find the central 95% interval of the simulations.

# Simulate and plot Hazard Ratios for stafcder variable
Sim1 <- coxsimLinear(M1, b = "stafcder",
                     qi = "Hazard Ratio", ci = 0.95,
                     Xj = seq(1237, 1600, by = 2), spin = TRUE)
# Plot
simGG(Sim1)


Notice in the plot that each simulation is plotted as an individual point. These are all of the simulations in the shortest 95% probability interval. Each point has a bit of transparency (they are 90% transparent by default). So the plot is visually weighted; the darker areas of the graph have a higher concentration of the simulations. This gives us a very clear picture of the simulation distribution, i.e. the estimated effect and our uncertainty about it.

If you don’t want to plot every point, you can instead use ribbons showing the 95% interval and the middle 50% of that interval. To do this, simply use the ribbons = TRUE argument.

simGG(Sim1, ribbons = TRUE, alpha = 0.5)


Notice the alpha = 0.5 argument. This increased the transparency of the widest ribbon to 50%.

The syntax we’ve used here is very similar to what we use when we are working with nonlinear effects estimated with polynomials and splines. Post-estimation simulations can be run with the coxsimPoly and coxsimSpline commands. See the simPH documentation for more examples.

#### Interactions

The syntax for two-way interactions simulated with the coxsimInteract command is a little different. Using the same data, let’s look at how to show results for interactions. In the following model we are interacting two variables lethal and prevgenx. We can think of these variables as X1 and X2, respectively. For interactions it can be useful to examine the marginal effect. To find the marginal effect of a one unit increase in the lethal variable given various values of prevgenx let’s use the following code:

# Estimate the model
M2 <- coxph(Surv(acttime, censor) ~ lethal*prevgenx, data = CarpenterFdaData)

# Simulate Marginal Effect of lethal for multiple values of prevgenx
Sim2 <- coxsimInteract(M2, b1 = "lethal", b2 = "prevgenx",
                       qi = "Marginal Effect",
                       X2 = seq(2, 115, by = 2), nsim = 1000)

# Plot the results
simGG(Sim2, ribbons = TRUE, alpha = 0.5, xlab = "\nprevgenx",
      ylab = "Marginal effect of lethal\n")


The order of the X1 and X2 variables in the interactions matters. The marginal effect is always calculated for the X1 variable over a range of X2 values.

Notice also that we set the x and y axis labels with the xlab and ylab arguments.

#### Time-varying Effects

Finally, let’s look at how to use the coxsimtvc command to show results from effects that we estimate to vary over time. Here we are going to use another data set that is also included with simPH. The event of interest in the following model is when deliberation is stopped on a European Union directive (the model is from Licht (2011)). We will create hazard ratios for the effect that the number of backlogged items (backlog) has on deliberation time. We will estimate the effect as a log-time interaction.

# Load Golub & Steunenberg (2007) data. The data is included with simPH.
data("GolubEUPData")

# Create natural log-time interactions
Golubtvc <- function(x){
    tvc(data = GolubEUPData, b = x, tvar = "end", tfun = "log")
}

GolubEUPData$Lcoop <- Golubtvc("coop")
GolubEUPData$Lqmv <- Golubtvc("qmv")
GolubEUPData$Lbacklog <- Golubtvc("backlog")
GolubEUPData$Lcodec <- Golubtvc("codec")
GolubEUPData$Lqmvpostsea <- Golubtvc("qmvpostsea")
GolubEUPData$Lthatcher <- Golubtvc("thatcher")

# Estimate model
M1 <- coxph(Surv(begin, end, event) ~ qmv + qmvpostsea + qmvpostteu +
            coop + codec + eu9 + eu10 + eu12 + eu15 + thatcher +
            agenda + backlog + Lqmv + Lqmvpostsea + Lcoop + Lcodec +
            Lthatcher + Lbacklog,
            data = GolubEUPData, ties = "efron")



Much of the first half of the code is dedicated to creating the log-time interactions with the tvc command.

Now we simply create the simulations for a range of values of backlog and plot them. Note in the following code that we give coxsimtvc the name of the base backlog variable with the b argument and the name of its log-time interaction term Lbacklog with the btvc argument. We also need to tell coxsimtvc that it is a log-time interaction with tfun = "log".

# Create simtvc object for relative hazard
Sim2 <- coxsimtvc(obj = M1, b = "backlog", btvc = "Lbacklog",
                  qi = "Relative Hazard", Xj = seq(40, 200, 40),
                  tfun = "log", from = 1200, to = 2000, by = 10,
                  nsim = 500)

# Create relative hazard plot
simGG(Sim2, xlab = "\nTime in Days", ribbons = TRUE,
      leg.name = "Backlogged \n Items", alpha = 0.2)



#### simGG Plots are ggplot2

Finally, because almost every plot created by simGG is a ggplot2 plot, you can use almost the full range of customisation options that that package allows. See the ggplot2 documentation for many examples.
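For instance, here is a minimal sketch (assuming, as described above, that simGG returns a standard ggplot2 object) that adds a theme and a title to the earlier hazard ratio plot:

library(ggplot2)

# Add a black-and-white theme and a title to the simGG plot
simGG(Sim1, ribbons = TRUE, alpha = 0.5) +
    theme_bw() +
    ggtitle("Simulated hazard ratios for stafcder")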

## 23 August 2013

### GitHub renders CSV in the browser, becomes even better for social data set creation

I've written in a number of places about how GitHub can be a great place to store data. Unlike basically all other web data storage sites (many of which I really like such as Dataverse and FigShare) GitHub enables deep social data set development and fits nicely into a reproducible research workflow with R.

One negative though, especially compared to FigShare, was that there was no easy way to view CSV or TSV data files in the browser. Unless you downloaded the data and opened it in Excel or an R viewer or whatever, you had to look at the raw data file in the browser. It's basically impossible to make sense of a data set of any size like this.

However, as of today at the latest, GitHub renders these data sets in the browser as you would expect. Take a look at their blog post for the details.

## 16 July 2013

### Getting Started with Reproducible Research: A chapter from my new book

This is an abridged excerpt from Chapter 2 of my new book Reproducible Research with R and RStudio. It's published by Chapman & Hall/CRC Press.

You can purchase it on Amazon. "Search inside this book" includes a complete table of contents.

Researchers often start thinking about making their work reproducible near the end of the research process, when they write up their results, or maybe even later when a journal requires their data and code be made available for publication. Or maybe even later, when another researcher asks if they can use the data from a published article to reproduce the findings. By then there may be numerous versions of the data set and records of the analyses stored across multiple folders on the researcher’s computers. It can be difficult and time consuming to sift through these files to create an accurate account of how the results were reached. Waiting until near the end of the research process to start thinking about reproducibility can lead to incomplete documentation that does not give an accurate account of how findings were made. Focusing on reproducibility from the beginning of the process and continuing to follow a few simple guidelines throughout your research can help you avoid these problems. Remember "reproducibility is not an afterthought–it is something that must be built into the project from the beginning" (Donoho, 2010, 386).

This chapter first gives you a brief overview of the reproducible research process: a workflow for reproducible research. Then it covers some of the key guidelines that can help make your research more reproducible.

# The Big Picture: A Workflow for Reproducible Research

The three basic stages of a typical computational empirical research project are:

• data gathering,
• data analysis,
• results presentation.

Instead of starting to use the individual tools of reproducible research as soon as you learn them I recommend briefly stepping back and considering how the stages of reproducible research tie together overall. This will make your workflow more coherent from the beginning and save you a lot of backtracking later on. The figure below illustrates the workflow. Notice that most of the arrows connecting the workflow’s parts point in both directions, indicating that you should always be thinking how to make it easier to go backwards through your research, i.e. reproduce it, as well as forwards.

### Example Workflow & A Selection of Commands to Tie it Together

Around the edges of the figure are some of the commands that make it easier to go forwards and backwards through the process. These commands tie your research together. For example, you can use API-based R packages to gather data from the internet. You can use R’s merge command to combine data gathered from different sources into one data set. The getURL command from R’s RCurl package and the read.table command can be used to bring this data set into your statistical analyses. The knitr package then ties your analyses into your presentation documents. This includes the code you used, the figures you created, and, with the help of tools such as the xtable package, tables of results. You can even tie multiple presentation documents together. For example, you can access the same figure for use in a LaTeX article and a Markdown-created website with the includegraphics and ![]() commands, respectively. This helps you maintain a consistent presentation of results across multiple document types.
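As a small illustration of this kind of tie, here is a sketch of pulling a remote data file straight into an R analysis (the URL is just a placeholder):

# Load package for downloading files over HTTPS
library(RCurl)

# Download the raw CSV text from a (hypothetical) URL, then read it into a data frame
url <- "https://raw.githubusercontent.com/SomeUser/SomeRepo/master/MainData.csv"
RawData <- getURL(url)
MainData <- read.csv(textConnection(RawData))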

# Practical Tips for Reproducible Research

Before learning the details of the reproducible research workflow with R and RStudio, it’s useful to cover a few broad tips that will help you organize your research process and put these skills in perspective. The tips are:

1. Document everything!,
2. Everything is a (text) file,
3. All files should be human readable,
4. Explicitly tie your files together,
5. Have a plan to organize, store, and make your files available.

Using these tips will help make your computational research really reproducible.

## Document everything!

In order to reproduce your research others must be able to know what you did. You have to tell them what you did by documenting as much of your research process as possible. Ideally, you should tell your readers how you gathered your data, analyzed it, and presented the results. Documenting everything is the key to reproducible research and lies behind all of the other tips here.

## Everything is a (text) file

Your documentation is stored in files that include data, analysis code, the write up of results, and explanations of these files (e.g. data set codebooks, session info files, and so on). Ideally, you should use the simplest file format possible to store this information. Usually the simplest file format is the humble, but versatile, text file.

Text files are extremely nimble. They can hold your data in, for example, comma-separated values (.csv) format. They can contain your analysis code in .R files. And they can be the basis for your presentations as markup documents like .tex or .md, for LaTeX and Markdown files, respectively. All of these files can be opened by any program that can read text files.

One reason reproducible research is best stored in text files is that this helps future proof your research. Other file formats, like those used by Microsoft Word (.docx) or Excel (.xlsx), change regularly and may not be compatible with future versions of these programs. Text files, on the other hand, can be opened by a very wide range of currently existing programs and, more likely than not, future ones as well. Even if future researchers do not have R or a LaTeX distribution, they will still be able to open your text files and, aided by frequent comments (see below), be able to understand how you conducted your research (Bowers, 2011, 3).

Text files are also very easy to search and manipulate with a wide range of programs–such as R and RStudio–that can find and replace text characters as well as merge and separate files. Finally, text files are easy to version and changes can be tracked using programs such as Git.

## All files should be human readable

Treat all of your research files as if someone who has not worked on the project will, in the future, try to understand them. Computer code is a way of communicating with the computer. It is 'machine readable' in that the computer is able to use it to understand what you want to do. However, there is a very good chance that other people (or you six months in the future) will not understand what you were telling the computer. So, you need to make all of your files 'human readable'. To make them human readable, you should comment on your code with the goal of communicating its design and purpose (Wilson et al., 2012). With this in mind it is a good idea to comment frequently (Bowers, 2011, 3) and format your code using a style guide (Nagler, 1995). For especially important pieces of code you should use literate programming–where the source code and the presentation text describing its design and purpose appear in the same document (Knuth, 1992). Doing this will make it very clear to others how you accomplished a piece of research.

## Explicitly tie your files together

If everything is just a text file then research projects can be thought of as individual text files that have a relationship with one another. They are tied together. A data file is used as input for an analysis file. The results of an analysis are shown and discussed in a markup file that is used to create a PDF document. Researchers often do not explicitly document the relationships between files that they used in their research. For example, the results of an analysis–a table or figure–may be copied and pasted into a presentation document. It can be very difficult for future researchers to trace the table or figure back to a particular statistical model and a particular data set without clear documentation. Therefore, it is important to make the links between your files explicit.

Tie commands are the most dynamic way to explicitly link your files together (see the figure above for examples). These commands instruct the computer program you are using to use information from another file.

## Have a plan to organize, store, & make your files available

Finally, in order for independent researchers to reproduce your work they need to be able access the files that instruct them how to do this. Files also need to be organized so that independent researchers can figure out how they fit together. So, from the beginning of your research process you should have a plan for organizing your files and a way to make them accessible.

One rule of thumb for organizing your research in files is to limit the amount of content any one file has. Files that contain many different operations can be very difficult to navigate, even if they have detailed comments. For example, it would be very difficult to find any particular operation in a file that contained the code used to gather the data, run all of the statistical models, and create the results figures and tables. If you have a hard time finding things in a file you created, think of the difficulties independent researchers will have!

Because we have so many ways to link files together there is really no need to lump many different operations into one file. So, we can make our files modular. One source code file should be used to complete one or just a few tasks. Breaking your operations into discrete parts will also make it easier for you and others to find errors (Nagler, 1995, 490).

## 9 June 2013

### Quick and Simple D3 Network Graphs from R

Sometimes I just want to quickly make a simple D3 JavaScript directed network graph with data in R. Because D3 network graphs can be manipulated in the browser–i.e. nodes can be moved around and highlighted–they're really nice for data exploration. They're also really nice in HTML presentations. So I put together a bare-bones function–called d3SimpleNetwork–for turning an R data frame into a D3 network graph.

# Arguments

By bare-bones I mean that, other than the arguments indicating the Data data frame and the Source and Target variables, it only has three arguments: height, width, and file.

The data frame you use should have two columns that contain the source and target variables. Here's an example using fake data:

Source <- c("A", "A", "A", "A", "B", "B", "C", "C", "D")
Target <- c("B", "C", "D", "J", "E", "F", "G", "H", "I")
NetworkData <- data.frame(Source, Target)


The height and width arguments obviously set the graph's frame height and width. You can give file the file name to output the graph to. This will create a standalone webpage. If you leave file as NULL, then the graph will be printed to the console. This can be useful if you are creating a document using knitr Markdown or (similarly) slidify. Just set the code chunk option results='asis' and the graph will be rendered in the document.
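For example, here is a quick sketch of writing the graph to a standalone web page once the function is loaded (see the next section); the file name is only an illustration:

d3SimpleNetwork(NetworkData, height = 300, width = 700,
                file = "ExampleNetwork.html")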

# Example

Here's a simple example. First load d3SimpleNetwork:

# Load packages to download d3SimpleNetwork
library(digest)
library(devtools)

source_gist("5734624")


Now just run the function with the example NetworkData from before:

d3SimpleNetwork(NetworkData, height = 300, width = 700)


Click here for the fully manipulable version. If you click on individual nodes they will change colour and become easier to see. In the future I might add more customisability, but I kind of like the function's current simplicity.

Update 12 June 2013: The original d3SimpleNetwork command discussed here doesn't work easily with slidify. I have created a new d3Network R package that does work well with slidify (and other knitr-created HTML slideshows). Use its d3Network command and set the argument iframe = TRUE.

## 21 May 2013

### Slide: one function for lag/lead variables in data frames, including time-series cross-sectional data

I often want to quickly create a lag or lead variable in an R data frame. Sometimes I also want to create the lag or lead variable for different groups in a data frame, for example, if I want to lag GDP for each country in a data frame.

I've found the various R methods for doing this hard to remember and usually need to look at old blog posts. Any time we find ourselves using the same code over and over, it's probably time to put it into a function.

So, I added a new command–slide–to the DataCombine R package (v0.1.5).

Building on the shift function TszKin Julian posted on his blog, slide allows you to slide a variable up by any time unit to create a lead or down to create a lag. It returns the lag/lead variable to a new column in your data frame. It works with both data that has one observed unit and with time-series cross-sectional data.

Note: your data needs to be in ascending time order with equally spaced time increments. For example 1995, 1996, 1997.

## Examples

### Not Cross-sectional data

Let's create an example data set with three variables:

# Create time variable
Year <- 1980:1999

# Dummy covariates
A <- B <- 1:20

Data1 <- data.frame(Year, A, B)

head(Data1)

##   Year A B
## 1 1980 1 1
## 2 1981 2 2
## 3 1982 3 3
## 4 1983 4 4
## 5 1984 5 5
## 6 1985 6 6


Now let's lag the A variable by one time unit.

library(DataCombine)

DataSlid1 <- slide(Data1, Var = "A", slideBy = -1)

head(DataSlid1)

##   Year A B A-1
## 1 1980 1 1  NA
## 2 1981 2 2   1
## 3 1982 3 3   2
## 4 1983 4 4   3
## 5 1984 5 5   4
## 6 1985 6 6   5


The lag variable is automatically given the name A-1.

To lag a variable (i.e. the lag value at a given time is the value of the non-lagged variable at a time in the past), set the slideBy argument to a negative number. Lead variables are created by using positive numbers in slideBy. Lead variables at a given time have the value of the non-lead variable from some time in the future.

### Time-series Cross-sectional data

Now let's use slide to create a lead variable with time-series cross-sectional data. First create the example data:

# Create time and unit ID variables
Year <- rep(1980:1983, 5)
ID <- sort(rep(seq(1:5), 4))

# Dummy covariates
A <- B <- 1:20

Data2 <- data.frame(Year, ID, A, B)

head(Data2)

##   Year ID A B
## 1 1980  1 1 1
## 2 1981  1 2 2
## 3 1982  1 3 3
## 4 1983  1 4 4
## 5 1980  2 5 5
## 6 1981  2 6 6


Now let's create a two time unit lead variable based on B for each unit identified by ID:

DataSlid2 <- slide(Data2, Var = "B", GroupVar = "ID",
                   slideBy = 2)

head(DataSlid2)

##   Year ID A B B2
## 1 1980  1 1 1  3
## 2 1981  1 2 2  4
## 3 1982  1 3 3 NA
## 4 1983  1 4 4 NA
## 5 1980  2 5 5  7
## 6 1981  2 6 6  8


Hopefully you'll find slide useful in your own data analysis. Any suggestions for improvement are always welcome.

## 17 April 2013

### Reinhart & Rogoff: Everyone makes coding mistakes, we need to make it easy to find them + Graphing uncertainty

You may have already seen a lot written on the replication of Reinhart & Rogoff’s (R & R) much cited 2010 paper done by Herndon, Ash, and Pollin. If you haven’t, here is a round-up of some of what has been written: Konczal, Yglesias, Krugman, Cowen, Peng, FT Alphaville.

This is an interesting issue for me because it involves three topics I really like: political economy, reproducibility, and communicating uncertainty. Others have already commented on these topics in detail. I just wanted to add to this discussion by (a) talking about how this event highlights a real need for researchers to use systems that make finding and correcting mistakes easy, (b) incentivising mistake finding/correction rather than penalising it, and (c) showing uncertainty.

## Systems for Finding and Correcting Mistakes

One of the problems Herndon, Ash, and Pollin found in R & R’s analysis was an Excel coding error. I love to hate on Excel as much as the next R devotee, but I think that is missing the point. The real lesson is not "don't use Excel"; the real lesson is that we all make mistakes.

(Important point: I refer throughout this post to errors caused by coding mistakes rather than purposeful fabrications and falsifications.)

Coding mistakes are an ever present part of our work. The problem is not that we make coding mistakes. Despite our best efforts we always will. The problem is that we often use tools and practices that make it difficult to find and correct our mistakes.

This is where I can get in some Excel hating: tools and practices that make it difficult to find mistakes include binary files (like Excel's) that can't be version controlled in a way that fully reveals the research process; not commenting code; not making your data readily available in formats that make replication easy; and not having a system for quickly fixing mistakes when they are found. Sorry R users, but the last three are definitely not exclusive to Excel.

It took Herndon, Ash, and Pollin a considerable amount of time to replicate R & R’s findings and therefore find the Excel error. This seems partially because R & R did not make their analysis files readily available (Herndon, Ash, and Pollin had to ask for them). I’m not sure how this error is going to be corrected and documented. But I imagine it will be like most research corrections: kind of on the fly, mostly emailing and reposting.

How big of a deal is this? There is some debate over how big of a problem this mistake is. Roger Peng ends his really nice post:

The vibe on the Internets seems to be that if only this problem had been identified sooner, the world would be a better place. But my cynical mind says, uh, no. You can toss this incident in the very large bucket of papers with some technical errors that are easily fixed. Thankfully, someone found these errors and fixed them, and that’s a good thing. Science moves on.

I agree with most of this paragraph. But, given how important R & R’s finding was to major policy debates it would have been much better if the mistake was caught sooner rather than later. The tools and practices R & R used made it harder to find and correct the mistake, so policymakers were operating with less accurate information for longer.

Solutions: I’ve written in some detail in the most recent issue of The Political Methodologist about how cloud-based version control systems like GitHub can be used to make finding and correcting mistakes easier. Pull requests, for example, are a really nice way to directly suggest corrections.

## Incentivising Error Finding and Correction

Going forward I think it will be interesting to see how this incident shapes researchers’ perceived incentives to make their work easily replicable. Replication is an important part of finding the mistakes that everyone makes. If being found to have made a coding mistake (not a fabrication) has a negative impact on your academic career, then there are incentives to make finding mistakes difficult, for example by making replication difficult. Most papers do not receive nearly as much attention as R & R’s. So, for most researchers, making replication difficult makes it pretty unlikely that anyone will replicate your research and you’ll be home free.

This is a perverse incentive indeed.

What can we do? Many journals now require replicable code to accompany published articles. This is a good incentive. Maybe we should go further and somehow directly incentivise the finding and correction of errors in data sets and analysis code. Ideas could include giving more weight to replication studies at hiring and promotion committees. Maybe even allowing these committees to include information on researchers’ GitHub pull requests that meaningfully improve others’ work by correcting mistakes.

This of course might create a future perverse incentive to add errors so that they can then be found. I think this is a bit fanciful. There are surely enough negative social incentives (i.e. embarrassment) surrounding making mistakes to prevent this.

## Showing Uncertainty

Roger Peng’s post highlighted the issue of graphing uncertainty, but I just wanted to build it out a little further. The interpretation of the correlation R & R found between GDP Growth and Government Debt could have been tempered significantly, before any mistakes were found, by more directly communicating their original uncertainty. In their original paper, they presented the relationship using bar graphs of average and median GDP growth per grouped debt/GDP level:

Beyond showing the mean and median, there is basically no indication of the distribution of the data they are drawn from.

Herndon, Ash, and Pollin put together some nice graphs of these distributions (and avoid that thing economists do of using two vertical axes with two different meanings).

Here is one that gets rid of the groups altogether:

If R & R had shown a simple scatter plot like this (though they did exclude some of the higher GDP growth country-years at the high debt end, so theirs would have looked different), it would have been much more difficult to over-interpret the substantive–policy–value of a correlation between GDP growth and debt/GDP.

Maybe this wouldn’t have actually changed the policy debate that much. As Mark Blyth argues in his recent book on austerity, "facts never disconfirm a good ideology" (p. 18). But at least Paul Krugman might not have had to debate debt/GDP cutoff points on CNBC (for example at time point 12:40):

P.S. To R & R’s credit, they do often make their data available. Their data has been useful for at least one of my papers. However, it is often available in a format that is hard to use for cross-country statistical analysis, including, I would imagine, their own. Though I have never found any errors in the data, reporting and implementing corrections to this data would be piecemeal at best.