Christopher Gandrud (간드루드 크리스토파)

<h1>Posting Elsewhere (2017-02-27)</h1>
<p>FYI: most of my blog-type writing on software development and political economy is now on <a href="https://medium.com/@chrisgandrud">Medium</a> or at <a href="http://bruegel.org/author/christopher-gandrud/">Bruegel</a> (only political economy at Bruegel).</p>

<h1>End of 2015 Blog Roundup (2015-12-30)</h1>
<p>Over the past few months I've mostly been blogging at a number of other venues. These include:</p>
<ul>
<li><p>A piece with Mark Hallerberg in <a href="http://www.democraticaudit.com/?p=16085">Democracy Audit UK</a> summarising <a href="http://rap.sagepub.com/content/2/3/2053168015589335">our research</a> showing that, despite previous findings, democratic governments run bank bailout tabs about as large as those of autocracies. This wasn't noticed in previous work because democratic governments have an incentive (the possibility of losing elections) to shift the realisation of these costs into the future.</p>
</li>
<li><p>A post over at <a href="http://bruegel.org/2015/12/financial-regulatory-transparency-new-data-and-implications-for-eu-policy">Bruegel</a> introducing the Financial Supervisory Transparency Index that Mark Copelovitch, Mark Hallerberg, and I created. We also discuss supervisory transparency's implications for a European capital markets union.</p>
</li>
<li><p>At <a href="http://voxukraine.org/2015/12/24/causes-and-possible-solutions-to-brawling-in-the-ukrainian-parliament-en-2/">VoxUkraine, I discuss</a> the causes and possible solutions to brawling in the Ukrainian parliament based on my <a href="http://jpr.sagepub.com/content/early/2015/11/02/0022343315604707.abstract">recent research in the <em>Journal of Peace Research</em></a>.</p>
</li>
<li><p>I didn't write this one, but my co-author Tom Pepinsky <a href="http://tompepinsky.com/2015/12/21/another-reason-why-predicting-financial-crises-is-hard/">wrote a nice piece</a> about a new working paper we have on the difficulty of predicting financial crises.</p>
</li>
</ul>

<h1>More Corrections to the DPI’s yrcurnt Election Timing Variable: OECD Edition (2015-06-20)</h1>
<p>Previously on The Political Methodologist, I <a href="http://thepoliticalmethodologist.com/2015/03/03/corrections-and-refinements-to-the-database-of-political-institutions-yrcurnt-election-timing-variable/">posted</a> updates to the <a href="http://go.worldbank.org/2EAGGLRZ40">Database of Political Institutions'</a> election timing variable: <code>yrcurnt</code>. That set of corrections was only for the current 28 EU member states.</p>
<p>I’ve now expanded the corrections to include most other OECD countries.<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> Again, there were many missing elections:</p>
<div id="change-list" class="section level2">
<h2>Change list</h2>
<table>
<thead>
<tr class="header">
<th align="left">Country</th>
<th align="left">Changes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">Australia</td>
<td align="left">Corrects missing 1998 election year.</td>
</tr>
<tr class="even">
<td align="left">Canada</td>
<td align="left">Corrects missing 2000, 2006, 2008, 2011 election years.</td>
</tr>
<tr class="odd">
<td align="left">Iceland</td>
<td align="left">Corrects missing 2009 election year.</td>
</tr>
<tr class="even">
<td align="left">Ireland</td>
<td align="left">Corrects missing 2011 election.</td>
</tr>
<tr class="odd">
<td align="left">Japan</td>
<td align="left">Corrects missing 2005 and 2012 elections. Corrects misclassification of the 2003 and 2009 elections, which were originally erroneously labeled as being in 2004 and 2008, respectively.</td>
</tr>
</tbody>
</table>
</div>
<div id="import-into-r" class="section level2">
<h2>Import into R</h2>
<p>To import the most recent corrected version of the data into R simply use:</p>
<pre class="r"><code>election_time <- rio::import('https://raw.githubusercontent.com/christophergandrud/yrcurnt_corrected/master/data/yrcurnt_original_corrected.csv')</code></pre>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Australia, Canada, Iceland, Israel, Japan, South Korea, New Zealand, Norway, Switzerland, USA<a href="#fnref1">↩</a></p></li>
</ol>
</div>

<h1>A Link Between topicmodels LDA and LDAvis (2015-05-08)</h1>
<p>Carson Sievert and Kenny Shirley have put together the really nice <a href="https://github.com/cpsievert/LDAvis">LDAvis</a> R package. It provides a Shiny-based interactive interface for exploring the output from Latent Dirichlet Allocation topic models. If you've never used it, I highly recommend checking out their <a href="http://cpsievert.github.io/xkcd/">XKCD example</a> (this <a href="http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf">paper</a> also has some nice background).</p>
<p>LDAvis doesn't fit topic models, it just visualises the output. As such it is agnostic about what package you use to fit your LDA topic model. They have a useful <a href="http://cpsievert.github.io/LDAvis/reviews/reviews.html">example</a> of how to use output from the <a href="http://cran.r-project.org/web/packages/lda/lda.pdf">lda</a> package.</p>
<p>I wanted to use LDAvis with output from the <a href="http://cran.r-project.org/web/packages/topicmodels/index.html">topicmodels</a> package. It works really nicely with texts preprocessed using the <a href="http://cran.r-project.org/web/packages/tm/index.html">tm</a> package. The trick is extracting the information LDAvis requires from the model and placing it into a specifically structured JSON formatted object.</p>
<p>To make the conversion from topicmodels output to LDAvis JSON input easier, I created a linking function called <code>topicmodels_json_ldavis</code>. The full function is below. To use it follow these steps:</p>
<ol>
<li><p>Create a <code>VCorpus</code> object using the tm package's <code>Corpus</code> function.</p>
</li>
<li><p>Convert this to a document term matrix using <code>DocumentTermMatrix</code>, also from tm.</p>
</li>
<li><p>Run your model using topicmodels' <code>LDA</code> function.</p>
</li>
<li><p>Convert the output into JSON format using <code>topicmodels_json_ldavis</code>. The function requires the output from steps 1-3.</p>
</li>
<li><p>Visualise with LDAvis' <code>serVis</code>.</p>
</li>
</ol>
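<p>Putting these steps together, here is a minimal sketch of the whole workflow. The object names and the number of topics are hypothetical, and the call to <code>topicmodels_json_ldavis</code> assumes it is passed the fitted model, the corpus, and the document term matrix from steps 1-3 (the full function appears just below):</p>
<pre><code class="r">library(tm)
library(topicmodels)
library(LDAvis)

# doc_vector is a hypothetical character vector of raw texts
corpus <- Corpus(VectorSource(doc_vector))             # step 1: create the corpus
dtm <- DocumentTermMatrix(corpus)                      # step 2: document term matrix
fitted <- LDA(dtm, k = 20)                             # step 3: fit the topic model (k is illustrative)
json <- topicmodels_json_ldavis(fitted, corpus, dtm)   # step 4: convert to LDAvis JSON
serVis(json)                                           # step 5: visualise with LDAvis
</code></pre>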
<script src="https://gist.github.com/christophergandrud/00e7451c16439421b24a.js"></script>

<h1>Simulated or Real: What type of data should we use when teaching social science statistics? (2014-12-16)</h1>
<p>I just finished teaching a new course on collaborative data science to social science students. The materials are <a href="https://github.com/HertieDataScience2014/SyllabusAndLectures">on GitHub</a> if you're interested.</p>
<h2 id="what-did-we-do-and-why-">What did we do and why?</h2>
<p>Maybe the most unusual thing about this class from a statistics pedagogy perspective was that it was entirely focused on real world data: data that the students gathered themselves. I gave them virtually no instruction on what data to gather. They gathered data they felt would help them answer their research questions.</p>
<p>Students directly confronted the data warts that usually consume a large proportion of researchers' actual time. My intention was that the students systematically learn <a href="http://vita.had.co.nz/papers/tidy-data.pdf">tools</a> and <a href="http://biostatistics.oxfordjournals.org/content/10/3/405.full">best practices</a> for how to address these warts.</p>
<p>This is in contrast to many social scientists' statistics education. Typically, students are presented with pre-arranged data. They are then asked to perform some statistical function with it. The end.</p>
<p>This leaves students underprepared for actually using statistics in an undirected project (their thesis, a job). Typically, when confronted with data gathering and transformation issues in the real world, most muddle through, piecing together ad hoc techniques as they go along in a decidedly inefficient manner and often with <a href="http://christophergandrud.blogspot.de/2014/10/do-political-scientists-care-about.html">poor results</a>. A fair number of students will become frustrated and may never actually succeed in using any of the statistical tools they did learn.</p>
<h2 id="what-kind-of-data-">What kind of data?</h2>
<p>How does this course fit into a broader social science statistical education?</p>
<p>Zachary Jones had a <a href="http://zmjones.com/monte-carlo/">really nice post</a> the other day advocating that statistics courses use Monte Carlo simulation rather than real world data. The broad argument being that the messiness of real world data distracts students from carefully learning the statistical properties that instructors intend them to learn.</p>
<p>Superficially, it would seem that the course I just finished and Zachary's prescription are opposed. We could think of stats courses as using one of two different types of data:</p>
<p align="center"><strong>simulated</strong> --- <strong>real world</strong></p>
<h2 id="simulated-vs-real-">Simulated vs. Real?</h2>
<p>As you'll see, I almost entirely agree with Zachary's post, but I think there is a more important difference between the social science statistics course <em>status quo</em> and less commonly taught courses such as mine and (I think) the one Zachary is proposing. The difference is where the data comes from: is it gathered/generated by students or is it prepackaged by an instructor?</p>
<p>Many status quo courses use data that is prepackaged by instructors. Both simulated and real world data can be prepackaged. I suppose there are many motivations for this, but an important one surely is that it is easier to teach. As an instructor, you know what the results will be and you know the series of clicks or code that will generate this answer. There are no surprises. Students may also find prepackaged data comforting as they know that there is a correct answer out there. They just need to decode the series of clicks to get it.</p>
<p>Though prepackaged data is easier for instructors and students, it surely is counterproductive in terms of learning how to actually answer research questions with data analysis.</p>
<p>Students will not learn the skills needed to gather and transform real world data so that it can be analysed. Students who simply load a prepackaged data set of simulated values will often not understand where it came from. They can succumb to the temptation to just click through until they get the right answer.</p>
<p>On the other hand I've definitely had the experience teaching with student simulated data that Zachary describes:</p>
<blockquote>
<p>I think many students find [hypothesis testing] unintuitive and end up leaving with a foggy understanding of what tests do. With simulation I don't think it is so hard to explain since you can easily show confidence interval coverage, error rates, power, etc.</p>
</blockquote>
<p>The more important distinction in social science statistics education, for thinking about what is more or less effective, is:</p>
<p align="center"><strong>student gathered/generated</strong> --- <strong>instructor gathered/generated</strong></p>
<h2 id="prepackaged-vs-student-generated-data">Prepackaged vs. student generated data</h2>
<p>There is of course a pedagogical difference between data that students gather from the real world and data they simulate with a computer. Simulated data is useful for teaching the behaviour of statistical methods. Real world data is useful for teaching students how to plan and execute a project using these methods to answer research questions in a way that is reproducible and introduces fewer data munging biases into estimates. Though almost certainly too much to take on together in one course, both should be central to a well-rounded social science statistics education.</p>

<h1>Set up R/Stan on Amazon EC2 (2014-12-10)</h1>
<p>A few months ago I <a href="http://christophergandrud.blogspot.de/2014/06/simple-script-from-setting-up-r-git-and.html">posted the script</a>
that I use to set up my R/JAGS working environment on an Amazon EC2 instance.</p>
<p>Since then I've largely transitioned to using R/<a href="http://mc-stan.org/">Stan</a> to
estimate my models. So, I've updated my setup script (see below). </p>
<p>There are a few other changes:</p>
<ul>
<li><p>I don't install/use RStudio on Amazon EC2. Instead, I just use R from the terminal.
Don't get me wrong, I love RStudio. But since what I'm doing on EC2 is
just running simulations (I handle the results on my local machine), RStudio is
overkill. </p>
</li>
<li><p>I don't install git anymore. Instead I use <code>source_url</code> (from devtools) and <code>source_data</code> (from repmis) to
source scripts from GitHub (see the sketch after this list). Again, all of the manipulation I'm doing to these
scripts is on my local machine.</p>
</li>
</ul>
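<p>A minimal sketch of this pattern (the GitHub raw URLs here are purely hypothetical placeholders):</p>
<pre><code class="r">library(devtools)
library(repmis)

# Run an analysis script kept on GitHub (hypothetical raw URL)
source_url('https://raw.githubusercontent.com/username/repo/master/stan_model.R')

# Load a data set kept on GitHub (hypothetical raw URL)
main_df <- source_data('https://raw.githubusercontent.com/username/repo/master/main.csv')
</code></pre>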
<script src="https://gist.github.com/christophergandrud/c01760a65c03f7047df0.js"></script>

<h1>Our developing Financial Regulatory Transparency Index (2014-12-04)</h1>
<p>Here is a presentation I just gave on a work in progress. We are developing a new Financial Regulatory Transparency Index using a Bayesian IRT approach.</p>
<iframe src="//www.slideshare.net/slideshow/embed_code/42351226" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>

<h1>Do Political Scientists Care About Effect Sizes: Replication and Type M Errors (2014-10-13)</h1>
<p>Reproducibility has come a long way in political science. Many major journals now require that replication materials be made available either on their websites or on a service such as the <a href="http://thedata.org/">Dataverse Network</a>. Most of the top journals in political science have formally committed to reproducible research best practices by signing up to the <a href="http://www.dartstatement.org/">Data Access and Research Transparency (DA-RT) Joint Statement</a>.</p>
<p>This is certainly progress. But what are political scientists actually supposed to do with this new information? Data and code availability does help avoid effort duplication--researchers don't need to gather data or program statistical procedures that have already been gathered or programmed. It promotes <a href="http://books.google.com/books/about/Reproducible_Research_With_R_and_Rstudio.html?id=u-nuzKGvoZwC&redir_esc=y">better research habits</a>. It definitely provides "<a href="http://onlinelibrary.wiley.com/doi/10.1111/1468-5965.00220/abstract">procedural oversight</a>". We would be highly suspicious of results from authors who were unable or unwilling to produce their code/data.</p>
<p>However, there are lots of problems that data/code availability requirements do not address. Apart from a few journals like <a href="http://journals.cambridge.org/action/displayJournal?jid=RAM">Political Science Research and Methods</a>, most journals have no standing policy for checking the veracity of replication materials. Reviewers rarely have access to manuscripts' code/data. Even if they did have access to it, few reviewers would be willing or able to undertake the time-consuming task of reviewing this material.</p>
<h2 id="do-political-science-journals-care-about-coding-and-data-errors-">Do political science journals care about coding and data errors?</h2>
<p>What do we do if someone replicating published research finds clear data or coding errors that have biased the published estimates?</p>
<p>Note that I'm limiting the discussion here to honest mistakes, not active attempts to deceive. <a href="http://christophergandrud.blogspot.de/2013/04/reinhart-rogoff-everyone-makes-coding.html">We all make these mistakes</a>. To keep it simple, I'm also only talking about clear, knowable, and <a href="http://polmeth.wustl.edu/mediaDetail.php?docId=1534">non</a>-<a href="http://zmjones.com/replication-reproduction-prediction/">causal</a> coding and data errors.</p>
<p>Probably the most responsible action a journal could take when clear cut coding/data biased results have been found would be to <em>directly adjoin to the original article a note detailing the bias</em>. This way readers will always be aware of the correction and will have the best information possible. This is a more efficient way of getting out corrected information than relying on some probabilistic process where readers may or may not stumble upon the information posted elsewhere.</p>
<p>As far as I know, however, no political science journal has a written procedure (please correct me if I'm wrong) for dealing with this new information. My sense is that there are a series of ad hoc responses that closely correspond to how the bias affects the results:</p>
<h3 id="statistical-significance">Statistical significance</h3>
<p>The situation where a journal is most likely to do anything is when correcting the bias makes the results no longer statistically significant. This might get a journal to append a note to the original article. But maybe not; they could just ignore it.</p>
<h3 id="sign">Sign</h3>
<p>It might be that once the coding/data bias is corrected, the sign of an estimated effect flips--the result of what Andrew Gelman calls <a href="http://andrewgelman.com/2004/12/29/type_1_type_2_t/">Type S errors</a>. I really have no idea what a journal would do in this situation. They might append a note or maybe not.</p>
<h3 id="magnitude">Magnitude</h3>
<p>Perhaps the most likely outcome of correcting honest coding/data bias is that the effect size changes. These errors would be the result of Gelman's <strong>Type M</strong> errors. My sense (and experience) is that in a context where <a href="http://edr.sagepub.com/content/early/2014/07/23/0013189X14545513.abstract">novelty is greatly privileged over facts</a>, journal editors will almost certainly ignore this new information. It will be buried.</p>
<h2 id="do-political-scientists-care-about-effect-size-">Do political scientists care about effect size?</h2>
<p>Due to the complexity of what political scientists study, we rarely (perhaps with the exception of specific topics like election forecasting) think that we are very close to estimating a given effect's real magnitude. Most researchers are aiming for statistical significance and a sign that matches their theory.</p>
<p>Does this mean that we don't care about trying to estimate magnitudes as closely as possible?</p>
<p>Looking at political science practice pre-publication, there is a lot of evidence that we do care about Type M errors. Considerable effort is given to finding new estimation methods that produce less biased results. Questions of omitted variable bias are very common at research seminars and in journal reviews. Most researchers do carefully build their data sets and code to minimise coding/data bias. Many of these efforts are focused on the headline stuff--whether or not a given effect is significant and what the direction of the effect is. But these efforts are also part of a desire to estimate an effect as accurately as possible.</p>
<p>However, the review process and journals' responses to finding Type M errors caused by honest coding/data errors in published findings suggest that perhaps we don't care about effect size. Reviewers almost never look at code and data. Journals (as far as I know, please correct me if I'm wrong) never append information on replications that find Type M errors to original papers.</p>
<h2 id="prescription">Prescription</h2>
<p>I have a simple prescription for demonstrating that we actually care about estimating accurate effect sizes:</p>
<blockquote>
<p>Develop a standard practice of including a short authored write up of the data/code bias with corrected results in the original article's supplementary materials. Append a notice to the article pointing to this.</p>
</blockquote>
<p>Doing this would not only give readers more accurate effect size estimates, but also make replication materials more useful.</p>
<p>Standardising the practice of publishing authored notes will incentivise people to use replication materials, find errors, and publicly correct them.</p>

<h1>Note to self: brew cleanup r (2014-07-22)</h1>
<p>Note to self: after updating R with <a href="http://brew.sh/">Homebrew</a>, remember to <code>cleanup</code> old versions:</p>
<pre><code class='sh'>brew cleanup r
</code></pre>
Otherwise I'm liable to get a <a href="http://xkcd.com/371/">segfault</a>. (<a href="https://sagebionetworks.jira.com/browse/SYNR-590">see also</a>)

<h1>Simple script for setting up R, Git, and Jags on an Amazon EC2 Ubuntu Instance (2014-06-28)</h1>
<p>Just wanted to put up the script I've been using to create an Amazon EC2 Ubuntu instance for running RStudio, Git, and Jags. There isn't anything really new in here, but it has been serving me well.</p>
<p>The script begins after the basic instance has been set up in the Amazon EC2 console (<a href="http://blog.yhathq.com/posts/r-in-the-cloud-part-1.html">yhat</a> has a nice post on how to do this, though some of their screenshots are a little old). Just SSH into the instance and get started. </p>
<script src="https://gist.github.com/christophergandrud/67cf1261d2fc8aefbc6f.js"></script>

<h1>Slides for a simple introduction to Bayesian data analysis (2014-06-06)</h1>
<p>Here is the slide deck from a <a href="http://christophergandrud.github.io/BasicBayesianPresent/">simple introductory talk</a> I gave yesterday on Bayesian data analysis:</p>
<iframe src="http://christophergandrud.github.io/BasicBayesianPresent/" width="100%" height="500px"></iframe>

<h1>Updates to repmis: caching downloaded data and Excel data downloading (2014-05-11)</h1>
<p>Over the past few months I’ve added a few improvements to the <a href="http://christophergandrud.github.io/repmis/">repmis</a> R package (miscellaneous functions for reproducible research). I just want to briefly highlight two of them:</p>
<ul>
<li><p><strong>Caching</strong> downloaded data sets.</p></li>
<li><p><code>source_XlsxData</code> for downloading data in Excel formatted files.</p></li>
</ul>
<p>Both of these capabilities are in <strong>repmis</strong> version 0.2.9 and greater.</p>
<div id="caching" class="section level2">
<h2>Caching</h2>
<p>When working with data sourced directly from the internet, it can be time-consuming (and make the data host angry) to repeatedly download the data. So, <strong>repmis</strong>’s <code>source</code> functions (<code>source_data</code>, <code>source_DropboxData</code>, and <code>source_XlsxData</code>) can now <a href="http://en.wikipedia.org/wiki/Cache_(computing)">cache</a> a downloaded data set by setting the argument <code>cache = TRUE</code>. For example:</p>
<pre class="r"><code>DisData <- source_data("http://bit.ly/156oQ7a", cache = TRUE)</code></pre>
<p>When the function is run again, the data set at <a href="http://bit.ly/156oQ7a">http://bit.ly/156oQ7a</a> will be loaded locally, rather than downloaded.</p>
<p>To delete the cached data set, simply run the function again with the argument <code>clearCache = TRUE</code>.</p>
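<p>For example, a minimal sketch that simply re-runs the earlier call with <code>clearCache = TRUE</code> added:</p>
<pre class="r"><code>DisData <- source_data("http://bit.ly/156oQ7a", cache = TRUE, clearCache = TRUE)</code></pre>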
</div>
<div id="source_xlsxdata" class="section level2">
<h2><code>source_XlsxData</code></h2>
<p>I recently added the <code>source_XlsxData</code> function to download Excel data sets directly into R. This function works very similarly to the other <code>source</code> functions. There are two differences:</p>
<ul>
<li><p>You need to specify the <code>sheet</code> argument. This is either the name of one specific sheet in the downloaded Excel workbook or its number (e.g. the first sheet in the workbook would be <code>sheet = 1</code>).</p></li>
<li><p>You can pass other arguments to the <a href="http://www.inside-r.org/packages/cran/xlsx/docs/read.xlsx">read.xlsx</a> function from the <a href="http://cran.r-project.org/web/packages/xlsx/index.html">xlsx</a> package.</p></li>
</ul>
<p>Here’s a simple example:</p>
<pre class="r"><code>RRurl <- 'http://www.carmenreinhart.com/user_uploads/data/22_data.xls'
RRData <- source_XlsxData(url = RRurl, sheet = 2, startRow = 5)</code></pre>
<p><code>startRow = 5</code> basically drops the first 4 rows of the sheet.</p>
</div>

<h1>d3Network Plays Nice with Shiny Web Apps (2014-05-09)</h1>
<p>After some delay (and because of helpful prompting by Giles Heywood and code contributions by <a href="https://github.com/johndharrison">John Harrison</a>) <a href="http://christophergandrud.github.io/d3Network/">d3Network</a> now plays nicely with <a href="http://shiny.rstudio.com/">Shiny web apps</a>. This means you can fully integrate R/D3.js network graphs into your web apps.</p>
<p>Here is what one simple <a href="https://github.com/christophergandrud/d3ShinyExample">example</a> looks like:</p>
<div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-a5mZkkM9a6k/U2zn3KW77UI/AAAAAAAAG2E/I0PXqDX6db4/s1600/d3Network_Shiny_Example.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-a5mZkkM9a6k/U2zn3KW77UI/AAAAAAAAG2E/I0PXqDX6db4/s1600/d3Network_Shiny_Example.png" /></a></div>
<p>An explanation of the code is <a href="http://christophergandrud.github.io/d3Network/#shiny">here</a> and you can download the app and play with it using:</p>
<pre><code class="r">shiny::runGitHub('d3ShinyExample', 'christophergandrud')
</code></pre>

<h1>European Parliament Candidates Have a Unique Opportunity to Advocate for Banking Union Transparency and Accountability (2014-05-09)</h1>
<blockquote>This is reposted from the original on the <a href="http://www.hertie-school.org/blog/european-parliament-candidates-unique-opportunity-advocate-banking-union-transparency-accountability/">Hertie School of Governance European Elections blog</a>.</blockquote>
<p>The discussion of issues around the European Parliament Elections has been beating around the bush for quite some time now. Karlheinz Reif and Hermann Schmitt famously described European Elections as "<a href="http://is.muni.cz/el/1423/jaro2009/POL288/um/Reif_and_Schmitt.pdf">second-order elections</a>", in that they are secondary to national elections. A few weeks ago on this blog <a href="http://www.hertie-school.org/blog/issues-people/#more-561">Andrea Römmele and Yann Lorenz</a> argued that the current election cycle has been characterised by personality politics between candidates vying for the Commission presidency, rather than substantive issues.</p>
<p>However, the election campaigns could be an important opportunity for the public to <a href="http://www.theguardian.com/commentisfree/2013/jun/03/2014-european-elections-shape-eu">express their views on</a> and even learn more about one of the defining changes to the European Union since the introduction of the Euro: the European Banking Union.</p>
<p>Much of the framework for the Banking Union has been established in the past year after <a href="http://europeansting.com/2014/02/19/banking-union-ecofin-and-parliament-ready-to-compromise/">intense debate between the EU institutions</a>. A key component of the Union is that in November 2014, the European Central Bank (ECB) will become the <a href="http://www.ecb.europa.eu/ssm/html/index.en.html">primary regulator</a> for about 130 of the Euro area’s largest banks and will have the power to become the main supervisor of any other bank, should it deem this necessary to ensure ”high standards”.</p>
<p>A perennial complaint made against the EU is that it lacks <a href="http://www.ombudsman.europa.eu/activities/speech.faces/en/10959/html.bookmark">transparency and accountability</a>. While there are many causes of this (not least of which is <a href="http://ukcatalogue.oup.com/product/9780199665686.do">poor media coverage of EU policy-making</a>), the ECB’s activities in the Banking Union certainly are less than transparent according to the rules currently set out. As <a href="http://www.hertie-school.org/hallerberg/">Prof. Mark Hallerberg</a> and I document in a <a href="http://www.bruegel.org/publications/publication-detail/publication/807-supervisory-transparency-in-the-european-banking-union/">recent Bruegel Policy Note</a>, financial regulatory transparency in Europe, and especially in the Banking Union, is very lacking. Unlike in another large banking union – the United States, where detailed supervisory data is released every quarter – the ECB does not plan to regularly release any data on the individual banks it supervises.</p>
<p>This makes it difficult for citizens, especially informed watchdog groups, to independently evaluate the ECB’s supervisory effectiveness before it is too late, i.e. before there is another crisis.</p>
<p>The European Parliament has been somewhat successful in <a href="http://onlinelibrary.wiley.com/enhanced/doi/10.1111/jcms.12054/">improving the transparency and accountability</a> (paywall) of the ECB’s future supervisory activities. Unlike in the original proposal, the Parliament now has the power to scrutinise the ECB’s supervisory activities. It will nonetheless be constrained by strict confidentiality rules in its ability to freely access information and publish the information it does find.</p>
<p>In our paper, we also show how a lack of supervisory transparency is not exclusive to EU supervisors – the member state regulators, who will still directly oversee most banks, are in general similarly opaque. We found that only 11 (five in the Eurozone) out of 28 member states regularly release any supervisory data. Member state <a href="http://www.eba.europa.eu/supervisory-convergence/supervisory-disclosure/aggregate-statistical-data">reporting of basic aggregate supervisory data</a> to the European Banking Authority is also very inconsistent.</p>
<p>European Parliamentarians could use the increased attention that they receive during the election period to improve public awareness of the important role they have played in improving the transparency and accountability of new EU institutions. Perhaps, after the election, they could even use popular support that they may build for these activities during the election period to get stronger oversight capabilities and improve financial supervisory transparency in the European Banking Union.</p>

<h1>Numbering Subway Exits (2014-03-30)</h1>
<p>In a bit of an aside from what I usually work on, I've put together a small website with a simple purpose: advocating for subway station exits to be numbered. These are really handy for finding your way around and are common in East Asia. But I've never seen them in Western countries.</p>
<p>If you're interested check out <a href="http://christophergandrud.github.io/WayOut/">the site:</a>
<a href="http://christophergandrud.github.io/WayOut/"><div class="separator" style="clear: both; text-align: center;"><img border="0" src="http://3.bp.blogspot.com/-toYHNkMLzag/UzgfZk_duQI/AAAAAAAAGz8/ec9LWw3it_U/s400/Way_Out.png" /></div></a>

<h1>Programmatically download political science data with the psData package (2014-02-23)</h1>
<p>A lot of progress has been made on improving political scientists’ ability to access data ‘programmatically’, e.g. data can be downloaded with R source code. Packages such as <a href="http://cran.r-project.org/web/packages/WDI/index.html">WDI</a> for the World Bank Development Indicators and <a href="http://cran.r-project.org/web/packages/dvn/index.html">dvn</a> for many data sets stored on the <a href="http://thedata.org/">Dataverse Network</a> make it much easier for political scientists to use this data as part of a highly <a href="http://christophergandrud.blogspot.de/2013/07/getting-started-with-reproducible.html">integrated and reproducible workflow</a>.</p>
<p>There are nonetheless still many commonly used political science data sets that aren’t easily accessible to researchers. Recently, I’ve been using the <a href="http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/0,,contentMDK:20649465~pagePK:64214825~piPK:64214943~theSitePK:469382,00.html">Database of Political Institutions (DPI)</a>, <a href="http://www.systemicpeace.org/polity/polity4.htm">Polity IV</a> democracy indicators, and <a href="http://www.carmenreinhart.com/data/browse-by-topic/topics/7/">Reinhart and Rogoff’s (2010)</a> financial crisis occurrence data. All three of these data sets are freely available for download online. However, getting them, cleaning them up, and merging them together is kind of a pain. This is especially true for the Reinhart and Rogoff data, which is in 4 Excel files with over 70 individual sheets, one for each country’s data. </p>
<p>Also, I’ve been using variables that are combinations and/or transformations of indicators in regularly updated data sets, but which themselves aren’t regularly updated. In particular, <a href="http://www.nyu.edu/gsas/dept/politics/data/bdm2s2/Logic.htm">Bueno de Mesquita et al. (2003)</a> devised two variables that they called the ‘winset’ and the ‘selectorate’. These are basically specific combinations of data in DPI and Polity IV. However, the winset and selectorate variables haven’t been updated alongside the yearly updates of DPI and Polity IV. </p>
<p>There are two big problems here:</p>
<ol>
<li><p>A lot of time is wasted by political scientists (and their RAs) downloading, cleaning, and transforming these data sets for their own research.</p></li>
<li><p>There are many opportunities while doing this work to introduce errors. Imagine the errors that might be introduced and go unnoticed if a copy-and-paste approach is used to merge the 70 Reinhart and Rogoff Excel sheets. </p></li>
</ol>
<p>As a solution, I’ve been working on a new R package called <em><a href="https://github.com/christophergandrud/psData">psData</a></em>. This package includes functions that automate the gathering, cleaning, and creation of common political science data and variables. So far (February 2014) it gathers DPI, Polity IV, and Reinhart and Rogoff data, as well as creates winset and selectorate variables. Hopefully the package will save political scientists a lot of time and reduce the number of data management errors. </p>
<p>There certainly could be errors in the way <em>psData</em> gathers data. However, once spotted the errors could be easily reported on the package’s <a href="https://github.com/christophergandrud/psData/issues">Issues Page</a>. Once fixed, the correction will be spread to all users via a package update.</p>
<h2 id="typesoffunctions">Types of functions</h2>
<p>There are two basic types of functions in <em>psData</em>: <strong>Getters</strong> and <strong>Variable Builders</strong>. Getter functions automate the gathering and cleaning of particular data sets so that they can easily be merged with other data. They do not transform the underlying data. Variable Builders use Getters to gather data and then transform it into new variables suggested by the political science literature.</p>
<h2 id="examples">Examples</h2>
<p>To download only the <strong>polity2</strong> variable from <a href="http://www.systemicpeace.org/polity/polity4.htm">Polity IV</a>:</p>
<pre><code class="r"># Load package
library(psData)
# Download polity2 variable
PolityData <- PolityGet(vars = "polity2")
# Show data
head(PolityData)
## iso2c country year polity2
## 1 AF Afghanistan 1800 -6
## 2 AF Afghanistan 1801 -6
## 3 AF Afghanistan 1802 -6
## 4 AF Afghanistan 1803 -6
## 5 AF Afghanistan 1804 -6
## 6 AF Afghanistan 1805 -6
</code></pre>
<p>Note that the <strong>iso2c</strong> variable refers to the <a href="http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2">ISO two letter country code country ID</a>. This standardised country identifier could be used to easily merge the Polity IV data with another data set. Another country ID type can be selected with the <code>OutCountryID</code> argument. See the package documentation for details.</p>
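<p>For instance, a minimal sketch of such a merge, where <code>OtherData</code> is a hypothetical country-year data frame that also has <code>iso2c</code> and <code>year</code> columns:</p>
<pre><code class="r"># Merge the Polity IV data with another country-year data set by ISO country code and year
CombinedData <- merge(PolityData, OtherData, by = c("iso2c", "year"))
</code></pre>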
<p>To create <strong>winset</strong> (<strong>W</strong>) and <strong>selectorate</strong> (<strong>ModS</strong>) data use the following code:</p>
<pre><code class="r">WinData <- WinsetCreator()
head(WinData)
## iso2c country year W ModS
## 1 AF Afghanistan 1975 0.25 0
## 2 AF Afghanistan 1976 0.25 0
## 3 AF Afghanistan 1977 0.25 0
## 15 AF Afghanistan 1989 0.50 0
## 16 AF Afghanistan 1990 0.50 0
## 17 AF Afghanistan 1991 0.50 0
</code></pre>
<hr />
<h2 id="install">Install</h2>
<p><em>psData</em> should be on <a href="http://cran.r-project.org/">CRAN</a> soon, but while it is in the development stage you can install it with the <a href="https://github.com/hadley/devtools">devtools</a> package:</p>
<pre><code class="r">devtools::install_github('psData', 'christophergandrud')
</code></pre>
<h2 id="suggestions">Suggestions</h2>
<p>Please feel free to suggest other data set downloading and variable creating functions. To do this just leave a note on the package’s <a href="https://github.com/christophergandrud/psData/issues">Issues page</a> or make a pull request with a new function added.</p>

<h1>How I Accidentally Wrote a Paper on Supervisory Transparency in the European Union and Why You Should Too (2014-01-04)</h1>
<p>Research is an unpredictable thing. You head in one direction, but end up going another. Here is a recent example:</p>
<p>A co-author and I had an idea for a paper. It's a long story, but basically we wanted to compare banks in the US to those in the EU. This was a situation where our desire to explore a theory was egged on by what we believed was available data. In the US it's easy to gather data on banks because the regulators have a <a href="http://www.ffiec.gov/">nice website</a> where they release the filings banks send them. The data is in a really good format for statistical analysis. US done. We thought our next move would be to quickly gather similar data for EU banks and we would be on our way.</p>
<p>First we contacted the UK's <a href="http://www.fca.org.uk/">Financial Conduct Authority</a>. Surprisingly, they told us that not only did they not release this data, but it was actually illegal for them to do so. Pretty frustrating. Answers to one question stymied by a lack of data. Argh. I guess we'll just keep looking to see what kind of sample of European countries we can come up with and hope it doesn't lead to ridiculous sampling bias.</p>
<p>We eventually found that only 11 EU countries (out of 28) release any such data and it is in a hodgepodge of different formats making it very difficult to compare across countries. This is remarkable when compared to the US and kind of astounded me given <a href="http://christophergandrud.github.io/RepResR-RStudio/">my open data priors</a>. We then looked at EU level initiatives to increase supervisory transparency. The <a href="http://www.eba.europa.eu/">European Banking Authority</a> has made a number of attempts to increase transparency. For example, it asks Member States to submit some basic aggregate level data about their banking systems. It makes this data available on <a href="http://www.eba.europa.eu/supervisory-convergence/supervisory-disclosure/aggregate-statistical-data">its website</a>.</p>
<p>Countries have a lot of reporting flexibility. They can even choose to label specific items as non-material, non-applicable, or even confidential. Remarkably, a fair number of countries don't even do this. They just report <em>nothing</em>, as we can see here:</p>
<h4>Number of Member States with no missing aggregate banking data reported to the EBA</h4>
<br/>
<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-CZehD4_OSZQ/UsgiQool49I/AAAAAAAAGww/LaU3KMxmzhM/s1600/EBAReportingFigureOriginal.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-CZehD4_OSZQ/UsgiQool49I/AAAAAAAAGww/LaU3KMxmzhM/s640/EBAReportingFigureOriginal.png" /></a></div>
<p>Though almost all Member States reported data during the height of the crisis (2009), this is an aberration. In fact, a lot of countries just don't report anything.</p>
<p>What are the implications of this secrecy we asked? We decided to do more research to find out. We published the first outcome of this project <a href="http://www.bruegel.org/publications/publication-detail/publication/807-supervisory-transparency-in-the-european-banking-union/">here</a>. Have a look if you're interested, but that isn't the point of this blog post. The point is that we set off to research one thing, but ended up stumbling upon another problem that was worthy of investigation.</p>
<p>This post's title is clearly a bit facetious. But the general point is serious. We need to be flexible enough and curious enough to be able to answer the questions we didn't even know to ask before we go looking.</p>

<h1>Three Quick and Simple Data Cleaning Helper Functions (December 2013)</h1>
<p>As I go about cleaning and merging data sets with R, I often end up creating and using simple functions over and over. When this happens, I stick them in the <a href="http://christophergandrud.github.io/DataCombine/">DataCombine</a> package. This makes it easier for me to remember how to do an operation and others can possibly benefit from simplified and (hopefully) more intuitive code.</p>
<p>I've talked about some of the commands in <strong>DataCombine</strong> in <a href="http://christophergandrud.blogspot.de/2013/05/slide-one-function-for-laglead.html">previous</a> <a href="http://christophergandrud.blogspot.de/2013/02/fillin-function-for-filling-in-missing.html">posts</a>. In this post I'll give examples for a few more that I've added over the past couple of months. Note: these examples are based on <strong>DataCombine</strong> version 0.1.11.</p>
<p>Here is a brief run down of the functions covered in this post:</p>
<ul>
<li><p><a href="#FindReplace"><code>FindReplace</code></a>: a function to replace multiple patterns found in a character string column of a data frame.</p></li>
<li><p><a href="#MoveFront"><code>MoveFront</code></a>: moves variables to the front of a data frame. This can be useful if you have a data frame with many variables and want to move a variable or variables to the front.</p></li>
<li><p><a href="#rmExcept"><code>rmExcept</code></a>: removes all objects from a work space except those specified by the user.</p></li>
</ul>
<h1 id="FindReplace"><code>FindReplace</code></h1>
<p>Recently I needed to replace many patterns in a column of strings. Here is a short example. Imagine we have a data frame like this:</p>
<pre><code class="r">ABData <- data.frame(a = c("London, UK", "Oxford, UK", "Berlin, DE", "Hamburg, DE", "Oslo, NO"), b = c(8, 0.1, 3, 2, 1))
</code></pre>
<p>Ok, now I want to replace the <code>UK</code> and <code>DE</code> parts of the strings with <code>England</code> and <code>Germany</code>. So I create a data frame with two columns. The first records the pattern and the second records what I want to replace the pattern with:</p>
<pre><code class="r">Replaces <- data.frame(from = c("UK", "DE"), to = c("England", "Germany"))
</code></pre>
<p>Now I can just use <code>FindReplace</code> to make the replacements all at once:</p>
<pre><code class="r">library(DataCombine)
ABNewDF <- FindReplace(data = ABData, Var = "a", replaceData = Replaces, from = "from", to = "to", exact = FALSE)
# Show changes
ABNewDF
</code></pre>
<pre><code>## a b
## 1 London, England 8.0
## 2 Oxford, England 0.1
## 3 Berlin, Germany 3.0
## 4 Hamburg, Germany 2.0
## 5 Oslo, NO 1.0
</code></pre>
<p>If you set <code>exact = TRUE</code> then <code>FindReplace</code> will only replace exact pattern matches. Also, you can set <code>vector = TRUE</code> to return only a vector of the column you replaced (the <code>Var</code> column), rather than the whole data frame.</p>
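<p>For example, a minimal sketch of the <code>vector</code> option using the same objects as above (the object name is arbitrary):</p>
<pre><code class="r"># Return only the replaced column as a character vector, rather than the whole data frame
ABNewVector <- FindReplace(data = ABData, Var = "a", replaceData = Replaces,
                           from = "from", to = "to", exact = FALSE, vector = TRUE)
</code></pre>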
<h1 id="MoveFront"><code>MoveFront</code></h1>
<p>On occasion I've wanted to move a few variables to the front of a data frame. The <code>MoveFront</code> function makes this pretty simple. It only has two arguments: <code>data</code> and <code>Var</code>. Data is the data frame and <code>Var</code> is a character vector with the columns I want to move to the front of the data frame in the order that I want them. Here is an example:</p>
<pre><code class="r"># Create dummy data
A <- B <- C <- 1:50
OldOrder <- data.frame(A, B, C)
names(OldOrder)
</code></pre>
<pre><code>## [1] "A" "B" "C"
</code></pre>
<pre><code class="r"># Move B and A to the front
NewOrder2 <- MoveFront(OldOrder, c("B", "A"))
names(NewOrder2)
</code></pre>
<pre><code>## [1] "B" "A" "C"
</code></pre>
<h1 id="rmExcept"><code>rmExcept</code></h1>
<p>Finally, sometimes I want to clean up my work space and only keep specific objects. I want to remove everything else. This is straightforward with <code>rmExcept</code>. For example:</p>
<pre><code class="r"># Create objects
A <- 1
B <- 2
C <- 3
# Remove all objects except for A
rmExcept("A")
</code></pre>
<pre><code>## Removed the following objects:
## ABData, ABNewDF, B, C, NewOrder2, OldOrder, Replaces
</code></pre>
<pre><code class="r"># Show workspace
ls()
</code></pre>
<pre><code>## [1] "A"
</code></pre>
<p>You can set the environment you want to clean up with the <code>envir</code> argument. By default it is your global environment.</p>
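<p>Here is a minimal sketch of the <code>envir</code> argument; it assumes <code>rmExcept</code> treats a user-supplied environment just like the global one:</p>
<pre><code class="r"># Clean up a separate environment rather than the global workspace
e <- new.env()
assign("A", 1, envir = e)
assign("B", 2, envir = e)
rmExcept("A", envir = e)
ls(envir = e)  # only "A" should remain
</code></pre>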
<h1>Showing results from Cox Proportional Hazard Models in R with simPH (2013-09-02)</h1>
<p><strong>Update 2 February 2014:</strong> A new version of simPH (Version 1.0) will soon be available for download from CRAN. It allows you to plot using points, ribbons, and (new) lines. See the <a href="http://ssrn.com/abstract=2318977">updated package description paper</a> for examples. Note that the <code>ribbons</code> argument will no longer work as in the examples below. Please use <code>type = 'ribbons'</code> (or <code>'points'</code> or <code>'lines'</code>).
</p>
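<p>For instance, with simPH 1.0 the first ribbon plot below would be produced with something like this sketch (based on the note above):</p>
<pre><code>simGG(Sim1, type = 'ribbons', alpha = 0.5)
</code></pre>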
<p>Effectively showing estimates and uncertainty from <a href="http://en.wikipedia.org/wiki/Proportional_hazards_models">Cox Proportional Hazard (PH) models</a>, especially for interactive and non-linear effects, can be challenging with currently available software. So, researchers often simply display a results table. These are pretty useless for Cox PH models. It is difficult to decipher a simple linear variable’s estimated effect and basically impossible to understand time interactions, interactions between variables, and nonlinear effects without the reader further calculating quantities of interest for a variety of fitted values.</p>
<p>So, I’ve been putting together the <a href="https://github.com/christophergandrud/SimPH">simPH</a> R package to hopefully make it easier to show results from Cox PH models. The package uses plots of post-estimation simulations (the same idea behind the plotting facilities in the <a href="http://projects.iq.harvard.edu/zelig">Zelig</a> package) to show Cox PH estimates and the uncertainty surrounding them.</p>
<p>Here I want to give an overview of how to use <code>simPH</code>. First the general process, then a few examples. Have a look at <a href="http://ssrn.com/abstract=2318977">this paper</a> for details about the math behind the graphs.</p>
<h3 id="generalsteps">General Steps</h3>
<p>There are three steps to use <code>simPH</code>:</p>
<ol>
<li><p>Estimate a Cox PH model in the usual way with the <code>coxph</code> command in the <a href="http://cran.r-project.org/web/packages/survival/index.html">survival</a> package.</p></li>
<li><p>Simulate quantities of interest--<a href="http://en.wikipedia.org/wiki/Hazard_ratio">hazard ratios</a>, <a href="http://pan.oxfordjournals.org/content/19/2/227.abstract">first differences</a>, <a href="http://pan.oxfordjournals.org/content/14/1/63">marginal effect</a>, <a href="http://pan.oxfordjournals.org/content/19/2/227.abstract">relative hazards</a>, or <a href="http://en.wikipedia.org/wiki/Survival_analysis#Hazard_function_and_cumulative_hazard_function">hazard rates</a>--with the appropriate <code>simPH</code> simulation command.</p></li>
<li><p>Plot the simulations with the <code>simGG</code> method.</p></li>
</ol>
<h3 id="afewexamples">A Few Examples</h3>
<p>Here are some basic examples that illustrate the process and key syntax. Before getting started you can install <code>simPH</code> in the normal R way from CRAN.</p>
<h4 id="lineareffects">Linear effects</h4>
<p>Let’s look at a simple example with a linear non-interactive effect. The data set I’m using is from <a href="http://www.jstor.org/stable/3088394">Carpenter (2002)</a>. It is included with <code>simPH</code>. See the package documentation for details.</p>
<p>First, let’s estimate a Cox PH model where the event of interest is a drug receiving FDA approval.</p>
<pre><code># Load packages
library(survival)
library(simPH)
# Load Carpenter (2002) data
data("CarpenterFdaData")
# Estimate survival model
M1 <- coxph(Surv(acttime, censor) ~ prevgenx + lethal +
deathrt1 + acutediz + hosp01 + hhosleng +
mandiz01 + femdiz01 + peddiz01 + orphdum +
vandavg3 + wpnoavg3 + condavg3 + orderent +
stafcder, data = CarpenterFdaData)
</code></pre>
<p>Now say we want to examine the effect that the number of FDA staff reviewing a drug application has on it being accepted. This variable is called <code>stafcder</code> in the model we just estimated. To do this let’s use <code>simPH</code> to simulate hazard ratios. We will simulate hazard ratios for units \(j\) and \(l\), i.e. \(\frac{h_{j}(t)}{h_{l}(t)}\), using <code>simPH</code>’s <code>coxsimLinear</code> command, because we estimate the effect of the number of staff as linear. In the following code we use the <code>Xj</code> argument to set the \(j\) values. We could also use <code>Xl</code>, but since we don't, <code>coxsimLinear</code> assumes all \(x_{l}\) are 0.</p>
<p>Notice the argument <code>spin = TRUE</code>. This finds the shortest 95% probability interval of the simulations using the <a href="http://arxiv.org/pdf/1302.2142v1.pdf">SPIn method</a>. SPIn can be especially useful for showing simulated quantities of interest generated from Cox PH models, because they can often be crowded close to a lower boundary (0 in the case of hazard rates). We should be more interested in the area with the highest probability (the most simulated values) rather than an arbitrary central interval. That being said, if <code>spin = FALSE</code>, then we will simply find the central 95% interval of the simulations.</p>
<pre><code># Simulate and plot Hazard Ratios for stafcder variable
Sim1 <- coxsimLinear(M1, b = "stafcder",
qi = "Hazard Ratio", ci = 0.95,
Xj = seq(1237, 1600, by = 2), spin = TRUE)
# Plot
simGG(Sim1)
</code></pre>
<p><a href="http://figshare.com/articles/Example_Linear_Effect_Created_by_simPH/785759"><img src="http://figshare.com/download/file/1187133/1" style="display: block; margin: auto;" /></a></p>
<p>Notice in the plot that each simulation is plotted as an individual point. These are all of the simulations in the shortest 95% probability interval. Each point has a bit of transparency (they are 90% transparent by default). So the plot is <a href="https://dl.dropboxusercontent.com/u/3011470/WorkingPapers/HSIANG_VISUALLY_WEIGHTED_REGRESSION_v1.pdf">visually weighted</a>; the darker areas of the graph have a higher concentration of the simulations. This gives us a very clear picture of the simulation distribution, i.e. the estimated effect and our uncertainty about it.</p>
<p>If you don’t want to plot every point, you can simply use ribbons showing the constricted 95% and the middle 50% of this interval. To do this simply use the <code>ribbons = TRUE</code> argument.</p>
<pre><code>simGG(Sim1, ribbons = TRUE, alpha = 0.5)
</code></pre>
<p><a href="http://figshare.com/articles/Example_Linear_Effect_Graph_Created_By_simPH/785758"><img src="http://figshare.com/download/file/1187132/1" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></a></p>
<p>Notice the <code>alpha = 0.5</code> argument. This increased the transparency of the widest ribbon to 50%.</p>
<p>The syntax we’ve used here is very similar to what we use when we are working with nonlinear effects estimated with polynomials and splines. Post-estimation simulations can be run with the <code>coxsimPoly</code> and <code>coxsimSpline</code> commands. See the <code>simPH</code> documentation for more examples.</p>
<h4 id="interactions">Interactions</h4>
<p>The syntax for two-way interactions simulated with the <code>coxsimInteract</code> command is a little different. Using the same data, let’s look at how to show results for interactions. In the following model we are interacting two variables <code>lethal</code> and <code>prevgenx</code>. We can think of these variables as <code>X1</code> and <code>X2</code>, respectively. For interactions it can be useful to examine the <a href="http://pan.oxfordjournals.org/content/14/1/63">marginal effect</a>. To find the marginal effect of a one unit increase in the <code>lethal</code> variable given various values of <code>prevgenx</code> let’s use the following code:</p>
<pre><code># Estimate the model
M2 <- coxph(Surv(acttime, censor) ~ lethal*prevgenx, data = CarpenterFdaData)
# Simulate Marginal Effect of lethal for multiple values of prevgenx
Sim2 <- coxsimInteract(M2, b1 = "lethal", b2 = "prevgenx",
qi = "Marginal Effect",
X2 = seq(2, 115, by = 2), nsim = 1000)
# Plot the results
simGG(Sim2, ribbons = TRUE, alpha = 0.5, xlab = "\nprevgenx",
ylab = "Marginal effect of lethal\n")
</code></pre>
<p><a href="http://figshare.com/articles/simPH_Marginal_Effect_Example/785760"><img src="http://figshare.com/download/file/1187134/1" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></a></p>
<p>The order of the <code>X1</code> and <code>X2</code> variables in the interactions matters. The marginal effect is always calculated for the <code>X1</code> variable over a range of <code>X2</code> values.</p>
<p>Notice also that we set the <code>x</code> and <code>y</code> axis labels with the <code>xlab</code> and <code>ylab</code> arguments.</p>
<h4 id="time-varyingeffects">Time-varying Effects</h4>
<p>Finally, let’s look at how to use the <code>coxsimtvc</code> command to show results from effects that we estimate to vary over time. Here we are going to use another data set that is also included with <code>simPH</code>. The event of interest in the following model is when deliberation is stopped on a European Union directive (the model is from <a href="http://pan.oxfordjournals.org/content/19/2/227.short">Licht (2011)</a>). We will create hazard ratios for the effect that the number of backlogged items (<code>backlog</code>) has on deliberation time. We will estimate the effect as a log-time interaction.</p>
<pre><code># Load Golub & Steunenberg (2007) data. The data is included with simPH.
data("GolubEUPData")
# Create natural log-time interactions
Golubtvc <- function(x) {
    tvc(data = GolubEUPData, b = x, tvar = "end", tfun = "log")
}
GolubEUPData$Lcoop <- Golubtvc("coop")
GolubEUPData$Lqmv <- Golubtvc("qmv")
GolubEUPData$Lbacklog <- Golubtvc("backlog")
GolubEUPData$Lcodec <- Golubtvc("codec")
GolubEUPData$Lqmvpostsea <- Golubtvc("qmvpostsea")
GolubEUPData$Lthatcher <- Golubtvc("thatcher")
# Estimate model
M1 <- coxph(Surv(begin, end, event) ~ qmv + qmvpostsea + qmvpostteu +
                coop + codec + eu9 + eu10 + eu12 + eu15 + thatcher +
                agenda + backlog + Lqmv + Lqmvpostsea + Lcoop + Lcodec +
                Lthatcher + Lbacklog,
            data = GolubEUPData, ties = "efron")
</code></pre>
<p>Much of the first half of the code is dedicated to creating the log-time interactions with the <code>tvc</code> command.</p>
<p>Now we simply create the simulations for a range of values of <code>backlog</code> and plot them. Note in the following code that we give <code>coxsimtvc</code> the name of the base <code>backlog</code> variable with the <code>b</code> argument and the name of its log-time interaction term <code>Lbacklog</code> with the <code>btvc</code> argument. We also need to tell <code>coxsimtvc</code> that it is a log-time interaction with <code>tfun = "log"</code>.</p>
<pre><code># Create simtvc object for relative hazard
Sim2 <- coxsimtvc(obj = M1, b = "backlog", btvc = "Lbacklog",
                  qi = "Relative Hazard", Xj = seq(40, 200, 40),
                  tfun = "log", from = 1200, to = 2000, by = 10,
                  nsim = 500)
# Create relative hazard plot
simGG(Sim2, xlab = "\nTime in Days", ribbons = TRUE,
      leg.name = "Backlogged \n Items", alpha = 0.2)
</code></pre>
<p><a href="http://figshare.com/articles/Relative_Hazards_of_Backlog_on_EU_Directive_Deliberation_Time/785761"><img src="http://figshare.com/download/file/1187135/1" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></a></p>
<h4 id="simggplotsareggplot2">simGG Plots are ggplot2</h4>
<p>Finally, because almost every plot created by <code>simGG</code> is a <a href="http://docs.ggplot2.org/current/">ggplot2</a> plot, you can use almost the full range of customisation options that that package allows. See the <a href="http://docs.ggplot2.org/current/">ggplot2</a> documentation for many examples.</p>Unknownnoreply@blogger.com16tag:blogger.com,1999:blog-36784551.post-58452723746992276862013-08-23T04:42:00.000+01:002013-08-23T04:42:27.794+01:00GitHub renders CSV in the browser, becomes even better for social data set creation<p>I've <a href="http://christophergandrud.github.io/RepResR-RStudio/">written</a> in a <a href="http://polmeth.wustl.edu/methodologist/tpm_v20_n2.pdf">number</a> of <a href="http://christophergandrud.blogspot.kr/2013/01/sourcegithubdata-simple-function-for.html">places</a> about how <a href="https://github.com/">GitHub</a> can be a great place to store data. Unlike basically all other web data storage sites (many of which I really like such as <a href="http://thedata.org/">Dataverse</a> and <a href="http://figshare.com/">FigShare</a>) GitHub enables deep social data set development and fits nicely into a reproducible research workflow with R.</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://github.com/christophergandrud/Disproportionality_Data/blob/master/Disproportionality.csv#L2" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://3.bp.blogspot.com/-_b1nzjNlQwc/UhbZmrx8YYI/AAAAAAAAGtk/GYT8stzlMQs/s320/Disproportionality_Data_Disproportionality.csv_at_master_%C2%B7_christophergandrud_Disproportionality_Data.png" /></a></div>
<p>One negative though, especially compared to FigShare, was that there was no easy way to view CSV or TSV data files in the browser. Unless you downloaded the data and opened it in Excel or an R viewer or whatever, you had to look at the raw data file in the browser. It's basically impossible to make sense of a data set of any size like this.</p>
<p>However, from at least today, GitHub now renders the data set in the browser as you would expect. Take a look at <a href="https://github.com/blog/1601-see-your-csvs">their blog post</a> for the details.</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-36784551.post-71234827372873440542013-07-16T00:35:00.000+01:002013-12-20T10:23:30.785+00:00Getting Started with Reproducible Research: A chapter from my new book<p><a href="http://www.amazon.com/Reproducible-Research-RStudio-Chapman-Series/dp/1466572841/ref=sr_1_1?ie=UTF8&qid=1365636095&sr=8-1&keywords=christopher+gandrud" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://3.bp.blogspot.com/-f8MFbNEoyGU/UYNGekqEkTI/AAAAAAAAGOM/Dq36pI06kTQ/s320/RepResCover.jpg" width="160" height="220"/></a></p>
<p><strong>This is an abridged excerpt from Chapter 2 of my new book <a href="http://christophergandrud.github.io/RepResR-RStudio/">Reproducible Research with R and RStudio</a>. It's published by <a href="http://www.crcpress.com/product/isbn/9781466572843">Chapman & Hall/CRC Press</a>.</strong></p>
<p><strong>You can purchase it on <a href="http://www.amazon.com/Reproducible-Research-RStudio-Chapman-Series/dp/1466572841/ref=sr_1_1?ie=UTF8&qid=1365636095&sr=8-1&keywords=christopher+gandrud">Amazon</a>. "Search inside this book" includes a complete table of contents.</strong></p>
<p>Researchers often start thinking about making their work reproducible near the end of the research process when they write up their results or maybe even later when a journal requires their data and code be made available for publication. Or maybe even later when another researcher asks if they can use the data from a published article to reproduce the findings. By then there may be numerous versions of the data set and records of the analyses stored across multiple folders on the researcher’s computers. It can be difficult and time consuming to sift through these files to create an accurate account of how the results were reached. Waiting until near the end of the research process to start thinking about reproducibility can lead to incomplete documentation that does not give an accurate account of how findings were made. Focusing on reproducibility from the beginning of the process and continuing to follow a few simple guidelines throughout your research can help you avoid these problems. Remember "reproducibility is not an afterthought–it is something that must be built into the project from the beginning" (<a href="http://biostatistics.oxfordjournals.org/content/11/3/385.short">Donoho, 2010</a>, 386).</p>
<p>This chapter first gives you a brief overview of the reproducible research process: a workflow for reproducible research. Then it covers some of the key guidelines that can help make your research more reproducible.</p>
<h1>The Big Picture: A Workflow for Reproducible Research</h1>
<p>The three basic stages of a typical computational empirical research project are:</p>
<ul>
<li>data gathering,</li>
<li>data analysis,</li>
<li>results presentation.</li>
</ul>
<p>Instead of starting to use the individual tools of reproducible research as soon as you learn them I recommend briefly stepping back and considering how the stages of reproducible research tie together overall. This will make your workflow more coherent from the beginning and save you a lot of backtracking later on. The figure below illustrates the workflow. Notice that most of the arrows connecting the workflow’s parts point in both directions, indicating that you should always be thinking how to make it easier to go backwards through your research, i.e. reproduce it, as well as forwards.</p>
<hr />
<h3>Example Workflow & A Selection of Commands to Tie it Together</h3>
<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-u3BaetbDxtM/UeR4f8924nI/AAAAAAAAGlU/fnl_KfDSSb8/s1600/Workflow.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-u3BaetbDxtM/UeR4f8924nI/AAAAAAAAGlU/fnl_KfDSSb8/s640/Workflow.png" /></a></div>
<hr />
<p>Around the edges of the figure are some of the commands that make it easier to go forwards and backwards through the process. These commands tie your research together. For example, you can use <a href="http://en.wikipedia.org/wiki/Application_programming_interface">API</a>-based R packages to gather data from the internet. You can use R’s <code>merge</code> command to combine data gathered from different sources into one data set. The <code>getURL</code> command from R’s <a href="http://cran.r-project.org/web/packages/RCurl/index.html">RCurl</a> package and the <code>read.table</code> command can be used to bring this data set into your statistical analyses. The <a href="http://yihui.name/knitr/">knitr</a> package then ties your analyses into your presentation documents. This includes the code you used, the figures you created, and, with the help of tools such as the <a href="http://cran.r-project.org/web/packages/xtable/index.html">xtable</a> package, tables of results. You can even tie multiple presentation documents together. For example, you can access the same figure for use in a LaTeX article and a Markdown-created website with the <code>\includegraphics</code> and <code>![]()</code> commands, respectively. This helps you maintain a consistent presentation of results across multiple document types.</p>
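<p>As a brief illustration of what these tie commands can look like in practice (the URLs and merge keys below are placeholders, not real files):</p>
<pre><code class="r"># Hedged sketch: placeholder URLs and merge keys
library(RCurl)

# Gather two plain-text data sets stored online
indicators <- read.table(text = getURL("https://example.com/indicators.csv"),
                         sep = ",", header = TRUE)
elections <- read.table(text = getURL("https://example.com/elections.csv"),
                        sep = ",", header = TRUE)

# Tie the two sources together into one data set keyed by country-year
combined <- merge(indicators, elections, by = c("country", "year"))
</code></pre>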
<h1>Practical Tips for Reproducible Research</h1>
<p>Before learning the details of the reproducible research workflow with R and RStudio, it’s useful to cover a few broad tips that will help you organize your research process and put these skills in perspective. The tips are:</p>
<ol>
<li>Document everything!,</li>
<li>Everything is a (text) file,</li>
<li>All files should be human readable,</li>
<li>Explicitly tie your files together,</li>
<li>Have a plan to organize, store, and make your files available.</li>
</ol>
<p>Using these tips will help make your computational research really reproducible.</p>
<h2>Document everything!</h2>
<p>In order to reproduce your research others must be able to know what you did. You have to tell them what you did by documenting as much of your research process as possible. Ideally, you should tell your readers how you gathered your data, analyzed it, and presented the results. Documenting everything is the key to reproducible research and lies behind all of the other tips here.</p>
<h2>Everything is a (text) file</h2>
<p>Your documentation is stored in files that include data, analysis code, the write up of results, and explanations of these files (e.g. data set codebooks, session info files, and so on). Ideally, you should use the simplest file format possible to store this information. Usually the simplest file format is the humble, but versatile, text file.</p>
<p>Text files are extremely nimble. They can hold your data in, for example, comma-separated values (<strong>.csv</strong>) format. They can contain your analysis code in <strong>.R</strong> files. And they can be the basis for your presentations as markup documents like <strong>.tex</strong> or <strong>.md</strong>, for LaTeX and Markdown files, respectively. All of these files can be opened by any program that can read text files.</p>
<p>One reason reproducible research is best stored in text files is that this helps future-proof your research. Other file formats, like those used by Microsoft Word (<strong>.docx</strong>) or Excel (<strong>.xlsx</strong>), change regularly and may not be compatible with future versions of these programs. Text files, on the other hand, can be opened by a very wide range of currently existing programs and, more likely than not, future ones as well. Even if future researchers do not have R or a LaTeX distribution, they will still be able to open your text files and, aided by frequent comments (see below), be able to understand how you conducted your research (<a href="http://polmeth.wustl.edu/methodologist/tpm_v18_n2.pdf">Bowers, 2011</a>, 3).</p>
<p>Text files are also very easy to search and manipulate with a wide range of programs–such as R and RStudio–that can find and replace text characters as well as merge and separate files. Finally, text files are easy to version and changes can be tracked using programs such as <a href="http://git-scm.com/">Git</a>.</p>
<h2>All files should be human readable</h2>
<p>Treat all of your research files as if someone who has not worked on the project will, in the future, try to understand them. Computer code is a way of communicating with the computer. It is 'machine readable' in that the computer is able to use it to understand what you want to do. However, there is a very good chance that other people (or you six months in the future) will not understand what you were telling the computer. So, you need to make all of your files 'human readable'. To make them human readable, you should comment on your code with the goal of communicating its design and purpose (<a href="http://arxiv.org/pdf/1210.0530v3">Wilson et al., 2012</a>). With this in mind it is a good idea to comment frequently (<a href="http://polmeth.wustl.edu/methodologist/tpm_v18_n2.pdf">Bowers, 2011</a>, 3) and format your code using a style guide (<a href="http://www.nyu.edu/classes/nagler/quant2/coding_style.html">Nagler, 1995</a>). For especially important pieces of code you should use literate programming–where the source code and the presentation text describing its design and purpose appear in the same document (<a href="http://www-cs-faculty.stanford.edu/~uno/lp.html">Knuth, 1992</a>). Doing this will make it very clear to others how you accomplished a piece of research.</p>
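<p>For example, here is a small made-up snippet showing the kind of commenting and consistent style this implies (the file and variable names are hypothetical):</p>
<pre><code class="r"># Load the country-year data set created by gather_data.R (hypothetical file)
country_data <- read.csv("main_data.csv")

# Keep post-1990 observations only, because earlier years are incomplete
country_recent <- subset(country_data, year > 1990)

# Fit the baseline model discussed in the write-up
baseline_model <- lm(growth ~ debt_ratio, data = country_recent)
</code></pre>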
<h2>Explicitly tie your files together</h2>
<p>If everything is just a text file then research projects can be thought of as individual text files that have a relationship with one another. They are tied together. A data file is used as input for an analysis file. The results of an analysis are shown and discussed in a markup file that is used to create a PDF document. Researchers often do not explicitly document the relationships between files that they used in their research. For example, the results of an analysis–a table or figure–may be copied and pasted into a presentation document. It can be very difficult for future researchers to trace the table or figure back to a particular statistical model and a particular data set without clear documentation. Therefore, it is important to make the links between your files explicit.</p>
<p>Tie commands are the most dynamic way to explicitly link your files together (see the figure above for examples). These commands instruct the computer program you are using to use information from another file.</p>
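<p>In R, for example, an analysis file that explicitly ties itself to the files before and after it in the workflow might look something like this sketch (all file names are hypothetical):</p>
<pre><code class="r"># analysis.R -- hypothetical example of explicit links between files

# Re-run the data gathering step so the data's origin is documented
source("gather_data.R")

# Load the data set written out by gather_data.R
main_data <- read.csv("main_data.csv")

# Save the figure that the write-up (e.g. a knitr document) will include
pdf("effect_plot.pdf")
plot(main_data$debt_ratio, main_data$growth)
dev.off()
</code></pre>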
<h2>Have a plan to organize, store, & make your files available</h2>
<p>Finally, in order for independent researchers to reproduce your work they need to be able to access the files that instruct them how to do this. Files also need to be organized so that independent researchers can figure out how they fit together. So, from the beginning of your research process you should have a plan for organizing your files and a way to make them accessible.</p>
<p>One rule of thumb for organizing your research in files is to limit the amount of content any one file has. Files that contain many different operations can be very difficult to navigate, even if they have detailed comments. For example, it would be very difficult to find any particular operation in a file that contained the code used to gather the data, run all of the statistical models, and create the results figures and tables. If you have a hard time finding things in a file you created, think of the difficulties independent researchers will have!</p>
<p>Because we have so many ways to link files together there is really no need to lump many different operations into one file. So, we can make our files modular. One source code file should be used to complete one or just a few tasks. Breaking your operations into discrete parts will also make it easier for you and others to find errors (<a href="http://www.nyu.edu/classes/nagler/quant2/coding_style.html">Nagler, 1995</a>, 490).</p>Unknownnoreply@blogger.com9tag:blogger.com,1999:blog-36784551.post-31720744410841118912013-06-09T03:00:00.000+01:002013-06-12T11:19:45.405+01:00Quick and Simple D3 Network Graphs from R<a href="http://dl.dropboxusercontent.com/u/12581470/Presentations/OddsAndEnds/NetworkD3.html" imageanchor="1" ><img border="0" style="float:right" src="http://1.bp.blogspot.com/-gFxgVvzBSr8/UbPgDW-7fnI/AAAAAAAAGiE/an2RbjOC-68/s320/NetworkD3.png"/></a>
<p>Sometimes I just want to quickly make a simple <a href="http://d3js.org/">D3</a> JavaScript directed <a href="http://bl.ocks.org/mbostock/4062045">network graph</a> with data in R. Because D3 network graphs can be manipulated in the browser–i.e. nodes can be moved around and highlighted–they're really nice for data exploration. They're also really nice in <a href="http://slidify.org/">HTML presentations</a>. So I put together a bare-bones function, called <a href="https://gist.github.com/christophergandrud/5734624">d3SimpleNetwork</a>, for turning an R data frame into a D3 network graph.</p>
<h1>Arguments</h1>
<p>By bare-bones I mean that, other than the arguments indicating the <code>Data</code> data frame and the <code>Source</code> and <code>Target</code> variables, it only has three arguments: <code>height</code>, <code>width</code>, and <code>file</code>.</p>
<p>The data frame you use should have two columns that contain the source and target variables. Here's an example using fake data:</p>
<pre><code class="r">Source <- c("A", "A", "A", "A", "B", "B", "C", "C", "D")
Target <- c("B", "C", "D", "J", "E", "F", "G", "H", "I")
NetworkData <- data.frame(Source, Target)
</code></pre>
<p>The <code>height</code> and <code>width</code> arguments obviously set the graph's frame height and width. You can give <code>file</code> the name of a file to output the graph to. This will create a standalone webpage. If you leave <code>file</code> as <code>NULL</code>, then the graph will be printed to the console. This can be useful if you are creating a document using <a href="http://yihui.name/knitr/">knitr</a> <a href="http://daringfireball.net/projects/markdown/">Markdown</a> <strike>or (similarly) <a href="http://slidify.org/">slidify</a></strike>. Just set the code chunk option <code>results='asis'</code> and the graph will be rendered in the document.</p>
<h1>Example</h1>
<p>Here's a simple example. First load <code>d3SimpleNetwork</code>:</p>
<pre><code class="r"># Load packages to download d3SimpleNetwork
library(digest)
library(devtools)
# Download d3SimpleNetwork
source_gist("5734624")
</code></pre>
<p>Now just run the function with the example <code>NetworkData</code> from before:</p>
<pre><code class="r">d3SimpleNetwork(NetworkData, height = 300, width = 700)
</code></pre>
<p><a href="http://dl.dropboxusercontent.com/u/12581470/Presentations/OddsAndEnds/NetworkD3.html">Click here</a> for the fully manipulable version. If you click on individual nodes they will change colour and become easier to see. In the future I might add more customisability, but I kind of like the function's current simplicity.</p>
<hr>
<strong>Update 12 June 2013:</strong> The original <code>d3SimpleNetwork</code> command discussed here doesn't work easily with slidify. I have created a new <a href="http://christophergandrud.github.io/d3Network/">d3Network</a> R package that does work well with slidify (and other knitr-created HTML slideshows). Use its <code>d3Network</code> command and set the argument <code>iframe = TRUE</code>.Unknownnoreply@blogger.com6tag:blogger.com,1999:blog-36784551.post-9857793767269692322013-05-21T11:18:00.000+01:002013-05-21T11:18:30.149+01:00Slide: one function for lag/lead variables in data frames, including time-series cross-sectional data<p>I often want to quickly create a lag or lead variable in an R data frame. Sometimes I also want to create the lag or lead variable for different groups in a data frame, for example, if I want to lag GDP for each country in a data frame.</p>
<p>I've found the various R methods for doing this hard to remember and usually need to look at old <a href="http://christophergandrud.tumblr.com/post/35695637523/r-lagged-variables-with-time-series-cross-sectional">blog</a> <a href="http://christophergandrud.tumblr.com/post/35758599399/r-1-lag">posts</a>. Any time we find ourselves using the same code over and over, it's probably time to put it into a function.</p>
<p>So, I added a new command–<code>slide</code>–to the <a href="http://cran.r-project.org/web/packages/DataCombine/DataCombine.pdf">DataCombine</a> R package (v0.1.5).</p>
<p>Building on the <code>shift</code> function TszKin Julian posted on his <a href="http://ctszkin.com/2012/03/11/generating-a-laglead-variables/">blog</a>, <code>slide</code> allows you to slide a variable up by any time unit to create a lead or down to create a lag. It returns the lag/lead variable to a new column in your data frame. It works with both data that has one observed unit and with time-series cross-sectional data.</p>
<p>Note: your data needs to be in ascending time order with equally spaced time increments. For example 1995, 1996, 1997. </p>
<hr>
<h2>Examples</h2>
<h3>Not Cross-sectional data</h3>
<p>Let's create an example data set with three variables:</p>
<pre><code class="r"># Create time variable
Year <- 1980:1999
# Dummy covariates
A <- B <- 1:20
Data1 <- data.frame(Year, A, B)
head(Data1)
</code></pre>
<pre><code>## Year A B
## 1 1980 1 1
## 2 1981 2 2
## 3 1982 3 3
## 4 1983 4 4
## 5 1984 5 5
## 6 1985 6 6
</code></pre>
<p>Now let's lag the <code>A</code> variable by one time unit.</p>
<pre><code class="r">library(DataCombine)
DataSlid1 <- slide(Data1, Var = "A", slideBy = -1)
head(DataSlid1)
</code></pre>
<pre><code>## Year A B A-1
## 1 1980 1 1 NA
## 2 1981 2 2 1
## 3 1982 3 3 2
## 4 1983 4 4 3
## 5 1984 5 5 4
## 6 1985 6 6 5
</code></pre>
<p>The lag variable is automatically given the name <code>A-1</code>.</p>
<p>To lag a variable (i.e. the lag value at a given time is the value of the non-lagged variable at a time in the past) set the <code>slideBy</code> argument to a negative number. Lead variables are created by using positive numbers in <code>slideBy</code>. Lead variables at a given time have the value of the non-lead variable from some time in the future.</p>
<h3>Time-series Cross-sectional data</h3>
<p>Now let's use <code>slide</code> to create a lead variable with time-series cross-sectional data. First create the example data:</p>
<pre><code class="r"># Create time and unit ID variables
Year <- rep(1980:1983, 5)
ID <- sort(rep(seq(1:5), 4))
# Dummy covariates
A <- B <- 1:20
Data2 <- data.frame(Year, ID, A, B)
head(Data2)
</code></pre>
<pre><code>## Year ID A B
## 1 1980 1 1 1
## 2 1981 1 2 2
## 3 1982 1 3 3
## 4 1983 1 4 4
## 5 1980 2 5 5
## 6 1981 2 6 6
</code></pre>
<p>Now let's create a two time unit lead variable based on <code>B</code> for each unit identified by <code>ID</code>:</p>
<pre><code class="r">DataSlid2 <- slide(Data2, Var = "B", GroupVar = "ID",
slideBy = 2)
head(DataSlid2)
</code></pre>
<pre><code>## Year ID A B B2
## 1 1980 1 1 1 3
## 2 1981 1 2 2 4
## 3 1982 1 3 3 NA
## 4 1983 1 4 4 NA
## 5 1980 2 5 5 7
## 6 1981 2 6 6 8
</code></pre>
<p>Hopefully you'll find <code>slide</code> useful in your own data analysis. Any suggestions for improvement are always welcome.</p>Unknownnoreply@blogger.com46tag:blogger.com,1999:blog-36784551.post-55036587423364719872013-04-17T07:36:00.000+01:002013-04-17T09:39:31.951+01:00Reinhart & Rogoff: Everyone makes coding mistakes, we need to make it easy to find them + Graphing uncertainty<p>You may have already seen a lot written on the replication of Reinhart & Rogoff’s (R & R) <a href="http://qz.com/75117/how-influential-was-the-study-warning-high-debt-kills-growth/">much cited</a> 2010 <a href="http://www.nber.org/papers/w15639.pdf">paper</a> done by <a href="http://www.peri.umass.edu/fileadmin/pdf/working_papers/working_papers_301-350/WP322.pdf">Herndon, Ash, and Pollin</a>. If you haven’t, here is a round up of some of what has been written: <a href="http://www.nextnewdeal.net/rortybomb/researchers-finally-replicated-reinhart-rogoff-and-there-are-serious-problems#.UW14rDQo2L4.twitter">Konczal</a>, <a href="http://www.slate.com/blogs/moneybox/2013/04/16/reinhart_rogoff_coding_error_austerity_policies_founded_on_bad_coding.html">Yglesias</a>, <a href="http://krugman.blogs.nytimes.com/2013/04/16/holy-coding-error-batman/?utm_source=feedly">Krugman</a>, <a href="http://marginalrevolution.com/marginalrevolution/2013/04/an-update-on-the-reinhart-and-rogoff-critique-and-some-observations.html">Cowen</a>, <a href="http://simplystatistics.org/2013/04/16/i-wish-economists-made-better-plots/">Peng</a>, <a href="http://blogs.ft.com/ftdata/2013/04/17/the-reinhart-rogoff-response-i/?utm_source=dlvr.it&utm_medium=twitter">FT Alphaville</a>.</p>
<p>This is an interesting issue for me because it involves three topics I really like: political economy, reproducibility, and communicating uncertainty. Others have already commented on these topics in detail. I just wanted to add to this discussion by (a) talking about how this event highlights a real need for researchers to use systems that make <strong>finding and correcting mistakes</strong> easy, (b) <strong>incentivising mistake finding/correction</strong> rather than penalising it, and (c) showing <strong>uncertainty</strong>.</p>
<h2 id="systemsforfindingandcorrectingmistakes">Systems for Finding and Correcting Mistakes</h2>
<p>One of the problems Herndon, Ash, and Pollin found in R&R’s analysis was an <a href="http://www.nextnewdeal.net/rortybomb/researchers-finally-replicated-reinhart-rogoff-and-there-are-serious-problems#.UW14rDQo2L4.twitter">Excel coding error</a>. I love to hate on Excel as much as the next R devotee, but I think that is missing the point. The real lesson is not “don’t use Excel”; the real lesson is: <strong>we all make mistakes</strong>.</p>
<p>(Important point: I refer throughout this post to errors caused by coding mistakes rather than purposeful fabrications and falsifications.)</p>
<p>Coding mistakes are an ever present part of our work. The problem is not that we make coding mistakes. Despite our best efforts we always will. The problem is that we often use tools and practices that make it difficult to <strong>find and correct our mistakes</strong>. </p>
<p>This is where I can get in some Excel hating: tools and practices that make it difficult to find mistakes include binary files (like Excel’s) that can’t be version controlled in a way that fully reveals the research process, not commenting code, not making your data readily available in formats that make replication easy, and not having a system for quickly fixing mistakes when they are found. Sorry R users, but the last three are definitely not exclusive to Excel.</p>
<p>It took Herndon, Ash, and Pollin a considerable amount of time to replicate R & R’s findings and therefore find the Excel error. This seems partially because R & R did not make their analysis files readily available (Herndon, Ash, and Pollin had to ask for them). I’m not sure how this error is going to be corrected and documented. But I imagine it will be like most research corrections: kind of on the fly, mostly emailing and reposting. </p>
<p><strong>How big of a deal is this?</strong> There is some debate over how big of a problem this mistake is. Roger Peng ends <a href="http://simplystatistics.org/2013/04/16/i-wish-economists-made-better-plots/">his really nice post</a>: </p>
<blockquote>
<p>The vibe on the Internets seems to be that if only this problem had been identified sooner, the world would be a better place. But my cynical mind says, uh, no. You can toss this incident in the very large bucket of papers with some technical errors that are easily fixed. Thankfully, someone found these errors and fixed them, and that’s a good thing. Science moves on.</p>
</blockquote>
<p>I agree with most of this paragraph. But, given how <a href="http://qz.com/75117/how-influential-was-the-study-warning-high-debt-kills-growth/">important R & R’s finding was</a> to major policy debates it would have been much better if the mistake was caught sooner rather than later. The tools and practices R & R used made it harder to find and correct the mistake, so policymakers were operating with less accurate information for longer. </p>
<p><strong>Solutions:</strong> I’ve written in some detail in the <a href="http://polmeth.wustl.edu/methodologist/tpm_v20_n2.pdf">most recent issue of The Political Methodologist</a> about how cloud-based version control systems like <a href="https://github.com/">GitHub</a> can be used to make finding and correcting mistakes easier. Pull requests, for example, are a really nice way to directly suggest corrections. </p>
<h2 id="incentivisingerrorfindingandcorrection">Incentivising Error Finding and Correction</h2>
<p>Going forward I think it will be interesting to see how this incident shapes researchers’ perceived <strong>incentives</strong> to make their work easily replicable. Replication is an important part of finding the mistakes that everyone makes. If being found to make a coding mistake (not a fabrication) has a negative impact on your academic career then there are incentives to make finding mistakes difficult, by for example making replication difficult. Most papers do not receive nearly as much attention as R & R’s. So, for most researchers making replication difficult will make it pretty unlikely that anyone will replicate your research and you’ll be home free. </p>
<blockquote class="twitter-tweet"><p>I can't send you my data b/c I think you might find out I made an error. <a href="https://twitter.com/search/%23overlyhonestmethods">#overlyhonestmethods</a></p>— Carlisle Rainey (@carlislerainey) <a href="https://twitter.com/carlislerainey/status/289097513926017024">January 9, 2013</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>This is a perverse incentive indeed. </p>
<p>What can we do? Many journals now require replicable code to accompany published articles. This is a good incentive. Maybe we should go further, and somehow directly incentivise the finding and correction of errors in data sets and analysis code. Ideas could include giving more weight to replication studies at hiring and promotion committees. Maybe even allowing these committees to include information on researchers’ GitHub pull requests that meaningfully improve others’ work by correcting mistakes.</p>
<p>This of course might create perverse incentives in the future to add errors so that they can then be found. I think this is a bit fanciful. There are surely enough negative social incentives (i.e. embarrassment) surrounding making mistakes to prevent this.</p>
<h2 id="showinguncertainty">Showing Uncertainty</h2>
<p>Roger Peng’s post highlighted the issue of graphing uncertainty, but I just wanted to build it out a little further. The interpretation of the correlation R & R found between GDP Growth and Government Debt could have been tempered significantly before any mistakes were found by more directly communicating their original uncertainty. In their original paper, they presented the relationship using bar graphs of average and median GDP growth per grouped debt/GDP level: </p>
<a href="http://4.bp.blogspot.com/-X0L2Xmd2w9Q/UW4-vIxxruI/AAAAAAAAGHQ/3_ygu72_LOQ/s1600/RAndR.png" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-X0L2Xmd2w9Q/UW4-vIxxruI/AAAAAAAAGHQ/3_ygu72_LOQ/s320/RAndR.png" /></a>
<p>Beyond showing the mean and median, there is basically no indication of the distribution of the data they are drawn from.</p>
<p>Herndon, Ash, and Pollin put together some nice graphs of these distributions (and avoid that thing economists do of using two vertical axes with two different meanings).</p>
<a href="http://4.bp.blogspot.com/-ROjeb9jEDdE/UW4--9aLR4I/AAAAAAAAGHY/UCtmtDbtqYo/s1600/Rep2.png" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-ROjeb9jEDdE/UW4--9aLR4I/AAAAAAAAGHY/UCtmtDbtqYo/s320/Rep2.png" /></a>
<p>Here is one that gets rid of the groups altogether:</p>
<a href="http://4.bp.blogspot.com/-sIJUoaU9P_s/UW4_C3FIaiI/AAAAAAAAGHg/nRrRsn-_UIY/s1600/Rep3.png" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-sIJUoaU9P_s/UW4_C3FIaiI/AAAAAAAAGHg/nRrRsn-_UIY/s320/Rep3.png" /></a>
<p>If R & R had shown a simple scatter plot like this (though they did exclude some of the higher GDP growth country-years at the high debt end, so theirs would have looked different), it would have been much more difficult to over-interpret the substantive (policy) value of a correlation between GDP growth and debt/GDP.</p>
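<p>For anyone who wants to make this kind of plot themselves, a rough ggplot2 sketch (the <code>rr_data</code> data frame and its <code>debt_gdp</code> and <code>growth</code> variables are hypothetical placeholders):</p>
<pre><code class="r"># Hedged sketch: rr_data, debt_gdp, and growth are hypothetical names
library(ggplot2)

ggplot(rr_data, aes(x = debt_gdp, y = growth)) +
    geom_point(alpha = 0.5) +   # show every country-year, not just group means
    geom_smooth() +             # smoother with an uncertainty band
    xlab("Public debt/GDP (%)") + ylab("Real GDP growth (%)") +
    theme_bw()
</code></pre>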
<p>Maybe this wouldn’t have actually changed the policy debate that much. As Mark Blyth argues in his <a href="http://www.oup.com/us/catalog/general/subject/Economics/Political/?view=usa&ci=9780199828302">recent book</a> on austerity, “facts never disconfirm a good ideology” (p. 18). But at least Paul Krugman might not have had to debate debt/GDP cutoff points on CNBC (for example at time point 12:40):</p>
<p><iframe width="420" height="315" src="http://www.youtube.com/embed/9dcPhfraCnY" frameborder="0" allowfullscreen></iframe></p>
<hr />
<p><strong>P.S.</strong> To R & R’s credit, they do often make their data <a href="http://www.carmenreinhart.com/data/">available</a>. Their data has been useful for at least one of my <a href="http://ssrn.com/abstract=2155986">papers</a>. However, it is often available in a format that is hard to use for cross-country statistical analysis, including, I would imagine, their own. Though I have never found any errors in the data, reporting and implementing corrections to this data would be piecemeal at best. </p>Unknownnoreply@blogger.com7tag:blogger.com,1999:blog-36784551.post-25319851146633276592013-04-11T23:05:00.000+01:002013-04-12T08:56:38.979+01:00Dropbox & R Data<p>I'm always looking for ways to download data from the internet into R. Though I prefer to host and access plain-text data sets (CSV is my personal favourite) from <a href="https://github.com/">GitHub</a> (see my <a href="http://polmeth.wustl.edu/methodologist/tpm_v20_n2.pdf">short paper</a> on the topic) sometimes it's convenient to get data stored on <a href="https://www.dropbox.com/">Dropbox</a>. </p>
<p>There has been a change in the way Dropbox URLs work and I just added some functionality to the <a href="http://christophergandrud.github.io/repmis/">repmis</a> R package. So I thought that I'd write a quick post on how to download data directly from Dropbox into R.</p>
<p>The download method is different depending on whether or not your <em>plain-text</em> data is in a Dropbox Public folder.</p>
<h2>Dropbox Public Folder</h2>
<p>Dropbox is trying to do away with its public folders. New users need to <a href="https://www.dropbox.com/help/16/en">actively create</a> a Public folder. Regardless, sometimes you may want to download data from one. It used to be that files in Public folders were accessible through non-secure (http) URLs. It's easy to download these into R: just use the <code>read.table</code> command with the URL as the file name. Dropbox recently changed Public links to be secure (https) URLs. These cannot be accessed with <code>read.table</code>.</p>
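<p>For reference, the old pattern looked something like this (the URL is a made-up placeholder and only works for files still served over plain http):</p>
<pre><code class="R"># Hedged sketch: placeholder URL in the old non-secure Public folder style
OldData <- read.table("http://dl.dropbox.com/u/12345678/example_data.csv",
                      sep = ",", header = TRUE)
</code></pre>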
<p>Instead you can use the <code>source_data</code> command from <strong>repmis</strong>:</p>
<pre><code class="R">FinURL <-"https://dl.dropbox.com/u/12581470/code/Replicability_code/Fin_Trans_Replication_Journal/Data/public.fin.msm.model.csv"
# Download data
FinRegulatorData <- repmis::source_data(FinURL,
sep = ",",
header = TRUE)
</code></pre>
<h2>Non-Public Dropbox Folders</h2>
<p>Getting data from a non-Public folder into R was trickier. When you click on a Dropbox-based file's <a href="https://www.dropbox.com/help/167/en">Share Link</a> button you are taken to a secure URL, but not for the file itself. The Dropbox webpage you're taken to is filled with lots of other Dropbox information. I used to think that accessing a plain-text data file embedded in one of these webpages would require some tricky web scraping. Luckily, today I ran across this <a href="http://thebiobucket.blogspot.kr/2013/04/download-files-from-dropbox.html">blog post</a> by Kay Cichini.</p>
<p>With some modifications I was able to easily create a function that can download data from non-Public Dropbox folders. The result is the <code>source_DropboxData</code> command, included in the most recent version of <strong>repmis</strong> (v0.2.4). All you need to know is the name of the file you want to download and its Dropbox key. You can find both of these things in the URL for the webpage that appears when you click on <code>Share Link</code>. Here is an example:</p>
<blockquote>https://www.dropbox.com/s/exh4iobbm2p5p1v/fin_research_note.csv</blockquote>
<p>The file name is at the very end (<code>fin_research_note.csv</code>) and the key is the string of letters and numbers in the middle (<code>exh4iobbm2p5p1v</code>). Now we have all of the information we need for <code>source_DropboxData</code>:</p>
<pre><code class="R">FinDataFull <- repmis::source_DropboxData("fin_research_note.csv",
"exh4iobbm2p5p1v",
sep = ",",
header = TRUE)
</code></pre>Unknownnoreply@blogger.com51