30 December 2015

End of 2015 Blog Roundup

Over the past few months I've mostly been blogging at a number of other venues. These include:

  • A piece with Mark Hallerberg in Democracy Audit UK summarising our research on how, contrary to previous findings, democratic governments run up bank bailout tabs about as large as those of autocracies. Previous work missed this because democratic governments have incentives (the possibility of losing elections) to shift the realisation of these costs into the future.

  • A post over at Bruegel introducing the Financial Supervisory Transparency Index that Mark Copelovitch, Mark Hallerberg, and I created. We also discuss supervisory transparency's implications for a European capital markets union.

  • At VoxUkraine, I discuss the causes of, and possible solutions to, brawling in the Ukrainian parliament based on my recent research in the Journal of Peace Research.

  • I didn't write this one, but my co-author Tom Pepinsky wrote a nice piece about a new working paper we have on the (difficulty of) predicting financial crises.

20 June 2015

More Corrections to the DPI’s yrcurnt Election Timing Variable: OECD Edition

Previously on The Political Methodologist, I posted updates to the Database of Political Institutions' election timing variable: yrcurnt. That set of corrections was only for the current 28 EU member states.

I’ve now expanded the corrections to include most other OECD countries.1 Again, there were many missing elections:

Change list

Country    Changes
Australia  Corrects missing 1998 election year.
Canada     Corrects missing 2000, 2006, 2008, and 2011 election years.
Iceland    Corrects missing 2009 election year.
Ireland    Corrects missing 2011 election year.
Japan      Corrects missing 2005 and 2012 election years. Corrects misclassification of the 2003 and 2009 elections, which were originally labeled as being in 2004 and 2008, respectively.

 Import into R

To import the most recent corrected version of the data into R simply use:

election_time <- rio::import('https://raw.githubusercontent.com/christophergandrud/yrcurnt_corrected/master/data/yrcurnt_original_corrected.csv')
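
To see why a missing election year matters, here is a minimal sketch. It assumes that yrcurnt counts the years remaining until the next election and is 0 in an election year. The helper years_to_next_election and the election years passed to it are illustrative only; they are not read from the corrected data set.

```r
# Hypothetical helper: years remaining until the next election (0 in an
# election year, NA when no later election is known).
years_to_next_election <- function(years, election_years) {
  vapply(years, function(y) {
    upcoming <- election_years[election_years >= y]
    if (length(upcoming) == 0) NA_integer_ else as.integer(min(upcoming) - y)
  }, integer(1))
}

years <- 1996:2001

# Illustrative election years, once with and once without a 1998 election
with_1998 <- years_to_next_election(years, c(1996, 1998, 2001))
without_1998 <- years_to_next_election(years, c(1996, 2001))

data.frame(year = years, corrected = with_1998, missing_1998 = without_1998)
```

Under this assumption, leaving out a single election changes the countdown for every year between the previous election and the missing one, not just the election year itself.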

  1. Australia, Canada, Iceland, Israel, Japan, South Korea, New Zealand, Norway, Switzerland, USA

8 May 2015

A Link Between topicmodels LDA and LDAvis

Carson Sievert and Kenny Shirley have put together the really nice LDAvis R package. It provides a Shiny-based interactive interface for exploring the output from Latent Dirichlet Allocation topic models. If you've never used it, I highly recommend checking out their XKCD example (this paper also has some nice background).

LDAvis doesn't fit topic models; it just visualises the output. As such, it is agnostic about which package you use to fit your LDA topic model. They have a useful example of how to use output from the lda package.

I wanted to use LDAvis with output from the topicmodels package. It works really nicely with texts preprocessed using the tm package. The trick is extracting the information LDAvis requires from the model and placing it into a specifically structured JSON formatted object.

To make the conversion from topicmodels output to LDAvis JSON input easier, I created a linking function called topicmodels_json_ldavis. The full function is below. To use it, follow these steps:

  1. Create a VCorpus object using the tm package's Corpus function.

  2. Convert this to a document term matrix using DocumentTermMatrix, also from tm.

  3. Run your model using topicmodels' LDA function.

  4. Convert the output into JSON format using topicmodels_json_ldavis. The function requires the output from steps 1-3.

  5. Visualise with LDAvis' serVis.
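
The steps above can be sketched roughly as follows. This is not the topicmodels_json_ldavis function itself: ldavis_inputs and visualise_lda are hypothetical names, and unlike the function described above the sketch takes document lengths from the document term matrix rather than from the corpus. It assumes that tm, topicmodels, and LDAvis are installed and that docs is a character vector with one text per element.

```r
# Hypothetical helper, not the original topicmodels_json_ldavis function.
# `post` is the list returned by topicmodels::posterior(); `doc_term` is the
# document term matrix the model was fitted on.
ldavis_inputs <- function(post, doc_term) {
  phi <- post$terms      # topics x terms: P(term | topic)
  theta <- post$topics   # documents x topics: P(topic | document)
  vocab <- colnames(phi)
  # Token counts, with columns put in the same order as the model vocabulary
  counts <- as.matrix(doc_term)[, vocab, drop = FALSE]
  list(phi = phi, theta = theta, vocab = vocab,
       doc.length = rowSums(counts),
       term.frequency = colSums(counts))
}

# Steps 1-5 end to end. Defined but not run here because it needs tm,
# topicmodels, and LDAvis, and opens a browser window.
visualise_lda <- function(docs, k = 5) {
  corpus <- tm::VCorpus(tm::VectorSource(docs))         # step 1
  doc_term <- tm::DocumentTermMatrix(corpus)            # step 2
  fitted <- topicmodels::LDA(doc_term, k = k)           # step 3
  inputs <- ldavis_inputs(topicmodels::posterior(fitted), doc_term)
  json <- do.call(LDAvis::createJSON, inputs)           # step 4
  LDAvis::serVis(json)                                  # step 5
}
```

Converting the document term matrix with as.matrix is fine for a small corpus, but it creates a dense matrix, so a large corpus may need sparse row and column sums instead.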

16 December 2014

Simulated or Real: What type of data should we use when teaching social science statistics?

I just finished teaching a new course on collaborative data science to social science students. The materials are on GitHub if you're interested.

What did we do and why?

Maybe the most unusual thing about this class from a statistics pedagogy perspective was that it was entirely focused on real world data; data that the students gathered themselves. I gave them virtually no instruction on what data to gather. They gathered data they felt would help them answer their research questions.

Students directly confronted the data warts that usually consume a large proportion of researchers' actual time. My intention was that the students systematically learn tools and best practices for how to address these warts.

This is in contrast to many social scientists' statistics education. Typically, students are presented with pre-arranged data. They are then asked to perform some statistical function with it. The end.

This leaves students underprepared for actually using statistics in an undirected project (their thesis, or in a job). Typically, when confronted with data gathering and transformation issues in the real world, most muddle through, piecing together ad hoc techniques as they go along in a decidedly inefficient manner and often with poor results. A fair number of students will become frustrated and may never actually succeed in using any of the statistical tools they did learn.

What kind of data?

How does this course fit into a broader social science statistical education?

Zachary Jones had a really nice post the other day advocating that statistics courses use Monte Carlo simulation rather than real world data. The broad argument being that the messiness of real world data distracts students from carefully learning the statistical properties that instructors intend them to learn.

Superficially, it would seem that the course I just finished and Zachary's prescription are opposed. We could think of stats courses as using one of two different types of data:

simulated --- real world

Simulated vs. Real?

As you'll see, I almost entirely agree with Zachary's post, but I think there is a more important difference between the social science statistics course status quo and less commonly taught courses such as mine and (what I think) Zachary is proposing. The difference is where the data comes from: is it gathered/generated by students or is it prepackaged by an instructor?

Many status quo courses use data that is prepackaged by instructors. Both simulated and real world data can be prepackaged. I suppose there are many motivations for this, but an important one surely is that it is easier to teach. As an instructor, you know what the results will be and you know the series of clicks or code that will generate this answer. There are no surprises. Students may also find prepackaged data comforting as they know that there is a correct answer out there. They just need to decode the series of clicks to get it.

Though prepackaged data is easier for instructors and students, it surely is counterproductive in terms of learning how to actually answer research questions with data analysis.

Students will not learn the skills needed to gather and transform real world data so that it can be analysed. Students who simply load a prepackaged data set of simulated values will often not understand where it came from. They can succumb to the temptation to just click through until they get the right answer.

On the other hand, I've definitely had the experience teaching with student simulated data that Zachary describes:

I think many students find [hypothesis testing] unintuitive and end up leaving with a foggy understanding of what tests do. With simulation I don't think it is so hard to explain since you can easily show confidence interval coverage, error rates, power, etc.
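
As a rough illustration of the kind of simulation students can run themselves, here is a minimal sketch that checks confidence interval coverage. The function name ci_coverage, the normal data generating process, and all parameter values are my own choices for this example rather than anything from Zachary's post.

```r
set.seed(2014)

# Share of 95% t-intervals that contain the true mean across repeated samples
ci_coverage <- function(n_sims = 5000, n = 30, true_mean = 1, true_sd = 2) {
  covered <- replicate(n_sims, {
    draws <- rnorm(n, mean = true_mean, sd = true_sd)
    interval <- t.test(draws, conf.level = 0.95)$conf.int
    interval[1] <= true_mean && true_mean <= interval[2]
  })
  mean(covered)
}

coverage <- ci_coverage()
coverage  # should be close to the nominal 0.95
```

Because students set the true mean themselves, they can see directly that roughly 95% of the intervals contain it, and then explore what happens when they change the sample size or draw from a skewed distribution instead.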

For thinking about what is more or less effective in social science statistics education, the more important distinction is:

student gathered/generated --- instructor gathered/generated

Prepackaged vs. student generated data

There is of course a pedagogical difference between data that students gathered from the real world and data they simulated with a computer. Simulated data is useful for teaching the behaviour of statistical methods. Real world data is useful for teaching students how to plan and execute a project using these methods to answer research questions in a way that is reproducible and introduces fewer data munging biases into estimates. Though almost certainly too much to take on together in one course, both should be central to a well-rounded social science statistics education.

10 December 2014

Set up R/Stan on Amazon EC2

A few months ago I posted the script that I use to set up my R/JAGS working environment on an Amazon EC2 instance.

Since then I've largely transitioned to using R/Stan to estimate my models. So, I've updated my setup script (see below).

There are a few other changes:

  • I don't install/use RStudio on Amazon EC2. Instead, I just use R from the terminal. Don't get me wrong, I love RStudio. But since what I'm doing on EC2 is just running simulations (I handle the results on my local machine), RStudio is overkill.

  • I don't install git anymore. Instead I use source_url (from devtools) and source_data (from repmis) to source scripts from GitHub. Again all of the manipulation I'm doing to these scripts is on my local machine.
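
The idea of sourcing scripts instead of cloning a repository can be sketched in base R. This is only an illustration: run_remote_script is a hypothetical helper, and a temporary local file stands in for a script hosted on GitHub, since base R's source accepts either a file path or a URL. In my actual setup, source_url and source_data do this job.

```r
# Hypothetical helper: run a script from a file path or URL in its own
# environment and return that environment so the results can be inspected.
run_remote_script <- function(script_location) {
  script_env <- new.env()
  source(script_location, local = script_env)
  script_env
}

# Stand-in for a script hosted on GitHub: a temporary local file
script_file <- tempfile(fileext = ".R")
writeLines("n_iter <- 2000\nn_chains <- 4", script_file)

settings <- run_remote_script(script_file)
settings$n_iter * settings$n_chains
```

Keeping the sourced objects in their own environment avoids silently overwriting anything already defined in the session on the EC2 instance.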

4 December 2014

Our developing Financial Regulatory Transparency Index

Here is a presentation I just gave on a work in progress. We are developing a new Financial Regulatory Transparency Index using a Bayesian IRT approach.

13 October 2014

Do Political Scientists Care About Effect Sizes: Replication and Type M Errors

Reproducibility has come a long way in political science. Many major journals now require replication materials be made available either on their websites or some service such as the Dataverse Network. Most of the top journals in political science have formally committed to reproducible research best practices by signing up to the Data Access and Research Transparency (DA-RT) Joint Statement.

This is certainly progress. But what are political scientists actually supposed to do with this new information? Data and code availability does help avoid effort duplication--researchers don't need to gather data or program statistical procedures that have already been gathered or programmed. It promotes better research habits. It definitely provides "procedural oversight". We would be highly suspicious of results from authors who were unable or unwilling to produce their code/data.

However, there are lots of problems that data/code availability requirements do not address. Apart from a few journals like Political Science Research and Methods, most journals have no standing policy to check the replication materials' veracity. Reviewers rarely have access to manuscripts' code/data. Even if they did have access to it, few reviewers would be willing or able to undertake the time-consuming task of reviewing this material.

Do political science journals care about coding and data errors?

What do we do if someone replicating published research finds clear data or coding errors that have biased the published estimates?

Note that I'm limiting the discussion here to honest mistakes, not active attempts to deceive. We all make these mistakes. To keep it simple, I'm also only talking about clear, knowable, and non-causal coding and data errors.

Probably the most responsible action a journal could take when results are found to be clearly biased by coding/data errors would be to attach a note detailing the bias directly to the original article. This way readers will always be aware of the correction and will have the best information possible. This is a more efficient way of getting out corrected information than relying on some probabilistic process where readers may or may not stumble upon the information posted elsewhere.

As far as I know, however, no political science journal has a written procedure (please correct me if I'm wrong) for dealing with this new information. My sense is that there are a series of ad hoc responses that closely correspond to how the bias affects the results:

Statistical significance

The situation where a journal is most likely to do anything is when correcting the bias makes the results no longer statistically significant. This might get a journal to append a note to the original article. But maybe not; they could just ignore it.


It might be that once the coding/data bias is corrected, the sign of an estimated effect flips--what Andrew Gelman calls a Type S (sign) error. I really have no idea what a journal would do in this situation. They might append a note or maybe not.


Perhaps the most likely outcome of correcting honest coding/data bias is that the effect size changes. These would be what Gelman calls Type M (magnitude) errors. My sense (and experience) is that, in a context where novelty is greatly privileged over facts, journal editors will almost certainly ignore this new information. It will be buried.
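
To make the effect size case concrete, here is a small simulated sketch. Everything in it is invented for illustration: the data, the coding error (20% of observations have the predictor reverse coded), and the helper slope_summary. It is meant to show how an honest coding error can change an estimate's magnitude while leaving its sign and statistical significance intact.

```r
set.seed(1013)

n <- 1000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)   # true slope is 2

# Hypothetical coding error: the first 20% of observations have x reverse coded
x_miscoded <- x
x_miscoded[1:200] <- -x_miscoded[1:200]

# Slope estimate and p-value from a simple linear regression of y on predictor
slope_summary <- function(predictor) {
  coefs <- summary(lm(y ~ predictor))$coefficients
  c(estimate = coefs["predictor", "Estimate"],
    p_value = coefs["predictor", "Pr(>|t|)"])
}

results <- rbind(correct = slope_summary(x),
                 miscoded = slope_summary(x_miscoded))
results
```

In expectation the miscoded slope is about 1.2 rather than 2, yet it keeps its sign and remains statistically significant. A correction like this changes the substantive conclusion about magnitude without touching the headline result.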

Do political scientists care about effect size?

Due to the complexity of what political scientists study, we rarely (perhaps with the exception of specific topics like election forecasting) think that we are very close to estimating a given effect's real magnitude. Most researchers are aiming for statistical significance and a sign that matches their theory.

Does this mean that we don't care about trying to estimate magnitudes as closely as possible?

Looking at political science practice pre-publication, there is a lot of evidence that we do care about Type M errors. Considerable effort is given to finding new estimation methods that produce less biased results. Questions of omitted variable bias are very common at research seminars and in journal reviews. Most researchers do carefully build their data sets and code to minimise coding/data bias. Many of these efforts are focused on the headline stuff--whether or not a given effect is significant and what the direction of the effect is. But these efforts are also part of a desire to make as accurate an estimate of an effect as possible.

However, the review process and journals' responses to finding Type M errors caused by honest coding/data errors in published findings suggest that perhaps we don't care about effect size. Reviewers almost never look at code and data. Journals (as far as I know, please correct me if I'm wrong) never append information on replications that find Type M errors to original papers.


I have a simple prescription for demonstrating that we actually care about estimating accurate effect sizes:

Develop a standard practice of including a short authored write-up of the data/code bias with corrected results in the original article's supplementary materials. Append a notice to the article pointing to this.

Doing this would not only give readers more accurate effect size estimates, but also make replication materials more useful.

Standardising the practice of publishing authored notes will incentivise people to use replication materials, find errors, and publicly correct them.