26 April 2012

Graphing Predicted Legislative Violence with Zelig & ggplot2

Update 18 January 2013: This example works for Zelig versions < 4. For versions from 4 you will likely have to use Zelig's simulation.matrix command to extract the matrix of expected values.
In my previous post I briefly mentioned an early draft of a working paper (HERE) I've written that looks into the possible causes of violence between legislators (like the violence shown in this picture from the Turkish Parliament).
 
From The Guardian


In this post I'm going to briefly discuss how I used Zelig's rare events logistic regression (relogit) and ggplot2  in R to simulate and plot the legislative violence probabilities that are in the paper. In this example I am plotting simulated probabilities at fitted values on three variables:

  • Age of democracy (Polity IV > 5)
  • A dichotomous electoral proportionality variable where 1 is high proportionality, 0 otherwise (see here for more details)
  • Governing parties' majority (as a % of total legislative seats. Data is from DPI.)


Background

I used King and Zeng's rare events logistic regression which they include in their R package Zelig to study incidences of legislative violence because (a) I was interested in a dichotomous outcome--whether or not a legislature had an incident of violence in a given year and (b) fortunately legislative violence is fairly rare. There are only 88 incidences in my data set spanning 1981 to Spring 2011 and even fewer (72) when I constricted the sample to 1981-2009, because there is limited data on many of my dependent variables after 2009.

Why GGPLOT2?

If you are familiar with the Zelig package, you'll know that it already includes a capability to both simulate quantities of interest (for me it's probabilities of violence given various values of the covariates) and plot the results from these simulations with uncertainty estimates.

To do this, first run the basic Zelig model then use setx() to set the range of covariate fitted values you are interested predicting probabilities for (all others are set to their means by default). Then use sim() to simulate the quantities of interest. Finally, just use plot() on the Zelig object that  sim() creates. (See the full code at the end of the post.)

However these plots are . . . not incredibly visually appealing.  Here is an example with various ages of democracy:



Plus, if you are not using base R plots in the rest of your paper, these types of plots will clash.

I used ggplot2 graphs in the rest of the paper so I wanted a way to plot simulated probabilities with ggplot2.  Basically I wanted this:



Using GGPLOT2 and Zelig Simulation Output.

Once you have the Zelig object returned from sim() it is simply a matter of extracting the simulation results. The default is to run 1,000 simulations for each fitted value. Zelig stores these in qi$ev in the Zelig object. In this example the fitted values of democratic age (fitted at years 0 through 85) are in a Zelig simulation object that I called Model.DemSim. To extract the simulations of the predicted probabilities use:
# Extract expected values from simulations

Model.demAge.e <- (Model.DemSim$qi)
Now turn the object Model.demAge.e into a data frame and use melt() from Reshape2 to reshape the data so that you can use it in ggplot2.
# Create data.frame
Model.demAge.e <- data.frame(Model.demAge.e$ev)

# Melt data
Model.demAge.e <- melt(Model.demAge.e, measure = 1:86)
Since the numbers in Variable actually mean something (years of democracy) the final cleanup stage is to remove the “X” prefixes attached to Variable.
# Remove 'X' from variable
Model.demAge.e$variable <- as.numeric(gsub("X", "", Model.demAge.e$variable))
Now we can use Model.demAge.e as the source of data for geom_point() and stat_smooth() in ggplot! You might want to drop simulation results outside the 2.5 and 97.5 percentiles to keep only the middle 95%. The red bars in the Zelig base plots represent the middle 95%. Right now I prefer keeping all of the simulation results and simply changing the alpha (transparency) of the points. This allows us to see all of the results, both outliers and those within widely accepted, but still somewhat arbitrary 95% bounds.

Here is the full code for completely reproducing the last plot above (which is also in the working paper). The last thing to mention is that subsetting the data with complete.cases() to keep only observations with full data on all variables is a crucial step to make before running zelig().

2 comments:

Danilo said...

Nice post, Christopher! I also use Zelig very often, it's a nice package. I would like to ask you a question: do you know any other R package which performs simulations of quantities of interest the way Zelig does? Since there are some models currently not supported by Zelig, it would be good to have a more general solution. Do you have any idea?

Thanks a lot!

Christopher Gandrud said...

Thanks Danilo.

As a matter of fact, since writing that post I've created two R packages for similar simulation techniques with Cox proportional hazards models (Zelig does these, but not for many important quantities of interest) and (with Laron K Williams, Guy D Whitten) dynamic simulations of autoregressive relationships.

In a more generic sense, drawing coefficient estimates for most models from the normal multivariate distribution is pretty easy in R (see for example the rmultnorm command in the MSBVAR package). Then you just need to write the code to calculate the quantities of interest using the appropriate formula + fitted values.