Skip to main content

Graphing Predicted Legislative Violence with Zelig & ggplot2

Update 18 January 2013: This example works for Zelig versions < 4. For versions from 4 you will likely have to use Zelig's simulation.matrix command to extract the matrix of expected values.
In my previous post I briefly mentioned an early draft of a working paper (HERE) I've written that looks into the possible causes of violence between legislators (like the violence shown in this picture from the Turkish Parliament).
From The Guardian

In this post I'm going to briefly discuss how I used Zelig's rare events logistic regression (relogit) and ggplot2  in R to simulate and plot the legislative violence probabilities that are in the paper. In this example I am plotting simulated probabilities at fitted values on three variables:

  • Age of democracy (Polity IV > 5)
  • A dichotomous electoral proportionality variable where 1 is high proportionality, 0 otherwise (see here for more details)
  • Governing parties' majority (as a % of total legislative seats. Data is from DPI.)


I used King and Zeng's rare events logistic regression which they include in their R package Zelig to study incidences of legislative violence because (a) I was interested in a dichotomous outcome--whether or not a legislature had an incident of violence in a given year and (b) fortunately legislative violence is fairly rare. There are only 88 incidences in my data set spanning 1981 to Spring 2011 and even fewer (72) when I constricted the sample to 1981-2009, because there is limited data on many of my dependent variables after 2009.


If you are familiar with the Zelig package, you'll know that it already includes a capability to both simulate quantities of interest (for me it's probabilities of violence given various values of the covariates) and plot the results from these simulations with uncertainty estimates.

To do this, first run the basic Zelig model then use setx() to set the range of covariate fitted values you are interested predicting probabilities for (all others are set to their means by default). Then use sim() to simulate the quantities of interest. Finally, just use plot() on the Zelig object that  sim() creates. (See the full code at the end of the post.)

However these plots are . . . not incredibly visually appealing.  Here is an example with various ages of democracy:

Plus, if you are not using base R plots in the rest of your paper, these types of plots will clash.

I used ggplot2 graphs in the rest of the paper so I wanted a way to plot simulated probabilities with ggplot2.  Basically I wanted this:

Using GGPLOT2 and Zelig Simulation Output.

Once you have the Zelig object returned from sim() it is simply a matter of extracting the simulation results. The default is to run 1,000 simulations for each fitted value. Zelig stores these in qi$ev in the Zelig object. In this example the fitted values of democratic age (fitted at years 0 through 85) are in a Zelig simulation object that I called Model.DemSim. To extract the simulations of the predicted probabilities use:
# Extract expected values from simulations

Model.demAge.e <- (Model.DemSim$qi)
Now turn the object Model.demAge.e into a data frame and use melt() from Reshape2 to reshape the data so that you can use it in ggplot2.
# Create data.frame
Model.demAge.e <- data.frame(Model.demAge.e$ev)

# Melt data
Model.demAge.e <- melt(Model.demAge.e, measure = 1:86)
Since the numbers in Variable actually mean something (years of democracy) the final cleanup stage is to remove the “X” prefixes attached to Variable.
# Remove 'X' from variable
Model.demAge.e$variable <- as.numeric(gsub("X", "", Model.demAge.e$variable))
Now we can use Model.demAge.e as the source of data for geom_point() and stat_smooth() in ggplot! You might want to drop simulation results outside the 2.5 and 97.5 percentiles to keep only the middle 95%. The red bars in the Zelig base plots represent the middle 95%. Right now I prefer keeping all of the simulation results and simply changing the alpha (transparency) of the points. This allows us to see all of the results, both outliers and those within widely accepted, but still somewhat arbitrary 95% bounds.

Here is the full code for completely reproducing the last plot above (which is also in the working paper). The last thing to mention is that subsetting the data with complete.cases() to keep only observations with full data on all variables is a crucial step to make before running zelig().


Danilo said…
Nice post, Christopher! I also use Zelig very often, it's a nice package. I would like to ask you a question: do you know any other R package which performs simulations of quantities of interest the way Zelig does? Since there are some models currently not supported by Zelig, it would be good to have a more general solution. Do you have any idea?

Thanks a lot!
Thanks Danilo.

As a matter of fact, since writing that post I've created two R packages for similar simulation techniques with Cox proportional hazards models (Zelig does these, but not for many important quantities of interest) and (with Laron K Williams, Guy D Whitten) dynamic simulations of autoregressive relationships.

In a more generic sense, drawing coefficient estimates for most models from the normal multivariate distribution is pretty easy in R (see for example the rmultnorm command in the MSBVAR package). Then you just need to write the code to calculate the quantities of interest using the appropriate formula + fitted values.

Popular posts from this blog

Do Political Scientists Care About Effect Sizes: Replication and Type M Errors

Reproducibility has come a long way in political science. Many major journals now require replication materials be made available either on their websites or some service such as the Dataverse Network. Most of the top journals in political science have formally committed to reproducible research best practices by signing up to the The (DA-RT) Data Access and Research Transparency Joint Statement.This is certainly progress. But what are political scientists actually supposed to do with this new information? Data and code availability does help avoid effort duplication--researchers don't need to gather data or program statistical procedures that have already been gathered or programmed. It promotes better research habits. It definitely provides ''procedural oversight''. We would be highly suspect of results from authors that were unable or unwilling to produce their code/data.However, there are lots of problems that data/code availability requirements do not address.…

Slide: one function for lag/lead variables in data frames, including time-series cross-sectional data

I often want to quickly create a lag or lead variable in an R data frame. Sometimes I also want to create the lag or lead variable for different groups in a data frame, for example, if I want to lag GDP for each country in a data frame.I've found the various R methods for doing this hard to remember and usually need to look at old blogposts. Any time we find ourselves using the same series of codes over and over, it's probably time to put them into a function. So, I added a new command–slide–to the DataCombine R package (v0.1.5).Building on the shift function TszKin Julian posted on his blog, slide allows you to slide a variable up by any time unit to create a lead or down to create a lag. It returns the lag/lead variable to a new column in your data frame. It works with both data that has one observed unit and with time-series cross-sectional data.Note: your data needs to be in ascending time order with equally spaced time increments. For example 1995, 1996, 1997. ExamplesNot…

Showing results from Cox Proportional Hazard Models in R with simPH

Update 2 February 2014: A new version of simPH (Version 1.0) will soon be available for download from CRAN. It allows you to plot using points, ribbons, and (new) lines. See the updated package description paper for examples. Note that the ribbons argument will no longer work as in the examples below. Please use type = 'ribbons' (or 'points' or 'lines'). Effectively showing estimates and uncertainty from Cox Proportional Hazard (PH) models, especially for interactive and non-linear effects, can be challenging with currently available software. So, researchers often just simply display a results table. These are pretty useless for Cox PH models. It is difficult to decipher a simple linear variable’s estimated effect and basically impossible to understand time interactions, interactions between variables, and nonlinear effects without the reader further calculating quantities of interest for a variety of fitted values.So, I’ve been putting together the simPH R p…