Skip to main content

Graphing Predicted Legislative Violence with Zelig & ggplot2

Update 18 January 2013: This example works for Zelig versions < 4. For versions from 4 you will likely have to use Zelig's `simulation.matrix` command to extract the matrix of expected values.
In my previous post I briefly mentioned an early draft of a working paper (HERE) I've written that looks into the possible causes of violence between legislators (like the violence shown in this picture from the Turkish Parliament).
 From The Guardian

In this post I'm going to briefly discuss how I used Zelig's rare events logistic regression (relogit) and ggplot2  in R to simulate and plot the legislative violence probabilities that are in the paper. In this example I am plotting simulated probabilities at fitted values on three variables:

• Age of democracy (Polity IV > 5)
• A dichotomous electoral proportionality variable where 1 is high proportionality, 0 otherwise (see here for more details)
• Governing parties' majority (as a % of total legislative seats. Data is from DPI.)

Background

I used King and Zeng's rare events logistic regression which they include in their R package Zelig to study incidences of legislative violence because (a) I was interested in a dichotomous outcome--whether or not a legislature had an incident of violence in a given year and (b) fortunately legislative violence is fairly rare. There are only 88 incidences in my data set spanning 1981 to Spring 2011 and even fewer (72) when I constricted the sample to 1981-2009, because there is limited data on many of my dependent variables after 2009.

Why GGPLOT2?

If you are familiar with the Zelig package, you'll know that it already includes a capability to both simulate quantities of interest (for me it's probabilities of violence given various values of the covariates) and plot the results from these simulations with uncertainty estimates.

To do this, first run the basic Zelig model then use setx() to set the range of covariate fitted values you are interested predicting probabilities for (all others are set to their means by default). Then use sim() to simulate the quantities of interest. Finally, just use plot() on the Zelig object that  sim() creates. (See the full code at the end of the post.)

However these plots are . . . not incredibly visually appealing.  Here is an example with various ages of democracy:

Plus, if you are not using base R plots in the rest of your paper, these types of plots will clash.

I used ggplot2 graphs in the rest of the paper so I wanted a way to plot simulated probabilities with ggplot2.  Basically I wanted this:

Using GGPLOT2 and Zelig Simulation Output.

Once you have the Zelig object returned from sim() it is simply a matter of extracting the simulation results. The default is to run 1,000 simulations for each fitted value. Zelig stores these in qi\$ev in the Zelig object. In this example the fitted values of democratic age (fitted at years 0 through 85) are in a Zelig simulation object that I called Model.DemSim. To extract the simulations of the predicted probabilities use:
``````# Extract expected values from simulations

Model.demAge.e <- (Model.DemSim\$qi)``````
Now turn the object Model.demAge.e into a data frame and use melt() from Reshape2 to reshape the data so that you can use it in ggplot2.
``````# Create data.frame
Model.demAge.e <- data.frame(Model.demAge.e\$ev)``````
``````
# Melt data
Model.demAge.e <- melt(Model.demAge.e, measure = 1:86)``````
Since the numbers in Variable actually mean something (years of democracy) the final cleanup stage is to remove the “X” prefixes attached to Variable.
``````# Remove 'X' from variable
Model.demAge.e\$variable <- as.numeric(gsub("X", "", Model.demAge.e\$variable))``````
Now we can use Model.demAge.e as the source of data for geom_point() and stat_smooth() in ggplot! You might want to drop simulation results outside the 2.5 and 97.5 percentiles to keep only the middle 95%. The red bars in the Zelig base plots represent the middle 95%. Right now I prefer keeping all of the simulation results and simply changing the alpha (transparency) of the points. This allows us to see all of the results, both outliers and those within widely accepted, but still somewhat arbitrary 95% bounds.

Here is the full code for completely reproducing the last plot above (which is also in the working paper). The last thing to mention is that subsetting the data with complete.cases() to keep only observations with full data on all variables is a crucial step to make before running zelig().

Comments

Danilo said…
Nice post, Christopher! I also use Zelig very often, it's a nice package. I would like to ask you a question: do you know any other R package which performs simulations of quantities of interest the way Zelig does? Since there are some models currently not supported by Zelig, it would be good to have a more general solution. Do you have any idea?

Thanks a lot!
Thanks Danilo.

As a matter of fact, since writing that post I've created two R packages for similar simulation techniques with Cox proportional hazards models (Zelig does these, but not for many important quantities of interest) and (with Laron K Williams, Guy D Whitten) dynamic simulations of autoregressive relationships.

In a more generic sense, drawing coefficient estimates for most models from the normal multivariate distribution is pretty easy in R (see for example the rmultnorm command in the MSBVAR package). Then you just need to write the code to calculate the quantities of interest using the appropriate formula + fitted values.

Slide: one function for lag/lead variables in data frames, including time-series cross-sectional data

I often want to quickly create a lag or lead variable in an R data frame. Sometimes I also want to create the lag or lead variable for different groups in a data frame, for example, if I want to lag GDP for each country in a data frame.I've found the various R methods for doing this hard to remember and usually need to look at old blogposts. Any time we find ourselves using the same series of codes over and over, it's probably time to put them into a function. So, I added a new command–slide–to the DataCombine R package (v0.1.5).Building on the shift function TszKin Julian posted on his blog, slide allows you to slide a variable up by any time unit to create a lead or down to create a lag. It returns the lag/lead variable to a new column in your data frame. It works with both data that has one observed unit and with time-series cross-sectional data.Note: your data needs to be in ascending time order with equally spaced time increments. For example 1995, 1996, 1997. ExamplesNot…

A Link Between topicmodels LDA and LDAvis

Carson Sievert and Kenny Shirley have put together the really nice LDAvis R package. It provides a Shiny-based interactive interface for exploring the output from Latent Dirichlet Allocation topic models. If you've never used it, I highly recommend checking out their XKCD example (this paper also has some nice background).LDAvis doesn't fit topic models, it just visualises the output. As such it is agnostic about what package you use to fit your LDA topic model. They have a useful example of how to use output from the lda package.I wanted to use LDAvis with output from the topicmodels package. It works really nicely with texts preprocessed using the tm package. The trick is extracting the information LDAvis requires from the model and placing it into a specifically structured JSON formatted object.To make the conversion from topicmodels output to LDAvis JSON input easier, I created a linking function called topicmodels_json_ldavis. The full function is below. To use it follow …

Dropbox & R Data

I'm always looking for ways to download data from the internet into R. Though I prefer to host and access plain-text data sets (CSV is my personal favourite) from GitHub (see my short paper on the topic) sometimes it's convenient to get data stored on Dropbox. There has been a change in the way Dropbox URLs work and I just added some functionality to the repmis R package. So I though that I'ld write a quick post on how to directly download data from Dropbox into R. The download method is different depending on whether or not your plain-text data is in a Dropbox Public folder or not.Dropbox Public FolderDropbox is trying to do away with its public folders. New users need to actively create a Public folder. Regardless, sometimes you may want to download data from one. It used to be that files in Public folders were accessible through non-secure (http) URLs. It's easy to download these into R, just use the read.table command, where the URL is the file name. Dropbox recent…