
Posts

End of 2015 Blog Roundup

Over the past few months I've mostly been blogging at a number of other venues. These include: A piece with Mark Hallerberg in Democracy Audit UK summarising our research on how, despite previous findings, democratic governments run up bank bailout tabs as sizable as those of autocracies. This went unnoticed in previous work because democratic governments have incentives (the possibility of losing elections) to shift the realisation of these costs into the future. A post over at Bruegel introducing the Financial Supervisory Transparency Index that Mark Copelovitch, Mark Hallerberg, and I created. We also discuss supervisory transparency's implications for a European capital markets union. At VoxUkraine, I discuss the causes and possible solutions to brawling in the Ukrainian parliament, based on my recent research in the Journal of Peace Research. I didn't write this one, but my co-author Tom Pepinsky wrote a nice piece about a new working paper we have on the (dif...

More Corrections to the DPI’s yrcurnt Election Timing Variable: OECD Edition

Previously on The Political Methodologist, I posted updates to the Database of Political Institutions' election timing variable: yrcurnt. That set of corrections was only for the current 28 EU member states. I’ve now expanded the corrections to include most other OECD countries. Again, there were many missing elections:

Change list

- Australia: Corrects missing 1998 election year.
- Canada: Corrects missing 2000, 2006, 2008, 2011 election years.
- Iceland: Corrects missing 2009 election year.
- Ireland: Corrects missing 2011 election.
- Japan: Corrects missing 2005 and 2012 elections. Corrects misclassification of the 2003 and 2009 elections, which were originally erroneously labeled as being in 2004 and 2008, respectively.

Import into R

To import the most recent corrected version of the data into R simply use:

election_time <- rio::import('https://raw.githubusercontent.com/christophergandrud/yrcurnt...
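Once imported, the corrected variable can be folded into an existing country-year panel. The sketch below is illustrative only: the identifier column names (country, year) and the panel object my_panel are assumptions, and the import URL is a placeholder for the full raw CSV address given in the post; check names(election_time) after importing.

```r
library(rio)

# Placeholder for the full raw GitHub URL from the post above
election_time <- rio::import(corrected_csv_url)

# my_panel is a hypothetical country-year data frame from your own project;
# the join keys below are assumed, not guaranteed, column names
merged <- merge(my_panel, election_time,
                by = c("country", "year"),
                all.x = TRUE)
```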

A Link Between topicmodels LDA and LDAvis

Carson Sievert and Kenny Shirley have put together the really nice LDAvis R package. It provides a Shiny-based interactive interface for exploring the output from Latent Dirichlet Allocation topic models. If you've never used it, I highly recommend checking out their XKCD example (this paper also has some nice background). LDAvis doesn't fit topic models, it just visualises the output. As such it is agnostic about what package you use to fit your LDA topic model. They have a useful example of how to use output from the lda package. I wanted to use LDAvis with output from the topicmodels package. It works really nicely with texts preprocessed using the tm package. The trick is extracting the information LDAvis requires from the model and placing it into a specifically structured JSON formatted object. To make the conversion from topicmodels output to LDAvis JSON input easier, I created a linking function called topicmodels_json_ldavis . The full function is below. To...
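The full conversion function is truncated in this excerpt. A minimal sketch of the general approach, assuming a model fitted with topicmodels::LDA and the tm DocumentTermMatrix it was fitted on, looks like the following; the function name here is hypothetical and the actual topicmodels_json_ldavis function may differ in its details.

```r
library(topicmodels)
library(LDAvis)
library(slam)

# Sketch: convert topicmodels output into the JSON object LDAvis expects.
# 'fitted' is an LDA object from topicmodels; 'doc_term' is the tm
# DocumentTermMatrix the model was fit on (documents with zero tokens
# should be removed beforehand).
topicmodels_to_ldavis_json <- function(fitted, doc_term) {
    post  <- topicmodels::posterior(fitted)
    phi   <- post$terms    # topic-term probability matrix (topics x terms)
    theta <- post$topics   # document-topic probability matrix (docs x topics)
    vocab <- colnames(phi)

    LDAvis::createJSON(
        phi            = phi,
        theta          = theta,
        vocab          = vocab,
        doc.length     = slam::row_sums(doc_term),        # tokens per document
        term.frequency = slam::col_sums(doc_term)[vocab]   # counts, in vocab order
    )
}

# The resulting JSON can then be passed to LDAvis::serVis() to open the viewer.
```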

Simulated or Real: What type of data should we use when teaching social science statistics?

I just finished teaching a new course on collaborative data science to social science students. The materials are on GitHub if you're interested. What did we do and why? Maybe the most unusual thing about this class from a statistics pedagogy perspective was that it was entirely focused on real world data; data that the students gathered themselves. I gave them virtually no instruction on what data to gather. They gathered data they felt would help them answer their research questions. Students directly confronted the data warts that usually consume a large proportion of researchers' actual time. My intention was that the students systematically learn tools and best practices for how to address these warts. This is in contrast to many social scientists' statistics education. Typically, students are presented with pre-arranged data. They are then asked to perform some statistical function with it. The end. This leaves students underprepared for actually using statist...

Set up R/Stan on Amazon EC2

A few months ago I posted the script that I use to set up my R/JAGS working environment on an Amazon EC2 instance. Since then I've largely transitioned to using R/Stan to estimate my models. So, I've updated my setup script (see below). There are a few other changes: I don't install/use RStudio on Amazon EC2. Instead, I just use R from the terminal. Don't get me wrong, I love RStudio. But since what I'm doing on EC2 is just running simulations (I handle the results on my local machine), RStudio is overkill. I don't install git anymore. Instead I use source_url (from devtools) and source_data (from repmis) to source scripts from GitHub. Again, all of the manipulation I'm doing to these scripts is on my local machine.
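As a rough illustration of that sourcing workflow (the URLs below are placeholders, not the actual scripts), pulling a simulation script and a data set straight from GitHub from within an EC2 R session looks something like this:

```r
library(devtools)
library(repmis)

# Source a simulation script kept on GitHub (placeholder URL; point this at
# the raw .R file address for your own script)
devtools::source_url("https://raw.githubusercontent.com/USER/REPO/master/stan_sim.R")

# Load a plain-text data set straight from GitHub (placeholder URL)
sim_data <- repmis::source_data("https://raw.githubusercontent.com/USER/REPO/master/sim_data.csv")
```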

Do Political Scientists Care About Effect Sizes: Replication and Type M Errors

Reproducibility has come a long way in political science. Many major journals now require that replication materials be made available either on their websites or on a service such as the Dataverse Network. Most of the top journals in political science have formally committed to reproducible research best practices by signing up to the Data Access and Research Transparency (DA-RT) Joint Statement. This is certainly progress. But what are political scientists actually supposed to do with this new information? Data and code availability does help avoid effort duplication--researchers don't need to gather data or program statistical procedures that have already been gathered or programmed. It promotes better research habits. It definitely provides 'procedural oversight'. We would be highly suspicious of results from authors who were unable or unwilling to produce their code/data. However, there are lots of problems that data/code availability requirements do no...

Simple script for setting up R, Git, and JAGS on an Amazon EC2 Ubuntu Instance

Just wanted to put up the script I've been using to create an Amazon EC2 Ubuntu instance for running RStudio, Git, and JAGS. There isn't anything really new in here, but it has been serving me well. The script begins after the basic instance has been set up in the Amazon EC2 console (yhat has a nice post on how to do this, though some of their screenshots are a little old). Just SSH into the instance and get started.

Updates to repmis: caching downloaded data and Excel data downloading

Over the past few months I’ve added a few improvements to the repmis R package (miscellaneous functions for reproducible research). I just want to briefly highlight two of them: caching of downloaded data sets, and source_XlsxData for downloading data in Excel formatted files. Both of these capabilities are in repmis version 0.2.9 and greater.

Caching

When working with data sourced directly from the internet, it can be time consuming (and make the data host angry) to repeatedly download the data. So, repmis’s source functions (source_data, source_DropboxData, and source_XlsxData) can now cache a downloaded data set by setting the argument cache = TRUE. For example:

DisData <- source_data("http://bit.ly/156oQ7a", cache = TRUE)

When the function is run again, the data set at http://bit.ly/156oQ7a will be loaded locally, rather than downloaded. To delete the cached data set, simply run the function again with the argument clearCache = TRUE. source_XlsxDat...
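The source_XlsxData section is truncated above. The sketch below shows clearing a cache as described in the post, plus a hedged source_XlsxData call: the Excel URL is a placeholder and the sheet argument name is an assumption, so check ?source_XlsxData for the exact signature.

```r
library(repmis)

# Clear a previously cached copy by re-running the call with clearCache = TRUE
DisData <- source_data("http://bit.ly/156oQ7a", clearCache = TRUE)

# Hedged sketch of source_XlsxData: download an Excel file and read one sheet.
# Placeholder URL; the 'sheet' argument name is assumed here.
xl_data <- source_XlsxData("https://example.com/some_data.xlsx",
                           sheet = 1, cache = TRUE)
```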

d3Network Plays Nice with Shiny Web Apps

After some delay (and because of helpful prompting by Giles Heywood and code contributions by John Harrison ) d3Network now plays nicely with Shiny web apps . This means you can fully integrate R/D3.js network graphs into your web apps. Here is what one simple example looks like: An explanation of the code is here and you can download the app and play with it using: shiny::runGitHub('d3ShinyExample', 'christophergandrud')

European Parliament Candidates Have a Unique Opportunity to Advocate for Banking Union Transparency and Accountability

This is reposted from the original on the Hertie School of Governance European Elections blog. The discussion of issues around the European Parliament Elections has been beating around the bush for quite some time now. Karlheinz Reif and Hermann Schmitt famously described European Elections as “second-order elections”, in that they are secondary to national elections. A few weeks ago on this blog Andrea Römmele and Yann Lorenz argued that the current election cycle has been characterised by personality politics between candidates vying for the Commission presidency, rather than substantive issues. However, the election campaigns could be an important opportunity for the public to express their views on and even learn more about one of the defining changes to the European Union since the introduction of the Euro: the European Banking Union. Much of the framework for the Banking Union has been established in the past year after intense debate between the EU institut...

Numbering Subway Exits

In a bit of an aside from what I usually work on, I've put together a small website with a simple purpose: advocating for subway station exits to be numbered. These are really handy for finding your way around and are common in East Asia. But I've never seen them in Western countries. If you're interested check out the site:

Programmatically download political science data with the psData package

A lot of progress has been made on improving political scientists’ ability to access data ‘programmatically’, e.g. data can be downloaded with R source code. Packages such as WDI for the World Bank Development Indicators and dvn for many data sets stored on the Dataverse Network make it much easier for political scientists to use this data as part of a highly integrated and reproducible workflow. There are nonetheless still many commonly used political science data sets that aren’t easily accessible to researchers. Recently, I’ve been using the Database of Political Institutions (DPI), Polity IV democracy indicators, and Reinhart and Rogoff’s (2010) financial crisis occurrence data. All three of these data sets are freely available for download online. However, getting them, cleaning them up, and merging them together is kind of a pain. This is especially true for the Reinhart and Rogoff data, which is in 4 Excel files with over 70 individual shee...
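As a small illustration of the programmatic approach described here, the WDI package mentioned above can pull an indicator straight from the World Bank API. The indicator code and year range below are arbitrary choices for illustration, not anything specific to the post.

```r
library(WDI)

# Download GDP per capita (current US$) for all countries, 2000-2012,
# directly from the World Bank API
gdp <- WDI(country = "all", indicator = "NY.GDP.PCAP.CD",
           start = 2000, end = 2012)

head(gdp)
```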

How I Accidentally Wrote a Paper on Supervisory Transparency in the European Union and Why You Should Too

Research is an unpredictable thing. You head in one direction, but end up going another. Here is a recent example: A co-author and I had an idea for a paper. It's a long story, but basically we wanted to compare banks in the US to those in the EU. This was a situation where our desire to explore a theory was egged on by what we believed was available data. In the US it's easy to gather data on banks because the regulators have a nice website where they release the filings banks send them. The data is in a really good format for statistical analysis. US done. We thought our next move would be to quickly gather similar data for EU banks and we would be on our way. First we contacted the UK's Financial Conduct Authority. Surprisingly, they told us that not only did they not release this data, but it was actually illegal for them to do so. Pretty frustrating. Answers to one question stymied by a lack of data. Argh. I guess we'll just keep looking to see what kind of...

Three Quick and Simple Data Cleaning Helper Functions (December 2013)

As I go about cleaning and merging data sets with R, I often end up creating and using simple functions over and over. When this happens, I stick them in the DataCombine package. This makes it easier for me to remember how to do an operation, and others can possibly benefit from simplified and (hopefully) more intuitive code. I've talked about some of the commands in DataCombine in previous posts. In this post I'll give examples for a few more that I've added over the past couple of months. Note: these examples are based on DataCombine version 0.1.11. Here is a brief rundown of the functions covered in this post:

- FindReplace: a function to replace multiple patterns found in a character string column of a data frame.
- MoveFront: moves variables to the front of a data frame. This can be useful if you have a data frame with many variables and want to move a variable or variables to the front.
- rmExcept: removes all objects from a workspace except those specified...
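A minimal sketch of the three functions on a toy data frame follows. It is based on the descriptions above; the argument names reflect the DataCombine documentation as I understand it, so double-check ?FindReplace, ?MoveFront, and ?rmExcept against your installed version.

```r
library(DataCombine)

# Toy data frame with a country name that needs standardising
df <- data.frame(country = c("Korea, Rep.", "Germany", "Korea, Rep."),
                 value   = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# FindReplace: replace patterns in a character column using a from/to lookup table
replacements <- data.frame(from = "Korea, Rep.", to = "South Korea",
                           stringsAsFactors = FALSE)
df <- FindReplace(data = df, Var = "country", replaceData = replacements,
                  from = "from", to = "to", exact = TRUE)

# MoveFront: put the 'value' column first
df <- MoveFront(df, "value")

# rmExcept: clear the workspace of everything except df
rmExcept("df")
```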

Showing results from Cox Proportional Hazard Models in R with simPH

Update 2 February 2014: A new version of simPH (Version 1.0) will soon be available for download from CRAN. It allows you to plot using points, ribbons, and (new) lines. See the updated package description paper for examples. Note that the ribbons argument will no longer work as in the examples below. Please use type = 'ribbons' (or 'points' or 'lines'). Effectively showing estimates and uncertainty from Cox Proportional Hazard (PH) models, especially for interactive and non-linear effects, can be challenging with currently available software. So, researchers often simply display a results table. These are pretty useless for Cox PH models. It is difficult to decipher a simple linear variable’s estimated effect and basically impossible to understand time interactions, interactions between variables, and nonlinear effects without the reader further calculating quantities of interest for a variety of fitted values. So, I’ve been putting together th...
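To make the workflow concrete, here is a small hedged sketch using the built-in lung data from the survival package; the model, covariates, and range of fitted values are placeholders chosen for illustration, and it assumes simPH version 1.0 or later (per the update note above).

```r
library(survival)
library(simPH)

# Fit a simple Cox PH model on the built-in lung data (illustrative only)
m1 <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung)

# Simulate relative hazards for 'age' across a range of fitted values
sim_age <- coxsimLinear(m1, b = "age", Xj = seq(40, 80, by = 5))

# Plot the simulated quantities with ribbons, per the update note above
simGG(sim_age, type = 'ribbons')
```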

GitHub renders CSV in the browser, becomes even better for social data set creation

I've written in a number of places about how GitHub can be a great place to store data. Unlike basically all other web data storage sites (many of which I really like, such as Dataverse and FigShare), GitHub enables deep social data set development and fits nicely into a reproducible research workflow with R. One negative though, especially compared to FigShare, was that there was no easy way to view CSV or TSV data files in the browser. Unless you downloaded the data and opened it in Excel or an R viewer or whatever, you had to look at the raw data file in the browser. It's basically impossible to make sense of a data set of any size like this. However, as of today, GitHub renders the data set in the browser as you would expect. Take a look at their blog post for the details.

Getting Started with Reproducible Research: A chapter from my new book

This is an abridged excerpt from Chapter 2 of my new book Reproducible Research with R and RStudio . It's published by Chapman & Hall/CRC Press . You can purchase it on Amazon . "Search inside this book" includes a complete table of contents. Researchers often start thinking about making their work reproducible near the end of the research process when they write up their results or maybe even later when a journal requires their data and code be made available for publication. Or maybe even later when another researcher asks if they can use the data from a published article to reproduce the findings. By then there may be numerous versions of the data set and records of the analyses stored across multiple folders on the researcher’s computers. It can be difficult and time consuming to sift through these files to create an accurate account of how the results were reached. Waiting until near the end of the research process to start thinking about reproducibility ca...