
Posts

Cox Proportional Hazard Models in R with simPH

Update 2 February 2014: A new version of simPH (Version 1.0) will soon be available for download from CRAN. It allows you to plot using points, ribbons, and (new) lines. See the updated package description paper for examples. Note that the ribbons argument will no longer work as in the examples below. Please use type = 'ribbons' (or 'points' or 'lines'). Effectively showing estimates and uncertainty from Cox Proportional Hazard (PH) models, especially for interactive and non-linear effects, can be challenging with currently available software. So researchers often simply display a results table. These are pretty useless for Cox PH models. It is difficult to decipher even a simple linear variable's estimated effect, and basically impossible to understand time interactions, interactions between variables, and nonlinear effects, without the reader further calculating quantities of interest for a variety of fitted values. So, I've been putting together th...
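
To illustrate the new syntax, here is a minimal sketch. It assumes the CarpenterFdaData example data that ships with simPH and my reading of the Version 1.0 interface; treat it as a sketch rather than canonical usage:

library(survival)
library(simPH)

# Estimate a simple Cox PH model with simPH's example data
data("CarpenterFdaData")
M1 <- coxph(Surv(acttime, censor) ~ lethal + prevgenx,
            data = CarpenterFdaData)

# Simulate hazard ratios across a range of fitted values of prevgenx
Sim1 <- coxsimLinear(M1, b = "prevgenx", Xj = seq(2, 115, by = 2))

# Version 1.0 syntax: the plot style is set with type
simGG(Sim1, type = 'ribbons')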

GitHub renders CSV in the browser, becomes even better for social data set creation

I've written in a number of places about how GitHub can be a great place to store data. Unlike basically all other web data storage sites (many of which I really like, such as Dataverse and FigShare), GitHub enables deep social data set development and fits nicely into a reproducible research workflow with R. One negative, though, especially compared to FigShare, was that there was no easy way to view CSV or TSV data files in the browser. Unless you downloaded the data and opened it in Excel, an R viewer, or whatever, you had to look at the raw data file in the browser. It's basically impossible to make sense of a data set of any size like this. However, as of today, GitHub renders the data set in the browser as you would expect. Take a look at their blog post for the details.

Getting Started with Reproducible Research: A chapter from my new book

This is an abridged excerpt from Chapter 2 of my new book Reproducible Research with R and RStudio. It's published by Chapman & Hall/CRC Press. You can purchase it on Amazon. "Search inside this book" includes a complete table of contents. Researchers often start thinking about making their work reproducible near the end of the research process when they write up their results, or maybe even later when a journal requires their data and code be made available for publication. Or maybe even later, when another researcher asks if they can use the data from a published article to reproduce the findings. By then there may be numerous versions of the data set and records of the analyses stored across multiple folders on the researcher's computers. It can be difficult and time-consuming to sift through these files to create an accurate account of how the results were reached. Waiting until near the end of the research process to start thinking about reproducibility ca...

Quick and Simple D3 Network Graphs from R

Sometimes I just want to quickly make a simple D3 JavaScript directed network graph with data in R. Because D3 network graphs can be manipulated in the browser (i.e. nodes can be moved around and highlighted) they're really nice for data exploration. They're also really nice in HTML presentations. So I put together a bare-bones function called d3SimpleNetwork for turning an R data frame into a D3 network graph.

Arguments

By bare-bones I mean that, other than the arguments specifying the Data data frame and the Source and Target variables, it only has three arguments: height, width, and file. The data frame you use should have two columns that contain the source and target variables. Here's an example using fake data:

Source <- c("A", "A", "A", "A", "B", "B", "C", "C", "D")
Target <- c("B", "C", "D", "J...
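
A rough sketch of the full call, using my own made-up edge list and arbitrary height/width values (the sketch assumes the argument names described above):

# Hypothetical edge list
Source <- c("A", "A", "B", "C")
Target <- c("B", "C", "C", "D")
NetworkData <- data.frame(Source, Target)

# Write the D3 graph to a standalone HTML file
d3SimpleNetwork(NetworkData, Source = "Source", Target = "Target",
                height = 400, width = 600, file = "ExampleGraph.html")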

Slide: one function for lag/lead variables in data frames, including time-series cross-sectional data

I often want to quickly create a lag or lead variable in an R data frame. Sometimes I also want to create the lag or lead variable for different groups in a data frame, for example, if I want to lag GDP for each country in a data frame. I've found the various R methods for doing this hard to remember and usually need to look at old blog posts. Any time we find ourselves writing the same series of commands over and over, it's probably time to put them into a function. So, I added a new command, slide, to the DataCombine R package (v0.1.5). Building on the shift function TszKin Julian posted on his blog, slide allows you to slide a variable up by any time unit to create a lead, or down to create a lag. It returns the lag/lead variable to a new column in your data frame. It works both with data that has one observed unit and with time-series cross-sectional data. Note: your data needs to be in ascending time order with equally spaced time increments. For example 199...
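
For a sense of the interface, here is a minimal sketch with a hypothetical country-year data frame (columns country, year, gdp); the argument names follow my reading of the DataCombine documentation:

library(DataCombine)

# slide assumes ascending time order within groups
Data <- Data[order(Data$country, Data$year), ]

# Lag gdp by one time period within each country
Lagged <- slide(Data, Var = "gdp", GroupVar = "country",
                NewVar = "gdp_lag1", slideBy = -1)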

Reinhart & Rogoff: Everyone makes coding mistakes, we need to make it easy to find them + Graphing uncertainty

You may have already seen a lot written on the replication of Reinhart & Rogoff's (R&R) much cited 2010 paper done by Herndon, Ash, and Pollin. If you haven't, here is a round-up of some of what has been written: Konczal, Yglesias, Krugman, Cowen, Peng, FT Alphaville. This is an interesting issue for me because it involves three topics I really like: political economy, reproducibility, and communicating uncertainty. Others have already commented on these topics in detail. I just wanted to add to this discussion by (a) talking about how this event highlights a real need for researchers to use systems that make finding and correcting mistakes easy, (b) incentivising mistake finding/correction rather than penalising it, and (c) showing uncertainty.

Systems for Finding and Correcting Mistakes

One of the problems Herndon, Ash, and Pollin found in R&R's analysis was an Excel coding error. I love to hate on Excel as much as the next R ...

Dropbox & R Data

I'm always looking for ways to download data from the internet into R. Though I prefer to host and access plain-text data sets (CSV is my personal favourite) from GitHub (see my short paper on the topic), sometimes it's convenient to get data stored on Dropbox. There has been a change in the way Dropbox URLs work and I just added some functionality to the repmis R package. So I thought I'd write a quick post on how to directly download data from Dropbox into R. The download method differs depending on whether or not your plain-text data is in a Dropbox Public folder.

Dropbox Public Folder

Dropbox is trying to do away with its public folders. New users need to actively create a Public folder. Regardless, sometimes you may want to download data from one. It used to be that files in Public folders were accessible through non-secure (http) URLs. It's easy to download these into R: just use the read.table command, where the URL is the file name...
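
For the Public folder case, that looks roughly like this; the URL below is made up for illustration:

# A minimal sketch; the Dropbox Public URL is hypothetical
UrlAddress <- "http://dl.dropbox.com/u/12345678/ExampleData.csv"
Data <- read.table(UrlAddress, sep = ",", header = TRUE)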

FillIn: a function for filling in missing data in one data frame with info from another

Update (10 March 2013): FillIn is now part of the budding DataCombine package. Sometimes I want to use R to fill in values that are missing in one data frame with values from another. For example, I have data from the World Bank on government deficits. However, there are some country-years with missing data. I gathered data from Eurostat on deficits and want to use this data to fill in some of the values that are missing from my World Bank data. Doing this is kind of a pain, so I created a function that would do it for me. It's called FillIn.

An Example

Here is an example using some fake data. (This example and part of the function were inspired by a Stack Exchange conversation between JD Long and Josh O'Brien.) First let's make two data frames: one with missing values in a variable called fNA, and a data frame with a more complete variable called fFull.

# Create data set with missing values
naDF <- data.frame(a = sample(c(1,2), 100, rep=TRUE), ...
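
Once both data frames exist, the call itself is short. This sketch assumes FillIn's D1/D2/Var1/Var2/KeyVar arguments and that the two frames share hypothetical key columns a and b:

# Fill missing fNA values in naDF with fFull values from fillDF,
# matching rows on the key columns a and b
FilledDF <- FillIn(D1 = naDF, D2 = fillDF, Var1 = "fNA",
                   Var2 = "fFull", KeyVar = c("a", "b"))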

InstallOldPackages: a repmis command for installing old R package versions

A big problem in reproducible research is that software changes. The code you used to do a piece of research may depend on a specific version of software that has since been changed. This is an annoying problem in R because install.packages only installs the most recent version of a package. It can be tedious to collect the old versions. On Toby Dylan Hocking's suggestion, I added tools to the repmis package so that you can install, load, and cite specific R package versions. It should work for any package version that is stored on the CRAN archive (http://cran.r-project.org). To only install old package versions, use the new repmis command InstallOldPackages. For example:

# Install old versions of the e1071 and gtools packages.
# Create vectors of the package names and versions to install.
# Note: the names and version numbers must be in the same order.
Names <- c("e1071", "gtools")
Vers <- c("1.6", "2.6.1")
# Install...
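
The excerpt cuts off before the call itself; per my reading of the repmis documentation, it takes the two vectors like this (treat the argument names as an assumption):

# A sketch of the actual call, assuming pkgs/versions argument names
InstallOldPackages(pkgs = Names, versions = Vers)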

repmis: misc. tools for reproducible research in R

I've started to put together an R package called repmis. It has miscellaneous tools for reproducible research with R. The idea behind the package is to collate commands that simplify some of the common R code used within knitr-type reproducible research papers. It's still very much in the early stages of development and has two commands:

LoadandCite: a command to load all of the R packages used in a paper and create a BibTeX file containing citation information for them. It can also install the packages if they are on CRAN.

source_GitHubData: a command for downloading plain-text formatted data stored on GitHub or at any other secure (https) URL.

I've written about why you might want to use source_GitHubData before (see here and here). You can use LoadandCite in a code chunk near the beginning of a knitr reproducible research document to load all of the R packages you will use in the document and automatically generate a BibTeX file you can draw on to c...
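
Near the top of a knitr document the call might look like this; the package list and BibTeX file name are hypothetical, and the sketch assumes LoadandCite's pkgs and file arguments:

# Load the paper's packages and write their citations to a .bib file
library(repmis)
LoadandCite(pkgs = c("ggplot2", "knitr"), file = "RPackageCitations.bib")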

source_GitHubData: a simple function for downloading data from GitHub into R

Update 31 January: I've folded source_GitHubData into the repmis package. See this post. Update 7 January 2012: I updated the internal workings of source_GitHubData so that it now relies on httr rather than RCurl. Also, it is more directly descended from devtools' source_url command. This has two advantages: shortened URLs can be used instead of the data sets' full GitHub URLs, and the ssl.verifypeer issue is resolved. (Though please let me know if you have problems.) The post has been rewritten to reflect these changes. In previous posts I've discussed how to download data stored in plain-text data files (e.g. CSV, TSV) on GitHub directly into R. Not sure why it took me so long to get around to this, but I've finally created a little function that simplifies the process of downloading plain-text data from GitHub. It's called source_GitHubData. (The name mimics the devtools syntax for functions like source_gist and source_url...
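
In use, you pass the function the data file's URL; the raw.github.com address below is made up for illustration:

# A minimal sketch; the URL is hypothetical
UrlAddress <- "https://raw.github.com/username/repo/master/ExampleData.csv"
Data <- source_GitHubData(url = UrlAddress)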

Update to Graphing Non-Proportional Hazards in R

Update 31 July 2013: I've moved all of the functionality described in this post into an R package called simPH. Have a look. It is much easier to use. This is a quick update for a previous post on Graphing Non-Proportional Hazards in R. In the previous post I showed how to simulate and graph 1,000 non-proportional hazard ratios at roughly every point in time across an observation period. In the previous example I kept in simulation outliers. Some people have suggested dropping the top and bottom 2.5 percent of simulated values (i.e. keeping the middle 95 percent). Luckily this can be accomplished with Hadley Wickham's plyr package and three lines of code. The trick is to use plyr's ddply command to subset the data frame at each point in Time where we simulated values. In the previous example the simulated values were in a variable called HRqmv. In each subset we use the quantile command from base R to create logical variables indicating if a simulation o...
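
The core of the trick looks roughly like this; Sim, Time, and HRqmv follow the previous post, but the exact bookkeeping here is a reconstruction rather than the post's verbatim code:

library(plyr)

# Flag simulations outside the middle 95 percent at each point in Time
Sim <- ddply(Sim, .(Time), transform,
             Lower = HRqmv < quantile(HRqmv, 0.025),
             Upper = HRqmv > quantile(HRqmv, 0.975))

# Keep only the middle 95 percent of simulated values
Sim <- subset(Sim, !Lower & !Upper)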

Interesting: Scraply

Ran across a new R package in development on GitHub. It's called scraply . It claims to provide "error-proof scraping in R". This could be a better solution than the one I was working on last year (see HERE ). Haven't tried it yet, though.

Timeline Maps with googleVis & Twitter Bootstrap Carousel (& updated Slidify)

I've wanted to create timeline maps with interactive googleVis Geomaps for a while. These would be a nice way to quickly show the spatial distribution of some data over time. It turns out that it's pretty easy to do with a plugin for Twitter Bootstrap called Carousel . Carousel is probably intended for regular picture slide shows. But because it can hold iframes, it can pretty much include anything, even interactive maps. Here is a short slide show with examples and code for how to combine googleVis and Twitter Bootstrap Carousel to create interactive timeline maps. Note: I used the newest version (0.3.1) of Ramnath Vaidyanathan's Slidify to create the presentation. He is really putting a lot of good work into that package. I especially like the choice to set the default slide framework to Google's I/O 2012 style. It has many features you don't find in other HTML slide frameworks. Particularly useful here, it begins to load iframes when you are on the ...
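
The map half of the recipe is just googleVis output saved as standalone HTML pages, one per time period, which Carousel then cycles through as iframes. A rough sketch with a made-up data frame (MapData with columns country and value); the print-to-file step follows my reading of googleVis's print method:

library(googleVis)

# One Geomap per time period, each written to its own HTML file
Map2010 <- gvisGeoMap(MapData, locationvar = "country", numvar = "value")
print(Map2010, file = "Map2010.html")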

Graphing Non-Proportional Hazards in R

Update 30 July 2013: I've moved all of the functionality described in this post into an R package called simPH. Have a look. It is much easier to use. Update 30 December 2012: I updated the code HERE so that it keeps only the middle 95 percent of the simulated values. I really like this article by Amanda Licht in Political Analysis. It gives a lot of information on how to interpret nonproportional hazards and includes some nice graphs. Her source code is really helpful for learning the nuts and bolts of how to simulate quantities of interest over time. However, it's in Stata, which doesn't really fit into my R-based workflow at the moment. So I decided to port the code over. This post gives an example of what I did.

What is a non-proportional hazard & why use them?

Here is my motivation for being interested in non-proportional hazards: In a few papers I used Cox Proportional Hazard (PH) models to examine countries' policy adoption dec...