Skip to main content

Three Quick and Simple Data Cleaning Helper Functions (December 2013)

As I go about cleaning and merging data sets with R I often end up creating and using simple functions over and over. When this happens, I stick them in the DataCombine package. This makes it easier for me to remember how to do an operation and others can possibly benefit from simplified and (hopefully) more intuitive code.

I've talked about some of the commands in DataCombine in previous posts. In this post I'll give examples for a few more that I've added over the past couple of months. Note: these examples are based on DataCombine version 0.1.11.

Here is a brief run down of the functions covered in this post:

  • FindReplace: a function to replace multiple patterns found in a character string column of a data frame.

  • MoveFront: moves variables to the front of a data frame. This can be useful if you have a data frame with many variables and want to move a variable or variables to the front.

  • rmExcept: removes all objects from a work space except those specified by the user.

FindReplace

Recently I needed to replace many patterns in a column of strings. Here is a short example. Imagine we have a data frame like this:

ABData <- data.frame(a = c("London, UK", "Oxford, UK", "Berlin, DE", "Hamburg, DE", "Oslo, NO"), b = c(8, 0.1, 3, 2, 1))

Ok, now I want to replace the UK and DE parts of the strings with England and Germany. So I create a data frame with two columns. The first records the pattern and the second records what I want to replace the pattern with:

Replaces <- data.frame(from = c("UK", "DE"), to = c("England", "Germany"))

Now I can just use FindReplace to make the replacements all at once:

library(DataCombine)

ABNewDF <- FindReplace(data = ABData, Var = "a", replaceData = Replaces, from = "from", to = "to", exact = FALSE)

# Show changes
ABNewDF
##                  a   b
## 1  London, England 8.0
## 2  Oxford, England 0.1
## 3  Berlin, Germany 3.0
## 4 Hamburg, Germany 2.0
## 5         Oslo, NO 1.0

If you set exact = TRUE then FindReplace will only replace exact pattern matches. Also, you can set vector = TRUE to return only a vector of the column you replaced (the Var column), rather than the whole data frame.

MoveFront

On occasion I've wanted to move a few variables to the front of a data frame. The MoveFront function makes this pretty simple. It only has two arguments: data and Var. Data is the data frame and Var is a character vector with the columns I want to move to the front of the data frame in the order that I want them. Here is an example:

# Create dummy data
A <- B <- C <- 1:50
OldOrder <- data.frame(A, B, C)

names(OldOrder)
## [1] "A" "B" "C"
# Move B and A to the front
NewOrder2 <- MoveFront(OldOrder, c("B", "A"))
names(NewOrder2)
## [1] "B" "A" "C"

rmExcept

Finally, sometimes I want to clean up my work space and only keep specific objects. I want to remove everything else. This is straightforward with rmExcept. For example:

# Create objects
A <- 1
B <- 2
C <- 3

# Remove all objects except for A
rmExcept("A")
## Removed the following objects:
## ABData, ABNewDF, B, C, NewOrder2, OldOrder, Replaces
# Show workspace
ls()
## [1] "A"

You can set the environment you want to clean up with the envir argument. By default is is your global environment.

Comments

Popular posts from this blog

Showing results from Cox Proportional Hazard Models in R with simPH

Update 2 February 2014: A new version of simPH (Version 1.0) will soon be available for download from CRAN. It allows you to plot using points, ribbons, and (new) lines. See the updated package description paper for examples. Note that the ribbons argument will no longer work as in the examples below. Please use type = 'ribbons' (or 'points' or 'lines' ). Effectively showing estimates and uncertainty from Cox Proportional Hazard (PH) models , especially for interactive and non-linear effects, can be challenging with currently available software. So, researchers often just simply display a results table. These are pretty useless for Cox PH models. It is difficult to decipher a simple linear variable’s estimated effect and basically impossible to understand time interactions, interactions between variables, and nonlinear effects without the reader further calculating quantities of interest for a variety of fitted values. So, I’ve been putting together th

Slide: one function for lag/lead variables in data frames, including time-series cross-sectional data

I often want to quickly create a lag or lead variable in an R data frame. Sometimes I also want to create the lag or lead variable for different groups in a data frame, for example, if I want to lag GDP for each country in a data frame. I've found the various R methods for doing this hard to remember and usually need to look at old blog posts . Any time we find ourselves using the same series of codes over and over, it's probably time to put them into a function. So, I added a new command– slide –to the DataCombine R package (v0.1.5). Building on the shift function TszKin Julian posted on his blog , slide allows you to slide a variable up by any time unit to create a lead or down to create a lag. It returns the lag/lead variable to a new column in your data frame. It works with both data that has one observed unit and with time-series cross-sectional data. Note: your data needs to be in ascending time order with equally spaced time increments. For example 1995, 1996

Dropbox & R Data

I'm always looking for ways to download data from the internet into R. Though I prefer to host and access plain-text data sets (CSV is my personal favourite) from GitHub (see my short paper on the topic) sometimes it's convenient to get data stored on Dropbox . There has been a change in the way Dropbox URLs work and I just added some functionality to the repmis R package. So I though that I'ld write a quick post on how to directly download data from Dropbox into R. The download method is different depending on whether or not your plain-text data is in a Dropbox Public folder or not. Dropbox Public Folder Dropbox is trying to do away with its public folders. New users need to actively create a Public folder. Regardless, sometimes you may want to download data from one. It used to be that files in Public folders were accessible through non-secure (http) URLs. It's easy to download these into R, just use the read.table command, where the URL is the file name