6 December 2013

Three Quick and Simple Data Cleaning Helper Functions (December 2013)

As I go about cleaning and merging data sets with R, I often end up creating and using the same simple functions over and over. When this happens, I stick them in the DataCombine package. This makes it easier for me to remember how to do an operation, and others may benefit from simplified and (hopefully) more intuitive code.

I've talked about some of the commands in DataCombine in previous posts. In this post I'll give examples for a few more that I've added over the past couple of months. Note: these examples are based on DataCombine version 0.1.11.

Here is a brief rundown of the functions covered in this post:

  • FindReplace: a function to replace multiple patterns found in a character string column of a data frame.

  • MoveFront: moves one or more variables to the front of a data frame. This can be useful if you have a data frame with many variables.

  • rmExcept: removes all objects from a workspace except those specified by the user.

FindReplace

Recently I needed to replace many patterns in a column of strings. Here is a short example. Imagine we have a data frame like this:

ABData <- data.frame(a = c("London, UK", "Oxford, UK", "Berlin, DE", "Hamburg, DE", "Oslo, NO"), b = c(8, 0.1, 3, 2, 1))

Ok, now I want to replace the UK and DE parts of the strings with England and Germany. So I create a data frame with two columns. The first records the pattern and the second records what I want to replace the pattern with:

Replaces <- data.frame(from = c("UK", "DE"), to = c("England", "Germany"))

Now I can just use FindReplace to make the replacements all at once:

library(DataCombine)

ABNewDF <- FindReplace(data = ABData, Var = "a", replaceData = Replaces, from = "from", to = "to", exact = FALSE)

# Show changes
ABNewDF
##                  a   b
## 1  London, England 8.0
## 2  Oxford, England 0.1
## 3  Berlin, Germany 3.0
## 4 Hamburg, Germany 2.0
## 5         Oslo, NO 1.0

If you set exact = TRUE then FindReplace will only replace exact pattern matches. Also, you can set vector = TRUE to return only a vector of the column you replaced (the Var column), rather than the whole data frame.
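To make the difference between partial and exact matching concrete, here is a minimal base R sketch. It is not how FindReplace is implemented; replace_patterns is a hypothetical helper, and I am assuming that an "exact" match means the whole string equals the pattern and that patterns are treated as literal text rather than regular expressions.

```r
# Hypothetical helper, not part of DataCombine: apply each from/to pair in turn.
replace_patterns <- function(x, from, to, exact = FALSE) {
  x <- as.character(x)
  for (i in seq_along(from)) {
    if (exact) {
      # Assumption: "exact" means the whole element must equal the pattern
      x[x == from[i]] <- to[i]
    } else {
      # Replace the pattern wherever it occurs inside an element (literal match)
      x <- gsub(from[i], to[i], x, fixed = TRUE)
    }
  }
  x
}

cities <- c("London, UK", "Oxford, UK", "Berlin, DE", "UK")

# Partial matching changes every element that contains a pattern
partial <- replace_patterns(cities, from = c("UK", "DE"),
                            to = c("England", "Germany"))

# Exact matching only changes the last element, which is "UK" and nothing else
whole <- replace_patterns(cities, from = c("UK", "DE"),
                          to = c("England", "Germany"), exact = TRUE)
```

Under these assumptions, partial has all four elements rewritten, while whole leaves the first three untouched.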

MoveFront

On occasion I've wanted to move a few variables to the front of a data frame. The MoveFront function makes this pretty simple. It only has two arguments: data and Var. data is the data frame and Var is a character vector of the columns I want to move to the front, in the order that I want them. Here is an example:

# Create dummy data
A <- B <- C <- 1:50
OldOrder <- data.frame(A, B, C)

names(OldOrder)
## [1] "A" "B" "C"
# Move B and A to the front
NewOrder2 <- MoveFront(OldOrder, c("B", "A"))
names(NewOrder2)
## [1] "B" "A" "C"
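For comparison, the same reordering can be done in base R by indexing the data frame with the chosen columns followed by whatever is left. This is only a sketch of the idea, not MoveFront's actual code; move_front is a hypothetical name used for illustration.

```r
# Hypothetical base R helper, not part of DataCombine
move_front <- function(data, vars) {
  # Fail early if a requested column does not exist
  stopifnot(all(vars %in% names(data)))
  # Requested columns first, then the remaining columns in their original order
  data[, c(vars, setdiff(names(data), vars)), drop = FALSE]
}

df <- data.frame(A = 1:3, B = 4:6, C = 7:9)
reordered <- move_front(df, c("B", "A"))
names(reordered)
## [1] "B" "A" "C"
```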

rmExcept

Finally, sometimes I want to clean up my workspace and keep only specific objects, removing everything else. This is straightforward with rmExcept. For example:

# Create objects
A <- 1
B <- 2
C <- 3

# Remove all objects except for A
rmExcept("A")
## Removed the following objects:
## ABData, ABNewDF, B, C, NewOrder2, OldOrder, Replaces
# Show workspace
ls()
## [1] "A"

You can set the environment you want to clean up with the envir argument. By default it is your global environment.
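If you want to see what cleaning a specific environment looks like without touching your global workspace, here is a small base R sketch that does the same kind of cleanup inside a scratch environment. It uses only rm, ls and setdiff; the names scratch and keep are made up for the example, and this is not rmExcept's own code.

```r
# Create a scratch environment so the global workspace is left alone
scratch <- new.env()
assign("A", 1, envir = scratch)
assign("B", 2, envir = scratch)
assign("C", 3, envir = scratch)

# Objects to keep; everything else in scratch is removed
keep <- "A"
to_remove <- setdiff(ls(envir = scratch), keep)
rm(list = to_remove, envir = scratch)

# Only A is left in the scratch environment
remaining <- ls(envir = scratch)
remaining
## [1] "A"
```

With DataCombine loaded, passing the same environment to rmExcept's envir argument should target it in the same way.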
