In an earlier post I presented some R code for a basic way of collecting text from websites. It's a good place to start if you want to gather text for analysis.
However, it clearly has some limitations:
- You need to have all of the URLs already stored in a .csv file.
- The method of extracting the text from the downloaded HTML code using gsub() is a bit imprecise. It doesn't remove the text from navigation links such as "Home" or "About", so that boilerplate ends up in your corpus.
Both of these problems can be solved in R with a bit of work. But I think for bigger scraping projects it is probably a good idea to use other languages such as Python or Ruby.
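For example, here is a minimal sketch of how both issues might be handled in R with the XML package instead of gsub(). The starting URL and the XPath selectors are placeholders for illustration only:

```r
library(XML)

# Harvest the URLs from a starting page's links,
# rather than keeping them in a .csv file.
# (Hypothetical starting page; assumes the links are absolute URLs --
# relative links would need to be completed first.)
start.page <- "http://www.example.com/archive"
parsed <- htmlParse(start.page)
urls <- xpathSApply(parsed, "//a/@href")

# Pull only the paragraph text from a page, so the text of
# navigation links like "Home" or "About" is left out.
get.text <- function(url) {
  doc <- htmlParse(url)
  paste(xpathSApply(doc, "//p", xmlValue), collapse = " ")
}

page.text <- sapply(urls, get.text)
```

Using XPath selectors rather than regular expressions makes it much easier to keep only the parts of the page you actually want.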
ProPublica has an excellent little series on scraping that covers how to gather data from online databases and PDFs. This is a really good public service and enables something sadly unusual in journalism: reproducibility. Their chapter on using Ruby and Nokogiri to scrape Pfizer's doctor payments disclosure database is particularly helpful.
Building on this, I'm thinking of putting together a slideshow on how to use Ruby, Nokogiri, and Mechanize to scrape the Congressional Record database. It will be similar to the slideshow I made on how to use the googleVis and WDI packages to make Google Motion Charts.
I'm a bit busy over the next few weeks, but now that I've blogged about it, it's on my "Must-Do" list.