In an earlier post I presented some R code for a basic way of collecting text from websites. It's a good place to start if you want to gather text for analysis.
However, it clearly has some limitations:
- You need to have all of the URLs already stored in a .csv file.
- The method of extracting the text from the downloaded HTML code using gsub() is a bit imprecise. It doesn't remove the text from navigation links such as "Home" or "About", so that boilerplate ends up in your corpus.
Both of these problems can be solved in R with a bit of work. But I think for bigger scraping projects it is probably a good idea to use other languages such as Python or Ruby.
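For example, here is a minimal sketch of how both issues might be handled in R with the XML package instead of gsub(). The starting URL and the XPath selectors are placeholders for illustration only:

```r
library(XML)

# Harvest the URLs from a starting page's links,
# rather than keeping them in a .csv file.
# (Hypothetical starting page; assumes the links are absolute URLs --
# relative links would need to be completed first.)
start.page <- "http://www.example.com/archive"
parsed <- htmlParse(start.page)
urls <- xpathSApply(parsed, "//a/@href")

# Pull only the paragraph text from a page, so the text of
# navigation links like "Home" or "About" is left out.
get.text <- function(url) {
  doc <- htmlParse(url)
  paste(xpathSApply(doc, "//p", xmlValue), collapse = " ")
}

page.text <- sapply(urls, get.text)
```

Using XPath selectors rather than regular expressions makes it much easier to keep only the parts of the page you actually want.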
ProPublica has an excellent little series on scraping that covers how to gather data from online databases and PDFs. This is a really good public service and enables something sadly unusual in journalism: reproducibility. Their chapter on using Ruby and Nokogiri to scrape Pfizer's doctor payments disclosure database is particularly helpful.
Building on this, I'm thinking of putting together a slideshow on how to use Ruby, Nokogiri, and Mechanize to scrape the Congressional Record database. It will be similar to the slideshow I made on how to use the googleVis and WDI packages to make Google Motion Charts.
I'm a bit busy over the next few weeks, but now that I've blogged about it, it's on my "Must-Do" list.