In an earlier post I presented some R code for a basic way of collecting text from websites. This is a good place to start for collecting text for text analysis, but it clearly has some limitations. You need to have all of the URLs already stored in a .csv file, and extracting the text from the downloaded HTML with gsub() is imprecise: it doesn't strip the text of common navigation links such as "Home" or "About". Both of these problems can be solved in R with a bit of work, but for bigger scraping projects it is probably a good idea to use another language such as Python or Ruby. ProPublica has an excellent little series on scraping that covers how to gather data from online databases and PDFs. This is a really good public service and enables something sadly unusual in journalism: reproducibility. Their chapter on using Ruby and Nokogiri covers scraping Pfizer's doctor payments disclosure data.
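
As a rough illustration of how the gsub() problem can be handled in R, here is a minimal sketch using the rvest package (not the code from the earlier post; the URL is just a placeholder). Instead of stripping tags with regular expressions, it parses the HTML and keeps only the text inside paragraph nodes, which usually leaves out navigation link text like "Home" or "About":

```r
# Minimal sketch: parse the page and keep only <p> text,
# rather than regex-stripping the raw HTML with gsub().
library(rvest)

url <- "http://www.example.com/some-article"  # placeholder URL
page <- read_html(url)

# Extract paragraph text and collapse it into one string
article_text <- page %>%
  html_nodes("p") %>%
  html_text(trim = TRUE) %>%
  paste(collapse = " ")
```

The CSS selector ("p" here) would need to be adjusted for each site, which is exactly the kind of per-site fiddling that makes larger scraping projects easier in a general-purpose scripting language.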