
Posts

Showing posts with the label web crawler

Interesting: Scraply

Ran across a new R package in development on GitHub. It's called scraply. It claims to provide "error-proof scraping in R". This could be a better solution than the one I was working on last year (see HERE). Haven't tried it yet, though.

Scrappy Scrapers

In an earlier post I presented some R code for a basic way of collecting text from websites. This is a good place to start for collecting text for use in text analysis. However, it clearly has some limitations: you need to have all of the URLs already stored in a .csv file, and the method of extracting the text from the downloaded HTML code using gsub() is a bit imprecise; it doesn't remove the text of common links such as "Home" or "About". Both of these problems can be solved in R with a bit of work, but I think for bigger scraping projects it is probably a good idea to use other languages such as Python or Ruby. ProPublica has an excellent little series on scraping that covers how to gather data from online databases and PDFs. This is a really good public service and enables something sadly unusual in journalism: reproducibility. Their chapter on using Ruby and Nokogiri to scrape Pfizer's doctor payments disclosure data...
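
For the gsub() imprecision in particular, one option is to parse the HTML and keep only the nodes you care about rather than stripping tags with a regex. A minimal sketch with the XML package is below; the URL and the choice of //p nodes are illustrative assumptions on my part, not code from the original post.

library(XML)

# Illustrative URL, not from the original post
raw <- paste(readLines("http://example.com/article", warn = FALSE), collapse = "\n")
doc <- htmlParse(raw, asText = TRUE)
# Keep only paragraph text; navigation link text like "Home" or "About"
# usually sits outside <p> tags, so it is dropped automatically
paras <- xpathSApply(doc, "//p", xmlValue)
text <- paste(paras, collapse = " ")

The same result can be had with rvest's read_html(), html_nodes(), and html_text() if you prefer those tools.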

Simple Text Web Crawler

I put together a simple web crawler for R. It's useful if you are doing any text analysis and need to make .txt files from webpages. If you have a data frame of URLs, it will cycle through them, grab each website, strip out the HTML code, and save each webpage as an individual text file. Thanks also to Rex Douglass. Enjoy (and please feel free to improve).
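
A minimal sketch of the idea is below; the example data frame, its url column, and the output file names are assumptions for illustration rather than the original code.

# Assumed input: a data frame with a column named "url"
urls <- data.frame(url = c("http://example.com/page1", "http://example.com/page2"),
                   stringsAsFactors = FALSE)

for (i in seq_len(nrow(urls))) {
  # Download the raw HTML; skip pages that fail to load
  html <- tryCatch(paste(readLines(urls$url[i], warn = FALSE), collapse = " "),
                   error = function(e) NA)
  if (is.na(html)) next
  # Strip HTML tags and collapse whitespace
  text <- gsub("<[^>]+>", " ", html)
  text <- gsub("\\s+", " ", text)
  # Save each webpage as its own .txt file
  writeLines(text, paste0("page_", i, ".txt"))
}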