
Posts

Showing posts with the label web crawler

Interesting: Scraply

Ran across a new R package in development on GitHub. It's called scraply. It claims to provide "error-proof scraping in R". This could be a better solution than the one I was working on last year (see HERE). Haven't tried it yet, though.

Scrappy Scrapers

In an earlier post I presented some R code for a basic way of collecting text from websites. This is a good place to start for collecting text for use in text analysis. However, it clearly has some limitations: you need to have all of the URLs already stored in a .csv file, and the method of extracting the text from the downloaded HTML code using gsub() is a bit imprecise; it doesn't remove the text of common links such as "Home" or "About". Both of these problems can be solved in R with a bit of work, but I think for bigger scraping projects it is probably a good idea to use other languages such as Python or Ruby. ProPublica has an excellent little series on scraping that covers how to gather data from online databases and PDFs. This is a really good public service and enables something sadly unusual in journalism: reproducibility. Their chapter on using Ruby and Nokogiri to scrape Pfizer's doctor payments disclosure data...
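
For the gsub() imprecision in particular, one option is to parse the HTML and keep only the nodes you care about rather than stripping tags with a regex. A minimal sketch with the XML package is below; the URL and the choice of //p nodes are illustrative assumptions on my part, not code from the original post.

library(XML)

# Illustrative URL, not from the original post
raw <- paste(readLines("http://example.com/article", warn = FALSE), collapse = "\n")
doc <- htmlParse(raw, asText = TRUE)
# Keep only paragraph text; navigation link text like "Home" or "About"
# usually sits outside <p> tags, so it is dropped automatically
paras <- xpathSApply(doc, "//p", xmlValue)
text <- paste(paras, collapse = " ")

The same result can be had with rvest's read_html(), html_nodes(), and html_text() if you prefer those tools.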

Simple Text Web Crawler

I put together a simple web crawler for R. It's useful if you are doing any text analysis and need to make .txt files from webpages. If you have a data frame of URLs, it will cycle through them, grab each website, strip out the HTML code, and save each webpage as an individual text file. Thanks also to Rex Douglass. Enjoy (and please feel free to improve).
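
A minimal sketch of the idea is below; the example data frame, its url column, and the output file names are assumptions for illustration rather than the original code.

# Assumed input: a data frame with a column named "url"
urls <- data.frame(url = c("http://example.com/page1", "http://example.com/page2"),
                   stringsAsFactors = FALSE)

for (i in seq_len(nrow(urls))) {
  # Download the raw HTML; skip pages that fail to load
  html <- tryCatch(paste(readLines(urls$url[i], warn = FALSE), collapse = " "),
                   error = function(e) NA)
  if (is.na(html)) next
  # Strip HTML tags and collapse whitespace
  text <- gsub("<[^>]+>", " ", html)
  text <- gsub("\\s+", " ", text)
  # Save each webpage as its own .txt file
  writeLines(text, paste0("page_", i, ".txt"))
}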