13 October 2011

Simple Text Web Crawler

I put together a simple web crawler for R. It's useful if you are doing any text analysis and need to make .txt files from webpages. Given a data frame of URLs, it cycles through them, downloads each page, strips out the HTML code, and saves each page as an individual text file.

Thanks also to Rex Douglass.

Enjoy (and please feel free to improve it).
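
The gist embedded in the original post isn't reproduced here, but a minimal sketch of the approach described above might look like the following. The data frame name pages, its url column, and the output file names are placeholders rather than the original code, and the sketch uses the RCurl and XML packages.

    # A minimal sketch (not the original gist): loop over a data frame of URLs,
    # download each page, strip out the HTML, and write one .txt file per page.
    library(RCurl)  # getURL() for downloading pages
    library(XML)    # htmlParse() and xpathSApply() for stripping HTML

    # Placeholder input: a data frame with one column of URLs
    pages <- data.frame(
      url = c("http://example.com/page1", "http://example.com/page2"),
      stringsAsFactors = FALSE
    )

    for (i in seq_len(nrow(pages))) {
      raw  <- getURL(pages$url[i])                     # grab the page source
      doc  <- htmlParse(raw, asText = TRUE)            # parse the HTML
      text <- xpathSApply(doc, "//text()", xmlValue)   # keep only the text nodes
      writeLines(paste(text, collapse = " "),
                 con = paste0("page_", i, ".txt"))     # save as an individual text file
    }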

2 comments:

Kasper Christensen said...

Nice piece of code. Does what it is supposed to.

Do you have any suggestions for how one can delay the code by x seconds? When using the code to retrieve many pages from the same server, I am overloading the server, which gives me "bad" files with no text, and probably some angry hosts, which is not my intention.

I solved the problem by taking 5% of the total number of pages at a time. So I believe a solution would be to count the total number of pages in the input file and tell the code to only send about 50 requests, or 5% of the total, at a time.

Best
Kasper

Christopher Gandrud said...

That's a good suggestion. I think I like it better than the approach I took later here.
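
A minimal sketch of the delay Kasper asks about and the batching he describes, assuming the same loop and pages data frame as in the sketch above (all names and the delay value are placeholders): Sys.sleep() pauses between requests, and splitting the URL indices into groups sends only a small share of the requests at a time.

    # A sketch of rate limiting (not the original code): pause between requests
    # and work through the URLs in small batches so the server isn't overloaded.
    delay_secs <- 2                             # placeholder pause between requests
    batch_size <- ceiling(0.05 * nrow(pages))   # roughly 5% of the pages at a time

    # Split the row indices into consecutive batches of batch_size
    batches <- split(seq_len(nrow(pages)),
                     ceiling(seq_len(nrow(pages)) / batch_size))

    for (batch in batches) {
      for (i in batch) {
        raw  <- getURL(pages$url[i])
        doc  <- htmlParse(raw, asText = TRUE)
        text <- xpathSApply(doc, "//text()", xmlValue)
        writeLines(paste(text, collapse = " "),
                   con = paste0("page_", i, ".txt"))
        Sys.sleep(delay_secs)                   # wait before the next request
      }
    }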