Simple Text Web Crawler

I put together a simple web crawler for R. It's useful if you are doing any text analysis and need to make .txt files from webpages. If you have a data frame of URLs, it cycles through them, fetches each page, strips out the HTML code, and saves each webpage as an individual text file.

Thanks to Rex Douglass, also.

Enjoy (and please feel free to improve).
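
For readers who want to see the shape of such a crawler, here is a minimal sketch in base R. It is an illustration of the steps described above, not the original code: the function names strip_html and crawl_to_text, the output file naming, and the regex-based tag stripping are all assumptions, and a regex is only a rough substitute for a real HTML parser.

```r
# Hypothetical helper: crude, regex-based HTML stripping (not a full parser).
strip_html <- function(html) {
  text <- paste(html, collapse = "\n")
  # Drop <script> and <style> blocks together with their contents.
  text <- gsub("(?is)<(script|style)\\b.*?</\\1\\s*>", " ", text, perl = TRUE)
  # Drop all remaining tags.
  text <- gsub("<[^>]*>", " ", text, perl = TRUE)
  # Collapse runs of whitespace and trim the ends.
  text <- gsub("\\s+", " ", text, perl = TRUE)
  trimws(text)
}

# Hypothetical driver: fetch each URL, strip the HTML, save one .txt per page.
crawl_to_text <- function(urls, out_dir = "pages") {
  urls <- as.character(urls)  # data frame columns may be factors
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  saved <- character(0)
  for (i in seq_along(urls)) {
    # readLines() accepts a URL or a local path; failures return NULL here.
    html <- tryCatch(
      suppressWarnings(readLines(urls[i], warn = FALSE)),
      error = function(e) NULL
    )
    if (is.null(html)) next  # skip pages that could not be fetched
    out_file <- file.path(out_dir, sprintf("page_%04d.txt", i))
    writeLines(strip_html(html), out_file)
    saved <- c(saved, out_file)
  }
  invisible(saved)  # paths of the files that were written
}
```

With a data frame called, say, url_df that has a column url, the call would be crawl_to_text(url_df$url). Pages that fail to download are skipped rather than stopping the loop.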

Comments

Unknown said…
Nice piece of code. Does what it is supposed to.

Do you have any suggestions on how one can delay the code by x seconds? When I use the code to retrieve many pages from the same server, I overload the server, which gives me "bad" files with no text and probably some angry hosts, which is not my intention.

I solved the problem by taking 5% of the total number of pages at a time. I therefore believe a solution would be to count the total number of pages in the input file and tell the code to send only, say, 50 requests or 5% of the total at a time.

Best
Kasper
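
A delay and the batching idea from this comment can be sketched in base R with Sys.sleep(). This is only an illustration under assumptions: the names make_batches and fetch_politely, the fetch_one argument, and the default delay values are hypothetical and should be tuned for the server being crawled.

```r
# Split the indices 1..n into consecutive batches of at most batch_size.
make_batches <- function(n, batch_size) {
  split(seq_len(n), ceiling(seq_len(n) / batch_size))
}

# Hypothetical wrapper: call fetch_one(url) for each URL, pausing `delay`
# seconds after every request and `batch_pause` seconds between batches.
# Each batch holds batch_frac of the URLs (5% by default, at least one).
fetch_politely <- function(urls, fetch_one, delay = 2,
                           batch_frac = 0.05, batch_pause = 30) {
  urls <- as.character(urls)
  batch_size <- max(1, ceiling(length(urls) * batch_frac))
  batches <- make_batches(length(urls), batch_size)
  results <- vector("list", length(urls))
  for (b in seq_along(batches)) {
    for (i in batches[[b]]) {
      value <- tryCatch(fetch_one(urls[i]), error = function(e) NULL)
      results[i] <- list(value)  # keeps the slot even when value is NULL
      Sys.sleep(delay)           # wait between individual requests
    }
    if (b < length(batches)) Sys.sleep(batch_pause)  # rest between batches
  }
  results
}
```

Here fetch_one could be any function that downloads a single page, for example function(u) readLines(u, warn = FALSE). Failed requests are stored as NULL so that the result stays aligned with the input URLs.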
Unknown said…
That's a good suggestion. I think I like it better than the approach I took later here.
Salim KHALIL said…
You can use an R web crawler and scraper called Rcrawler. It's designed to crawl, parse, store, and extract the contents of web pages automatically.
install.packages("Rcrawler")
See the manual for more detail: R web scraper