I put together a simple web crawler for R. It's useful if you are doing text analysis and need to turn webpages into .txt files. Given a .csv file of URLs (read into a data frame), it cycles through them, fetches each page, strips out the HTML tags, and saves every page as an individual text file.
Thanks to Rex Douglass, also.
Enjoy (and please feel free to improve)
# Load RCurl package
library(RCurl)

# Create a .csv file with one column listing all of the links you want to crawl
addresses <- read.csv("~/links.csv", stringsAsFactors = FALSE)

# Fetch every page (getURL is vectorised over a character vector of URLs)
full.text <- getURL(addresses[, 1])

# Remove HTML tags
text.sub <- gsub("<.+?>", "", full.text)
text <- data.frame(text = text.sub, stringsAsFactors = FALSE)

# Write each page to its own .txt file (note: this is a Mac OS-style path)
outpath <- "~/text.indv"
for (i in seq_len(nrow(text))) {
  write(as.character(text[i, 1]), file = paste(outpath, "/", i, ".txt", sep = ""))
}
Comments
Do you have any suggestions for how to delay the code by x seconds? When using the code to retrieve many pages from the same server, I am overloading it, which gives me "bad" files with no text and probably some angry hosts, which is not my intention.
I worked around the problem by taking 5% of the total number of pages at a time. So I believe a solution would be to count the total number of pages in the input file and tell the code to only send, say, 50 requests or 5% of the total at a time.
Best
Kasper
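One simple way to throttle the crawl (not from the original post, just a sketch) is to fetch one URL per iteration and pause with Sys.sleep() between requests; the 5-second delay and the file paths below are assumptions you can adjust.

library(RCurl)

addresses <- read.csv("~/links.csv", stringsAsFactors = FALSE)
outpath <- "~/text.indv"

# Fetch one page per iteration and pause between requests so the server
# is not flooded; the 5-second delay is an arbitrary example value.
for (i in seq_len(nrow(addresses))) {
  full.text <- getURL(addresses[i, 1])
  text.sub <- gsub("<.+?>", "", full.text)  # strip HTML tags
  write(text.sub, file = paste(outpath, "/", i, ".txt", sep = ""))
  Sys.sleep(5)  # wait 5 seconds before the next request
}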
install.packages("Rcrawler")
See the manual for more detail: R web scraper
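For reference, a minimal sketch of how the Rcrawler package might be called; the site URL below is a placeholder, and the arguments should be checked against the package manual.

# install.packages("Rcrawler")
library(Rcrawler)

# Crawl a site and download its pages; "https://example.com" is a placeholder URL.
Rcrawler(Website = "https://example.com")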