Comments on Christopher Gandrud (간드루드 크리스토파): Simple Text Web Crawler
(feed last updated 2024-03-17)

Salim KHALIL (2017-06-29 21:33):
You can use an R web crawler and scraper called RCrawler. It is designed to crawl, parse, store, and extract the contents of web pages automatically.

install.packages("Rcrawler")

See the manual for more detail here: <a href="https://CRAN.R-project.org/package=Rcrawler" rel="nofollow">R web scraper</a>

Anonymous (2012-08-28 06:25):
That's a good suggestion. I think I like it better than the approach I took later <a href="http://christophergandrud.blogspot.kr/2012/02/how-to-extract-text-from-multiple.html" rel="nofollow">here</a>.

Anonymous (2012-08-12 21:33):
Nice piece of code. Does what it is supposed to.

Do you have any suggestions for how to delay the code by x seconds? When I use the code to retrieve many pages from the same server, I overload the server, which gives me "bad" files with no text, and probably some angry hosts, which is not my intention.

I solved the problem by taking 5% of the total number of pages at a time. A better solution might be to count the total number of pages in the input file and tell the code to send only about 50 requests, or 5% of the total, at a time.

Best,
Kasper
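Kasper's delay question can be handled in base R with Sys.sleep(). The sketch below is a minimal illustration, not code from the original post: the polite_download name, the urls argument, and the 2-second and 30-second pauses are all assumptions chosen to mirror the "send a chunk at a time" workaround described in the comment.

```r
# Download a list of pages politely, pausing between requests so the
# server is not flooded. Function name and delay values are illustrative.
polite_download <- function(urls, delay = 2, batch_size = 50, batch_pause = 30) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    pages[[i]] <- readLines(urls[i], warn = FALSE)  # fetch one page
    Sys.sleep(delay)                                # wait `delay` seconds
    # After every `batch_size` requests, take a longer break, echoing
    # Kasper's idea of sending only ~50 requests (or 5% of the total) at a time.
    if (i %% batch_size == 0) Sys.sleep(batch_pause)
  }
  pages
}
```

This trades speed for reliability: the per-request pause makes empty "bad" responses less likely, and the longer batch pause approximates the 5%-at-a-time approach without manual splitting of the input file.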