How you build it will depend on the service you host it on. With AWS, you'll use EC2, S3, and, perhaps most importantly, SQS.
Fundamentally for a crawler, you will need the following:
1. A list of URLs to crawl, perhaps even ranked in priority of crawling. This is a database of sorts.
2. A set of crawlers that figure out the most important URL on the list and fetch it.
3. A parser and HTML storage service. The parser will also feed new URLs into the list.
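To make the shape of those three pieces concrete, here's a minimal single-process sketch using only the Python standard library. The seed URL, page cap, and in-memory storage are stand-ins; a real crawler would also need robots.txt handling, per-host politeness, and durable storage (e.g. S3).

```python
# Toy frontier/fetcher/parser loop -- illustrative only.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=50):
    frontier = deque([seed])          # 1. the list of URLs to crawl
    seen = {seed}
    pages = {}                        # 3. "storage" (in memory here)
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()      # 2. fetch the next URL
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                  # skip pages that fail to fetch
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:     # 3. the parser feeds new URLs back in
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```

Splitting this loop apart is basically what the AWS version below does: the frontier becomes a database plus a queue, and the fetch/parse steps become separate fleets of instances.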
Each of the above pieces is easy to build on its own. The trick is how you glue them together. I would suggest something like the following as a starting point for crawling on AWS:
1. A MySQL list of URLs with some kind of priority ranking. This can be a cluster of EC2 instances that store and prioritize the links. Early on, you can ignore the prioritization aspects.
2. The URL cluster dispatches SQS messages, one per URL to crawl, in the desired crawling order.
3. A cluster of EC2 instances polls the SQS queue for crawling messages and fetches the URL specified in each message. While a message is being processed, SQS keeps it hidden (the visibility timeout), so other workers can move on to different URLs. A rough sketch of this split follows the list.
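Here's roughly what that dispatcher/worker split could look like with boto3, assuming the queue already exists. QUEUE_URL is a placeholder, and storing the page / handing it to the parser is left as a comment.

```python
import boto3
from urllib.request import urlopen

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"  # placeholder


def dispatch(urls):
    """URL-cluster side: push URLs onto the queue in priority order."""
    for url in urls:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=url)


def worker_loop():
    """Crawler side: pull a message, fetch it, delete it on success."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,        # long polling
            VisibilityTimeout=120,     # message stays hidden while we work
        )
        for msg in resp.get("Messages", []):
            url = msg["Body"]
            try:
                page = urlopen(url, timeout=10).read()
                # ...store `page` (e.g. in S3) and hand it to the parser...
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg["ReceiptHandle"])
            except Exception:
                pass  # message becomes visible again after the timeout
```

Long polling (WaitTimeSeconds) keeps the workers from hammering SQS when the queue is empty, and the visibility timeout means a worker that dies mid-fetch simply lets the message reappear for someone else.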
You can make the whole thing dynamic by adding crawling instances whenever the queue gets too long. You can also have instances that determine the crawling priority for the next pass (one metric is the number of backlinks to a page). Another set of instances might be parsers or do the actual analysis of the crawled pages.
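One way to drive the "add crawlers when the queue gets long" idea is to poll the queue depth. The threshold and the scaling action here are assumptions; in practice a CloudWatch alarm plus an Auto Scaling policy would do the same job with less code.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"  # placeholder


def queue_depth():
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])


def should_add_crawlers(threshold=10_000):
    # If the backlog exceeds the threshold, signal whatever launches
    # new EC2 crawler instances (left out here).
    return queue_depth() > threshold
```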
Which language to code it in? If you're going for maximal speed, perhaps you should consider a compiled language. If not, Python or PHP or Perl would do just fine. Personally I'd do it in a scripting language to begin with and invest the time into a faster crawler later if warranted.
Creating the list of URLs and prioritizing them is the hardest thing about building a crawler! That is, a good, web-scale one. A replacement for wget might be sort of fun, but the real way to make a fast crawler is to be choosy: track which pages get updated frequently, which are likely to contain good content (by computing a PageRank-like stat on the fly), and so on.
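As a toy illustration of priority by backlink count (nowhere near web-scale, but it shows the shape of a priority frontier):

```python
import heapq


class PriorityFrontier:
    def __init__(self):
        self._heap = []          # (negative score, url) for max-heap behavior
        self._backlinks = {}     # url -> number of pages seen linking to it

    def add_link(self, url):
        """Record one more backlink and (re)queue the URL with its new score."""
        self._backlinks[url] = self._backlinks.get(url, 0) + 1
        heapq.heappush(self._heap, (-self._backlinks[url], url))

    def pop_best(self):
        """Return the URL with the most backlinks seen so far, if any."""
        while self._heap:
            score, url = heapq.heappop(self._heap)
            # Skip stale entries whose score has since improved.
            if -score == self._backlinks.get(url):
                return url
        return None
```

A real frontier would also track which URLs have already been crawled, when each was last fetched, and how often each host may be hit; the point here is only that "what to crawl next" is a ranking problem, not a queue problem.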
It is far from my area of expertise, but the Wikipedia page about this looks very useful. It cites a bunch of wicked smart people. http://en.wikipedia.org/wiki/Web_crawler
If you just want to suck down a bunch of pages, then there's nothing wrong with wget.
Thanks for this. Really appreciate the detailed response. This is basically what I was looking to do, but rather than reinvent the wheel I was looking for something that already existed. Thank you for your time, and thanks to everyone who has commented so far - it has really helped.
And good luck!