How you build it will depend on the service you host it on. With AWS, you'll use EC2, S3, and, perhaps most importantly, SQS.
Fundamentally for a crawler, you will need the following:
1. A list of URLs to crawl, perhaps even ranked in priority of crawling. This is a database of sorts.
2. A set of crawlers that figure out the most important URL on the list and fetch it.
3. A parser and HTML storage service. The parser will also feed new URLs into the list.
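To make the shape of those three pieces concrete, here's a minimal single-process sketch using only the Python standard library. The seed URL, page cap, and in-memory storage are stand-ins; a real crawler would also need robots.txt handling, per-host politeness, and durable storage (e.g. S3).

```python
# Toy frontier/fetcher/parser loop -- illustrative only.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=50):
    frontier = deque([seed])          # 1. the list of URLs to crawl
    seen = {seed}
    pages = {}                        # 3. "storage" (in memory here)
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()      # 2. fetch the next URL
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                  # skip pages that fail to fetch
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:     # 3. the parser feeds new URLs back in
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```

Splitting this loop apart is basically what the AWS version below does: the frontier becomes a database plus a queue, and the fetch/parse steps become separate fleets of instances.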
Each of the above pieces is easy to build on its own. The trick is how you glue them together. I would suggest something like the following as a starting point for crawling on AWS:
1. A MySQL list of URLs with some kind of priority ranking. This can be a cluster of EC2 instances that store and prioritize the links. Early on, you can ignore the prioritization aspects.
2. The URL cluster dispatches SQS messages, one per URL to crawl, in the desired crawling order.
3. A cluster of EC2 instances polls the SQS queue for crawling messages and fetches the URL specified in each message. While a message is being processed, SQS keeps it hidden (the visibility timeout), so other workers can move on to different URLs. A rough sketch of this split follows the list.
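Here's roughly what that dispatcher/worker split could look like with boto3, assuming the queue already exists. QUEUE_URL is a placeholder, and storing the page / handing it to the parser is left as a comment.

```python
import boto3
from urllib.request import urlopen

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"  # placeholder


def dispatch(urls):
    """URL-cluster side: push URLs onto the queue in priority order."""
    for url in urls:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=url)


def worker_loop():
    """Crawler side: pull a message, fetch it, delete it on success."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,        # long polling
            VisibilityTimeout=120,     # message stays hidden while we work
        )
        for msg in resp.get("Messages", []):
            url = msg["Body"]
            try:
                page = urlopen(url, timeout=10).read()
                # ...store `page` (e.g. in S3) and hand it to the parser...
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg["ReceiptHandle"])
            except Exception:
                pass  # message becomes visible again after the timeout
```

Long polling (WaitTimeSeconds) keeps the workers from hammering SQS when the queue is empty, and the visibility timeout means a worker that dies mid-fetch simply lets the message reappear for someone else.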
You can make the whole thing dynamic by adding crawling instances whenever the queue gets too long. You can also have instances that determine the crawling priority for the next pass (one metric is the number of backlinks to a page). Another set of instances might be parsers or do the actual analysis of the crawled pages.
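One way to drive the "add crawlers when the queue gets long" idea is to poll the queue depth. The threshold and the scaling action here are assumptions; in practice a CloudWatch alarm plus an Auto Scaling policy would do the same job with less code.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"  # placeholder


def queue_depth():
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])


def should_add_crawlers(threshold=10_000):
    # If the backlog exceeds the threshold, signal whatever launches
    # new EC2 crawler instances (left out here).
    return queue_depth() > threshold
```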
Which language to code it in? If you're going for maximal speed, perhaps you should consider a compiled language. If not, Python or PHP or Perl would do just fine. Personally I'd do it in a scripting language to begin with and invest the time into a faster crawler later if warranted.
Creating the list of URLs and prioritizing them is the hardest thing about building a crawler! That is, a good, web-scale one. A replacement for wget might be sort of fun, but the real way to make a fast crawler is to be choosy: track which pages get updated frequently, which are likely to contain good content (by computing a PageRank-like stat on the fly), and so on.
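As a toy illustration of priority by backlink count (nowhere near web-scale, but it shows the shape of a priority frontier):

```python
import heapq


class PriorityFrontier:
    def __init__(self):
        self._heap = []          # (negative score, url) for max-heap behavior
        self._backlinks = {}     # url -> number of pages seen linking to it

    def add_link(self, url):
        """Record one more backlink and (re)queue the URL with its new score."""
        self._backlinks[url] = self._backlinks.get(url, 0) + 1
        heapq.heappush(self._heap, (-self._backlinks[url], url))

    def pop_best(self):
        """Return the URL with the most backlinks seen so far, if any."""
        while self._heap:
            score, url = heapq.heappop(self._heap)
            # Skip stale entries whose score has since improved.
            if -score == self._backlinks.get(url):
                return url
        return None
```

A real frontier would also track which URLs have already been crawled, when each was last fetched, and how often each host may be hit; the point here is only that "what to crawl next" is a ranking problem, not a queue problem.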
It is far from my area of expertise, but the Wikipedia page about this looks very useful. It cites a bunch of wicked smart people. http://en.wikipedia.org/wiki/Web_crawler
If you just want to suck down a bunch of pages, then there's nothing wrong with wget.
Thanks for this. Really appreciate the detailed response. This is basically what I was looking to do, but rather than reinvent the wheel I was looking for something that already existed. Thank you for your time, and thanks to everyone who has commented so far - it has really helped.
And good luck!