Hi there. We're just starting out and want to create a crawler that will sit on EC2. Any advice appreciated. Here's what we're thinking of:
1. Using Beautiful Soup for the actual parsing of pages
2. We're not sure what to use for the crawl itself :( We use Python and love it, but we don't know whether we need to write our own crawler or whether there's a better route. Any advice on this would be great.
3. I'd like to make the crawler distributed so we can replicate it across EC2 instances, but I'm not sure how to do this.
Apologies if I should ask this elsewhere. I love this community and have passively read many of the articles and comments on here for a couple of months now.
Any help, or a pointer in the right direction, would be appreciated.
John
Fundamentally for a crawler, you will need the following:
1. A list of URLs to crawl, perhaps even ranked by crawl priority. This is a database of sorts.
2. A set of crawlers that figure out the most important URL on the list and fetch it.
3. A parser and HTML storage service. The parser will also feed new URLs into the list.
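To make that concrete, here's a minimal single-process sketch of the loop, assuming the requests and beautifulsoup4 packages; the seed URL is illustrative and the `pages` dict is just a stand-in for real storage:

```python
# Minimal single-process sketch of the three pieces: a priority frontier,
# a fetcher, and a parser that feeds new URLs back into the frontier.
# The seed URL is illustrative and `pages` stands in for real HTML storage.
import heapq
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

frontier = [(0, "http://example.com/")]  # (priority, url); lower crawls sooner
seen = {"http://example.com/"}
pages = {}  # stand-in for the HTML storage service

while frontier:
    priority, url = heapq.heappop(frontier)  # take the most important URL
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        continue  # skip unreachable pages

    pages[url] = resp.text  # store the raw HTML

    # Parse out links and feed new URLs back into the frontier.
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if link not in seen:
            seen.add(link)
            heapq.heappush(frontier, (priority + 1, link))
```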
Each of the above pieces is easy to build on its own. The trick is gluing them together, especially across machines. I would suggest something like the following as a starter for using AWS for crawling:
1. A MySQL list of URLs with some kind of priority ranking. This can be a cluster of EC2 instances that store and prioritize the links. Early on, you can ignore the prioritization aspects.
2. The URL cluster dispatches the URLs to crawl as messages onto an SQS queue, in the desired crawling order.
3. A cluster of EC2 instances polls the SQS queue for crawl messages and fetches the URL in each one. While a message is being processed it's hidden from the other workers (SQS calls this the visibility timeout), so everyone else can move on, as sketched below.
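Here's roughly what each worker's loop could look like with boto3 (the AWS SDK for Python); the queue URL is a placeholder and handle_page stands in for whatever parsing/storage you hang off it. The "lock" is just SQS's visibility timeout: while one worker holds a message the queue hides it from everyone else, and if that worker dies, the message reappears for another to pick up.

```python
# Sketch of one crawl worker polling SQS. Assumes boto3 is installed and
# AWS credentials are configured; QUEUE_URL is hypothetical.
import boto3
import requests

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"

def handle_page(url, html):
    pass  # placeholder: hand off to your parser/storage instances

while True:
    # Long-poll for a message; VisibilityTimeout "locks" it so other
    # workers skip this URL while we're fetching it.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,
        VisibilityTimeout=120,
    )
    for msg in resp.get("Messages", []):
        url = msg["Body"]
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # message reappears after the timeout and gets retried

        handle_page(url, html)

        # Delete only on success; otherwise SQS re-delivers the message.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```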
You can make the whole thing dynamic by adding crawling instances whenever the queue gets too long. You can also have instances that recompute crawling priorities for the next pass (one useful metric is the number of backlinks to a page). Another set of instances might be parsers, or do the actual analysis of the crawled pages.
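For the dynamic part, SQS will tell you (approximately) how long the queue is, so a small watcher script can decide when to launch more crawler instances. A rough sketch, with an arbitrary threshold and the same hypothetical queue URL as above:

```python
# Rough sketch of a scaling check: read the approximate backlog from SQS
# and decide whether more crawler instances are needed. QUEUE_URL is
# hypothetical and the threshold is arbitrary -- tune it to your fetch rate.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"

attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL,
    AttributeNames=["ApproximateNumberOfMessages"],
)
backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

if backlog > 10000:
    print("Queue is backing up; launch more crawler instances.")
    # e.g. call ec2.run_instances(...) here, or let an Auto Scaling
    # policy on this same metric do it for you.
```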
Which language to code it in? If you're going for maximal speed, perhaps you should consider a compiled language, though crawling is mostly network-bound, so raw language speed matters less than you might think. Python or PHP or Perl would do just fine. Personally I'd do it in a scripting language to begin with and invest the time in a faster crawler later if warranted.
And good luck!