Ask YC: Where to start with creating a distributed crawler
30 points by groovyone on March 30, 2008 | 24 comments
Hi there. We're just starting out and want to create a crawler that will sit on EC2. Any advice appreciated. Here's what we're thinking of:

1. Using Beautiful Soup for the actual parsing of pages

2. We're not sure what to use for the crawl itself :( We use Python and love it, but we don't know whether we need to create our own crawler or what the best route would be. Any advice on this would be good.

3. I'd like to create a distributed crawler where we can replicate the crawler across EC2 instances, but I'm not sure how to do this.

Apologies if I should ask this elsewhere. I love this community and have passively read many of the articles and comments on here for a couple of months now.

Any help or a pointer in the right direction would be appreciated.

John




How you build it will depend on the service you host it on. So with AWS, you'll use EC2, S3 and perhaps most importantly, SQS.

Fundamentally for a crawler, you will need the following:

1. A list of URLs to crawl, perhaps even ranked in priority of crawling. This is a database of sorts.

2. A set of crawlers that figure out the most important URL on the list and fetch it.

3. A parser and HTML storage service. The parser will also feed new URLs into the list.

Each of the above pieces is easy to do on its own. The trick is how you glue them together. I would suggest something like the following as a starting point for crawling on AWS:

1. A MySQL list of URLs with some kind of priority ranking. This can be a cluster of EC2 instances that store and prioritize the links. Early on, you can ignore the prioritization aspects.

2. The URL cluster will dispatch queue messages of URLs to crawl in the desired crawling order.

3. A cluster of EC2 instances checks the SQS queue for crawl messages and fetches the URL specified in each message. While a message is being processed, SQS hides it from the other instances (the visibility timeout), so they can move on to other URLs.

You can make the whole thing dynamic by adding crawling instances if the queue gets too long. You can also have instances that determine the crawling priority for the next pass (one metric is the number of backlinks to a page). Another set of instances might be parsers, or do the actual analysis of the crawled pages.
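To make step 3 concrete, here is a rough sketch of one crawling instance's loop, assuming boto for the SQS access. The queue name and the fetch/parse helpers are just placeholders, and storing the raw HTML to S3 is left out:

    import urllib2
    import boto
    from boto.sqs.message import Message
    from urlparse import urljoin
    from BeautifulSoup import BeautifulSoup       # BeautifulSoup 3

    conn = boto.connect_sqs()                     # picks up AWS credentials from the environment
    queue = conn.create_queue('crawl-urls')       # hypothetical queue of URLs to fetch

    def fetch(url):
        return urllib2.urlopen(url).read()

    def extract_links(base_url, html):
        soup = BeautifulSoup(html)
        return [urljoin(base_url, a.get('href'))
                for a in soup.findAll('a') if a.get('href')]

    while True:
        msg = queue.read(visibility_timeout=120)  # hidden from other crawlers while we work on it
        if msg is None:
            continue                              # queue is empty; in practice, back off here
        url = msg.get_body()
        html = fetch(url)
        for link in extract_links(url, html):
            m = Message()
            m.set_body(link)
            queue.write(m)                        # simplest thing that works; in the design above,
                                                  # new links would go to the URL list for ranking instead
        queue.delete_message(msg)                 # done: remove it so it isn't redelivered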

Which language to code it in? If you're going for maximal speed, perhaps you should consider a compiled language. If not, Python or PHP or Perl would do just fine. Personally I'd do it in a scripting language to begin with and invest the time into a faster crawler later if warranted.

And good luck!


Creating the list of URLs and prioritizing them is the hardest thing about building a crawler! That is, a good, web-scale one. A replacement for wget might be sort of fun, but the real way to make a fast crawler is to be choosy: figure out which pages get updated frequently, which are likely to contain good content (by computing a PageRank-like stat on the fly), etc.
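Just to illustrate the shape of it (a toy, not web-scale): once you have some score per URL - update frequency, backlinks, a PageRank-ish number - the frontier itself can be as simple as a heap.

    import heapq

    frontier = []                                 # min-heap of (-score, url)

    def add_url(url, score):
        heapq.heappush(frontier, (-score, url))   # negate so the highest score pops first

    def next_url():
        return heapq.heappop(frontier)[1]

    add_url('http://example.com/about', 0.3)
    add_url('http://example.com/news', 2.1)       # updates often, lots of backlinks, etc.
    print next_url()                              # crawls /news first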

It is far from my area of expertise, but the Wikipedia page about this looks very useful. It cites a bunch of wicked smart people. http://en.wikipedia.org/wiki/Web_crawler

If you just want to suck down a bunch of pages, then there's nothing wrong with wget.


Thanks for this. Really appreciate the detailed response. This is basically what I was looking to do, but rather than re-invent the wheel I was looking for something that already existed. Thank you for your time, and thanks to everyone who has commented so far - it has really helped.


Check out Nutch; I'm not sure if it's exactly what you want, though. It's in Java, not Python, but it works with Hadoop quite nicely.


I concur with the Nutch vote, but more specifically, take a look at the crawler code in the src trunk written for use with Hadoop. That is probably a good place to start. Also worth a look is Heritrix (the crawler for archive.org). http://sourceforge.net/projects/archive-crawler Sadly, this too is written in Java.

The only Python one I am aware of for which code is available is: http://sourceforge.net/projects/ruya/

Edit: You might also want to take a look at http://wiki.apache.org/hadoop/AmazonEC2

Edit2: Polybot is another Python-based crawler, but no code is available. However, the paper has some interesting ideas:

Design and Implementation of a High-Performance Distributed Web Crawler. V. Shkapenyuk and T. Suel. IEEE International Conference on Data Engineering, February 2002. http://cis.poly.edu/westlab/polybot/


Good response. We've created a basic crawler in Python, but we're looking for something more powerful too. Heritrix above looks good.


Thanks. I'd still like to use Python to be honest (any Python suggestions?), but I'll give this a go. I'm going to do some more research and might post back my findings if anyone is interested in critiquing them. I'm creating this startup from scratch, so if there is anyone interested in the crawler side of things I'd be happy to chat either about collaboration or sharing ideas.


If you're looking at building your own crawler in Python from scratch, here's a benchmark of SGML parsers:

http://72.14.205.104/search?q=cache:LYoRD1GTP2UJ:www.oluyede...

We've been playing with sgmlop (http://effbot.org/zone/sgmlop-index.htm) for parsing and urllib2 (http://docs.python.org/lib/module-urllib2.html) for fetching.
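For the simplest version of that pairing, something like the following - this uses the stdlib's sgmllib rather than sgmlop (the callback interface is similar but not identical), so treat it as a sketch:

    import urllib2
    from sgmllib import SGMLParser

    class LinkCollector(SGMLParser):
        def reset(self):
            SGMLParser.reset(self)
            self.links = []
        def start_a(self, attrs):                 # called for every <a ...> start tag
            self.links.extend(value for name, value in attrs if name == 'href')

    def fetch_links(url):
        req = urllib2.Request(url, headers={'User-Agent': 'my-crawler/0.1'})
        html = urllib2.urlopen(req).read()
        parser = LinkCollector()
        parser.feed(html)
        parser.close()
        return parser.links

    print fetch_links('http://example.com/')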


I'm going to vote for Nutch too. Have heard good things.

Also, if someone on this list wants to work on a cool web spidering project (probably using Nutch), send me a message. I'm looking for someone.


I'm interested...


BeautifulSoup is not foolproof, meaning it does not always even approximate the way a browser would parse the HTML. One important failure is that it does not recognize when HTML tags are inside JavaScript strings (and so should not be treated as markup). Whether this matters depends on your application.
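To show the kind of markup in question, and one partial workaround - throwing away script elements before extracting links (this is only a mitigation; it does not fix mis-detected tag boundaries, e.g. a '</script>' inside a string):

    from BeautifulSoup import BeautifulSoup       # BeautifulSoup 3

    html = '''<html><body>
    <script type="text/javascript">
      document.write('<a href="/generated">not a real link</a>');
    </script>
    <a href="/real">a real link</a>
    </body></html>'''

    soup = BeautifulSoup(html)
    for script in soup.findAll('script'):
        script.extract()                          # drop script elements and everything inside them
    print [a.get('href') for a in soup.findAll('a')]   # only links in the actual markup should remain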


1.) Don't do it yourself. Use Amazon's Alexa Web Search service (aws.amazon.com). Through that you can access Alexa's 10-billion-page index, complete with all the pages, run complex queries, etc. It plays nicely with EC2.

2.) If you must do it yourself, Heritrix is the most sophisticated crawler out there (crawler.archive.org).

3.) Nutch is an option, but nowhere near as powerful as Heritrix.

Don't try to reinvent the wheel: writing a robust crawler is a lot of work, as there are endless edge cases to take care of (if you are looking into a general-purpose web crawler).


Nutch is good and I would second it, but I would suggest NOT building a crawler - it's not trivial and is ill-advised in a startup, unless, that is, your startup is just about building a crawler.


I have recently written a Beautiful Soup / Twisted crawler. To make it distributed, presumably we'd use Amazon's queue service.
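For a sense of the non-distributed core, a stripped-down version looks something like this (the names are illustrative, not my actual code):

    from twisted.internet import reactor
    from twisted.web.client import getPage
    from BeautifulSoup import BeautifulSoup

    seeds = ['http://example.com/', 'http://example.org/']
    pending = len(seeds)                          # simple countdown so we know when to stop

    def parsed(html, url):
        links = [a.get('href') for a in BeautifulSoup(html).findAll('a') if a.get('href')]
        print url, '->', len(links), 'links'
        finished()

    def failed(err, url):
        print url, 'failed:', err.getErrorMessage()
        finished()

    def finished():
        global pending
        pending -= 1
        if pending == 0:
            reactor.stop()

    for url in seeds:
        d = getPage(url)                          # returns a Deferred; fetches run concurrently
        d.addCallback(parsed, url)
        d.addErrback(failed, url)

    reactor.run()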

Feel free to get in touch if you're interested in the details.


Thanks John. I'll do that. Your resume looks awesome, mind. Not sure how much we'd be able to help you!


Thanks for the previous positive comments about Heritrix, which is my project at the Internet Archive. If anyone has questions, please send them my way.

Heritrix was designed for archival projects, which has meant an emphasis on having a "true record" (including non-text resources) and high configurability for inclusion/exclusion. Any text indexing or link-graph-analysis is completely external; we've used Nutch (without their crawler) for that.

Whole-web multi-billion-page crawls have not been the focus yet, though we've tried one and have heard of outside groups successfully using Heritrix for 2+ and 4+ billion page crawls.

Our distribution story is spotty; we provide some options that help you split the URL space you want to crawl across crawlers, and remote-control crawlers from other programs, but syncing their launch and other steps is left to an expert operator's own devices. We've run coordinated Heritrix crawls on groups of 4-8 machines (dual-Opteron, 4GB+ RAM, 4x500GB+ HDs) and understand others have used up to 12.
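For what it's worth, the splitting itself usually comes down to something like hashing the hostname, so each crawler owns a disjoint set of sites and per-host politeness stays in one place (just an illustration of the general idea, not Heritrix's actual code):

    import zlib
    from urlparse import urlparse                 # Python 2; urllib.parse in Python 3

    NUM_CRAWLERS = 8

    def owner(url):
        host = urlparse(url).hostname or ''
        return zlib.crc32(host) % NUM_CRAWLERS    # stable hash of the host -> crawler index

    print owner('http://example.com/page1')       # the same host always maps to the same crawler
    print owner('http://example.com/page2')
    print owner('http://example.org/')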


Given the way that Google is heading ( http://news.ycombinator.com/item?id=149894 ), you should have started at least six months ago. Regardless:

1. 10 years ago, at least 99% of web pages failed validation. Nowadays, the majority still fail. You could validate and then fall through to tag soup processing (sketched after this list).

2. 10 years ago, the conventional wisdom ( http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTO... ) was to use a compiled language, such as C, for spidering ( http://www.tbray.org/ongoing/When/200x/2003/12/03/Robots ). Given that memory increases faster than processing power which increases faster than bandwidth, this may not be the case nowadays.

3. That's the meta problem. Solve that and you may find that a search engine is easier.
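As a sketch of the validate-then-fall-through idea in point 1 - well-formedness via ElementTree is only a crude stand-in for real validation, and the exceptions caught cover both older and newer ElementTree versions:

    from xml.etree import ElementTree
    from xml.parsers.expat import ExpatError
    from BeautifulSoup import BeautifulSoup

    def parse_page(html):
        try:
            return 'strict', ElementTree.fromstring(html)   # clean (X)HTML: fast, strict path
        except (ExpatError, SyntaxError):                   # most real pages end up here
            return 'soup', BeautifulSoup(html)              # fall through to tag soup processing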


I recently built a crawler of my own in Ruby after trying, and ultimately deciding against, Nutch. Depending on what you want to do with your crawl, there is a very good chance that you'll be able to write a small crawler on your own that is much easier to extend, and you'll probably be able to write it in the time it would take you to install, set up, and configure Nutch.

As I said, I used Ruby, specifically Hpricot, for the page parsing. I'm starting to run into problems with Hpricot right now, though, and I may actually try a Python version with Beautiful Soup very soon. Let me know how it goes for you and maybe we can share some code.


Hi there. I think I've decided we need to build our own, mainly as what we want to do is quite specific - monitoring sites for keywords - and really we should understand this technology to a degree. Happy to share if you end up heading down the Python route. If there is anyone else on here doing crawling/data mining then maybe we could share ideas and help each other somehow :) My email is in my profile.


Have you seen this article yet? Looks like it would be useful to you. http://blog.ianbicking.org/2008/03/30/python-html-parser-per...


+10 for Python, +1 for Nutch, +1 for Hadoop, +1 for Amazon EC2.

I think you have bundled two things (crawler and parser, or maybe scraper) into one term: crawler.

Beautiful Soup is OK. Give html5lib a try (on Google Code) - at some point you are going to have to hack the parser, but that depends on what kind of parsing you want to do.
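For instance, a quick sketch using html5lib's DOM tree builder (the draw is that it error-recovers the way browsers do):

    import html5lib
    from html5lib import treebuilders

    html = '<p>broken markup <a href="/one">one <a href="/two">two'

    parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
    doc = parser.parse(html)                      # builds a proper tree even from broken input
    print [a.getAttribute('href') for a in doc.getElementsByTagName('a')]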


Thanks for that suggestion. Will have a good look at it.


Look at Harvestman. Quite useful


I looked, but this is not distributed, so currently (whilst good) it's limited to one server.



