Considering that it's Google, I'm honestly surprised they haven't created a crawler application that can execute the javascript on a given page and re-parse it based on the new layout.
You wouldn't think it would be that hard, either-- take Chrome, remove UI, add crawler. Chrome's already got the functionality in "Inspect Element" to show dynamically-created content.
Google both heuristically parses JavaScript and executes it, for at least some fraction of their crawl. This is obviously much more expensive than plain HTML parsing, particularly once the Internet largely goes from static HTML to AJAXy magic.
Granted, but if Google can't parse it, I highly doubt anyone else is going to. And as nice as it would be if every web developer read the "Google Guide to Being Nice to Our Web Crawlers", as an internet we still can't get away from IE6. The term "pipe dream" comes to mind.
That would be a fairly bad idea. JavaScript execution, and in particular AJAX calls, frequently changes data. If I were Google, I wouldn't want my crawler to delete or edit lots of data.
No more or less than following a link with a GET variable. I'm trying to find the article posted on TDWTF about a link that went to "&delete=true" and dropped the entire database... but AJAX calls are no more or less likely to change data than any one of a hundred links on survey sites that send their data with GET variables.
I wish they would make it possible to do the whole bit with "pretty" urls. Example:
Users go to: /#/abc
Google goes to /abc
and they both see the same content.
Obviously you'd need some way to tell Google that your site follows this scheme, but then you wouldn't have to change all of your URLs, and existing external links to pages on your site would accumulate PageRank appropriately.
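Something like the following is one way the two forms could be wired up. This is just my own sketch, assuming an Express-style server in TypeScript; the `pages` store, the `/fragment/` route, and the hash-router shell are placeholders, not anything the article or Google specifies:

```typescript
import express from "express";

const app = express();

// Hypothetical content store keyed by slug.
const pages: Record<string, string> = {
  abc: "<h1>abc</h1><p>The same content, whichever URL you arrive by.</p>",
};

// Users at /#/abc first receive the shell; client code reads location.hash
// and fetches /fragment/abc to fill the page in.
app.get("/", (_req, res) => {
  res.send(`<!doctype html><html><body><div id="app"></div>
    <script src="/hash-router.js"></script></body></html>`);
});

// Googlebot (and anyone without JavaScript) requests /abc and gets a full page.
app.get("/:slug", (req, res) => {
  const body = pages[req.params.slug];
  if (!body) {
    res.status(404).send("Not found");
    return;
  }
  res.send(`<!doctype html><html><body>${body}</body></html>`);
});

// The AJAX front end asks for just the fragment and injects it into #app.
app.get("/fragment/:slug", (req, res) => {
  const body = pages[req.params.slug];
  if (!body) {
    res.status(404).send("Not found");
    return;
  }
  res.send(body);
});

app.listen(3000);
```

The part a sketch like this can't solve is the one the comment is really about: Google would still need to be told, somehow, that /#/abc and /abc are the same resource, since nothing in the URLs themselves declares the mapping.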
I've been thinking about writing a blog article on this subject. Basically, AJAX is like chocolate: it's really tempting, but eating too much, or for the wrong reasons, is not a good idea.
I've just rewritten one of my sites, which used AJAX to load content onto a single index page. I've now made sure every unique resource has a unique URL. This is good for browser navigation, good for Google, and good for promoting a single resource on a service like Twitter. Now my site, which looked like one page to Google, is almost 500 pages.
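There's also a middle ground that keeps the AJAX loading but still gives every resource its own real URL: the HTML5 history API. A minimal client-side sketch, under my own assumptions (the server can render each path as a full page, and will return just the body to an XHR-style request; `#content` and the `data-ajax` attribute are placeholders):

```typescript
// Intercept clicks on links marked data-ajax and load them without a full reload.
document.addEventListener("click", (ev) => {
  const link = (ev.target as Element | null)?.closest("a[data-ajax]");
  if (!(link instanceof HTMLAnchorElement)) return;
  ev.preventDefault();
  loadArticle(link.pathname);
  history.pushState(null, "", link.pathname); // a real URL, no #fragment
});

// Back/forward buttons re-load whatever URL is already in the address bar.
window.addEventListener("popstate", () => loadArticle(location.pathname));

async function loadArticle(path: string): Promise<void> {
  // Assumed convention: the server returns only the page body when this
  // header is present, and the full document otherwise.
  const res = await fetch(path, { headers: { "X-Requested-With": "fetch" } });
  const content = document.querySelector("#content");
  if (content) content.innerHTML = await res.text();
}
```

The crawler never runs the script, so it just sees ordinary links to ordinary URLs, while users still get the AJAX speed-up.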
I just recently implemented this. The thing I'm having trouble with is that they don't say how long the crawler is willing to wait (or rather, they don't specify a limit). For instance, what if it takes 20 seconds to load an acceptable amount of JavaScript-created HTML?
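One workaround I've been considering is to hand the crawler a pre-rendered snapshot so the wait time never comes into play. A sketch, assuming the hashbang / _escaped_fragment_ convention and a snapshot cache filled ahead of time; the names below are placeholders, not anything the spec mandates:

```typescript
import express from "express";

const app = express();

// Hypothetical cache of pre-rendered HTML, generated offline (e.g. by running
// the page in a browser once and saving the resulting DOM).
const snapshots: Record<string, string> = {
  "/abc": "<!doctype html><html><body><h1>abc</h1></body></html>",
};

app.get("/", (req, res) => {
  const fragment = req.query["_escaped_fragment_"];
  if (typeof fragment === "string") {
    // A crawler request for /#!/abc arrives as /?_escaped_fragment_=/abc:
    // return finished HTML immediately, so its timeout never matters.
    res.send(snapshots[fragment] ?? "<!doctype html><html><body></body></html>");
    return;
  }
  // Normal users get the JavaScript shell and wait however long it takes.
  res.send(`<!doctype html><html><body><div id="app"></div>
    <script src="/app.js"></script></body></html>`);
});

app.listen(3000);
```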