Yahoo has a very similar service called Yahoo YQL. It can parse any web resource like HTML or XML and is conveniently available in JSONP format to bypass browser cross-domain limitations. It also supports processing the data with server-side JavaScript.
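For anyone who wants to kick the tires, here's roughly what a call looks like from the browser. The endpoint and query syntax are from memory of the YQL docs, so treat this as a sketch and double-check against the official documentation:

    // Rough sketch of a JSONP call to YQL's public endpoint.
    // YQL fetches the page server-side and returns the nodes matching
    // an XPath, so the browser never hits the remote host directly.
    function yqlScrape(url, xpath, callbackName) {
      var query = 'select * from html where url="' + url + '"' +
                  ' and xpath="' + xpath + '"';
      var src = 'http://query.yahooapis.com/v1/public/yql' +
                '?q=' + encodeURIComponent(query) +
                '&format=json' +
                '&callback=' + callbackName;  // JSONP: YQL wraps the JSON in this function
      var script = document.createElement('script');
      script.src = src;
      document.body.appendChild(script);
    }

    // Example: pull the headlines off a (hypothetical) page.
    function handleResults(data) {
      var results = data.query && data.query.results;
      console.log(results);
    }
    yqlScrape('http://example.com/', '//h1', 'handleResults');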
Wow, I was just thinking about building something like this, maybe a month or so ago...
...it would be beautiful if you could plug other things into the "provider" side of this, like RDBMSes, etc., and make that whole body of data queryable.
Curious... how are they planning on handling situations where user-created feeds break a site's TOS? What liability does that place on this company?
Below is an example (the Google Maps TOS). Not that one could input a Google Maps feed, but it illustrates the point... Almost every major news media site has a similar TOS restriction.
2. Restrictions on Use. Unless you have received prior written authorization from Google (or, as applicable, from the provider of particular Content), you must not:
(a) access or use the Products or any Content through any technology or means other than those provided in the Products, or through other explicitly authorized means Google may designate...
This has me pondering whether there should be some kind of robots.txt check involved. There are implications to arbitrarily requesting data from someone else's server, mainly not breaking EULAs and TOSs. Any thoughts on this?
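For what it's worth, the kind of check I have in mind is something like this naive sketch (function names made up; no wildcard rules or Crawl-delay, just prefix-matched Disallow lines):

    // Minimal, naive robots.txt check - a sketch, not a full spec implementation.
    // isAllowed(robotsTxt, userAgent, path) -> true/false
    function isAllowed(robotsTxt, userAgent, path) {
      var lines = robotsTxt.split('\n');
      var applies = false;      // are we inside a record that matches our agent?
      var disallowed = [];
      for (var i = 0; i < lines.length; i++) {
        var line = lines[i].split('#')[0].trim();   // strip comments
        if (!line) continue;
        var idx = line.indexOf(':');
        if (idx < 0) continue;
        var field = line.slice(0, idx).trim().toLowerCase();
        var value = line.slice(idx + 1).trim();
        if (field === 'user-agent') {
          applies = (value === '*' ||
                     userAgent.toLowerCase().indexOf(value.toLowerCase()) !== -1);
        } else if (field === 'disallow' && applies && value) {
          disallowed.push(value);
        }
      }
      for (var j = 0; j < disallowed.length; j++) {
        if (path.indexOf(disallowed[j]) === 0) return false;  // prefix match
      }
      return true;
    }

    // e.g. isAllowed("User-agent: *\nDisallow: /private/", "MyScraper", "/private/data")
    //      -> false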
Has a TOS-like set of rules ever been proven valid / enforceable? It seems to me they're just a wish list / guidelines for unregistered users. It's a different thing when you actually make users accept those terms during registration, of course - then both parties know the rules. But for an anonymous visitor? The service provides the content without (or before) showing the rules. IMHO that means any "don't use robots" rules are not enforceable at all.
I don't mean to imply that robots.txt should not be respected, though (it should be, if you're a white hat) - just that it's legal to disregard it if you never explicitly accepted the terms.
I don't know if it's been tested in court yet, but the robots.txt conventions have over a decade of widespread custom behind them. It would be hard for anyone competent enough to do sophisticated crawling to plead ignorance of them. Custom counts in law.
Very interesting idea. A little scary that the implementation is writing web scrapers in JavaScript (which are prone to breaking), but interesting nonetheless.
Web scrapers in general are prone to breaking because the HTML design can change. A div could become a span class='blocky', and now your scrape doesn't work anymore.
One way to handle such breakage would be to define an extraction template on the fly, based on some content that doesn't change on the page. If, e.g., you know that the stock quote for Google has GOOG and some number somewhere inside the body, you could search for that pattern, save the XPath (or something like it), and reuse it on subsequent scrapes.
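Something like this is what I mean - only a sketch of the discover-then-cache idea, with made-up function names, not production code:

    // Anchor on content, not markup: find the element whose text contains a
    // known token (e.g. the ticker symbol) next to a number, record an XPath
    // to it, and reuse that XPath on later fetches.
    function findAnchorNode(root, token) {
      // depth-first walk for the deepest element whose text contains the
      // token followed by a number, e.g. "GOOG 527.30"
      var pattern = new RegExp(token + '\\s*[\\$]?[\\d.,]+');
      var match = null;
      (function walk(node) {
        if (node.nodeType !== 1) return;            // elements only
        if (pattern.test(node.textContent)) {
          match = node;                             // keep the deepest match
          for (var i = 0; i < node.children.length; i++) walk(node.children[i]);
        }
      })(root);
      return match;
    }

    function xpathTo(node) {
      // build a simple positional XPath like /html[1]/body[1]/div[3]/span[1]
      var parts = [];
      for (; node && node.nodeType === 1; node = node.parentNode) {
        var index = 1, sib = node.previousElementSibling;
        for (; sib; sib = sib.previousElementSibling) {
          if (sib.tagName === node.tagName) index++;
        }
        parts.unshift(node.tagName.toLowerCase() + '[' + index + ']');
      }
      return '/' + parts.join('/');
    }

    // First run: discover and cache the XPath.
    var node = findAnchorNode(document.body, 'GOOG');
    var cachedXPath = node ? xpathTo(node) : null;

    // Later runs: try the cached XPath first, rediscover only if it fails.
    var hit = cachedXPath && document.evaluate(cachedXPath, document, null,
                XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
    if (!hit) { /* fall back to findAnchorNode and re-save the XPath */ }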
I think any other solution (for example, auto-reconfiguring smart scripts based on HTML changes) would quickly become so complex that it would be far more time-consuming to maintain than simply going and tweaking the individual JavaScript when an HTML change actually does occur.
http://developer.yahoo.com/yql/