Yahoo has a very similar service called Yahoo YQL. It can parse any web resource like HTML or XML and is conveniently available in JSONP format to bypass browser cross-domain limitations. It also supports processing the data with server-side JavaScript.
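For anyone who wants to kick the tires, here's roughly what a call looks like from the browser. The endpoint and query syntax are from memory of the YQL docs, so treat this as a sketch and double-check against the official documentation:

    // Rough sketch of a JSONP call to YQL's public endpoint.
    // YQL fetches the page server-side and returns the nodes matching
    // an XPath, so the browser never hits the remote host directly.
    function yqlScrape(url, xpath, callbackName) {
      var query = 'select * from html where url="' + url + '"' +
                  ' and xpath="' + xpath + '"';
      var src = 'http://query.yahooapis.com/v1/public/yql' +
                '?q=' + encodeURIComponent(query) +
                '&format=json' +
                '&callback=' + callbackName;  // JSONP: YQL wraps the JSON in this function
      var script = document.createElement('script');
      script.src = src;
      document.body.appendChild(script);
    }

    // Example: pull the headlines off a (hypothetical) page.
    function handleResults(data) {
      var results = data.query && data.query.results;
      console.log(results);
    }
    yqlScrape('http://example.com/', '//h1', 'handleResults');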
Wow, I was just thinking about building something like this, maybe a month or so ago...
...it would be beautiful if you could plug other things into the "provider" side of this, like RDBMSes, etc., and make that whole body of data queryable.
Curious... how are they planning on handling situations where user-created feeds break a site's TOS? What liability does that place on this company?
Below is an example (the Google Maps TOS). Not that one could input a Google Maps feed, but it illustrates the point... Almost every major news media site has a similar TOS restriction.
2. Restrictions on Use. Unless you have received prior written authorization from Google (or, as applicable, from the provider of particular Content), you must not:
(a) access or use the Products or any Content through any technology or means other than those provided in the Products, or through other explicitly authorized means Google may designate...
This has me pondering whether there should be some kind of robots.txt check involved. There are implications to arbitrarily requesting data from someone else's server, mainly not breaking EULAs and TOSs. Any thoughts on this?
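For what it's worth, the kind of check I have in mind is something like this naive sketch (function names made up; no wildcard rules or Crawl-delay, just prefix-matched Disallow lines):

    // Minimal, naive robots.txt check - a sketch, not a full spec implementation.
    // isAllowed(robotsTxt, userAgent, path) -> true/false
    function isAllowed(robotsTxt, userAgent, path) {
      var lines = robotsTxt.split('\n');
      var applies = false;      // are we inside a record that matches our agent?
      var disallowed = [];
      for (var i = 0; i < lines.length; i++) {
        var line = lines[i].split('#')[0].trim();   // strip comments
        if (!line) continue;
        var idx = line.indexOf(':');
        if (idx < 0) continue;
        var field = line.slice(0, idx).trim().toLowerCase();
        var value = line.slice(idx + 1).trim();
        if (field === 'user-agent') {
          applies = (value === '*' ||
                     userAgent.toLowerCase().indexOf(value.toLowerCase()) !== -1);
        } else if (field === 'disallow' && applies && value) {
          disallowed.push(value);
        }
      }
      for (var j = 0; j < disallowed.length; j++) {
        if (path.indexOf(disallowed[j]) === 0) return false;  // prefix match
      }
      return true;
    }

    // e.g. isAllowed("User-agent: *\nDisallow: /private/", "MyScraper", "/private/data")
    //      -> false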
Has a TOS-like set of rules ever been proven valid / enforceable? It seems to me they're just a wish list / guidelines for unregistered users. It's a different thing when you actually make users accept those terms during registration, of course - then both parties know the rules. But for an anonymous visitor? The service provides the content without (or before) showing the rules. IMHO that means any "don't use robots" rules are not enforceable at all.
I don't mean to imply that robots.txt should not be respected, though (it should be, if you're a white hat) - just that it's legal to disregard it if you never explicitly accepted the terms.
I don't know if it's been tested in court yet, but the robots.txt conventions have over a decade of widespread custom behind them. It would be hard for anyone competent enough to do sophisticated crawling to plead ignorance of them. Custom counts in law.
Very interesting idea. A little scary that the implementation is writing web scrapers in JavaScript (which are prone to breaking), but interesting nonetheless.
Web scrapers in general are prone to breaking because the HTML design can change. A div could become a span class='blocky', and now your scrape doesn't work anymore.
One way to handle such breakage would be to define an extraction template on the fly, based on some content that doesn't change on the page. If, e.g., you know that the stock quote for Google has GOOG and some number somewhere inside the body, you could search for that pattern, save the XPath (or something like it), and reuse it on subsequent scrapes.
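Something like this is what I mean - only a sketch of the discover-then-cache idea, with made-up function names, not production code:

    // Anchor on content, not markup: find the element whose text contains a
    // known token (e.g. the ticker symbol) next to a number, record an XPath
    // to it, and reuse that XPath on later fetches.
    function findAnchorNode(root, token) {
      // depth-first walk for the deepest element whose text contains the
      // token followed by a number, e.g. "GOOG 527.30"
      var pattern = new RegExp(token + '\\s*[\\$]?[\\d.,]+');
      var match = null;
      (function walk(node) {
        if (node.nodeType !== 1) return;            // elements only
        if (pattern.test(node.textContent)) {
          match = node;                             // keep the deepest match
          for (var i = 0; i < node.children.length; i++) walk(node.children[i]);
        }
      })(root);
      return match;
    }

    function xpathTo(node) {
      // build a simple positional XPath like /html[1]/body[1]/div[3]/span[1]
      var parts = [];
      for (; node && node.nodeType === 1; node = node.parentNode) {
        var index = 1, sib = node.previousElementSibling;
        for (; sib; sib = sib.previousElementSibling) {
          if (sib.tagName === node.tagName) index++;
        }
        parts.unshift(node.tagName.toLowerCase() + '[' + index + ']');
      }
      return '/' + parts.join('/');
    }

    // First run: discover and cache the XPath.
    var node = findAnchorNode(document.body, 'GOOG');
    var cachedXPath = node ? xpathTo(node) : null;

    // Later runs: try the cached XPath first, rediscover only if it fails.
    var hit = cachedXPath && document.evaluate(cachedXPath, document, null,
                XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
    if (!hit) { /* fall back to findAnchorNode and re-save the XPath */ }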
I think any other solution (for example, auto-reconfiguring smart scripts based on HTML changes) would quickly become so complex that it would be far more time-consuming to maintain than simply going and tweaking the individual JavaScript when an HTML change actually does occur.
http://developer.yahoo.com/yql/