JSonduit (jsonduit.com)
73 points by bdfh42 on June 14, 2010 | hide | past | favorite | 20 comments


Yahoo has a very similar service called Yahoo YQL. It can parse any web resource, like HTML or XML, and is conveniently available in JSONP format to bypass browser cross-domain limitations. It also supports processing the data with server-side JavaScript.

http://developer.yahoo.com/yql/
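
(For reference, a minimal sketch of what a YQL call looked like from the browser, assuming the public query.yahooapis.com endpoint and query syntax of that era; the quote page URL and the span id in the XPath are made-up placeholders.)

    // JSONP call to YQL: run an XPath query against arbitrary HTML, get JSON back.
    // Endpoint/query syntax circa 2010; page URL and span id are hypothetical.
    function handleQuote(data) {
      console.log(data.query.results);
    }

    var yql = "select * from html where url='http://finance.yahoo.com/q?s=GOOG'" +
              " and xpath='//span[@id=\"quote\"]'";

    var script = document.createElement('script');
    script.src = 'http://query.yahooapis.com/v1/public/yql' +
                 '?q=' + encodeURIComponent(yql) +
                 '&format=json&callback=handleQuote';
    document.body.appendChild(script);   // the JSONP response invokes handleQuote()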


And if you want to use XLS or CSV in YQL, you can do it like so: http://joubert.posterous.com/lift-legacy-excel-data-into-the...


Wow, I was just thinking about building something like this, maybe a month or so ago...

...it would be beautiful if you could plug other things into the "provider" side of this, like RDBMSs, etc., and make the whole body of data queryable.


Likewise - it's great how minds think alike and all that.


Curious... how are they planning on handling situations where user-created feeds break the TOS? What liability does that put on this company?

Below is an example (Google Maps TOS). Not that one would necessarily input a Google Maps feed, but it illustrates the point... Almost every major news media site has a similar TOS restriction.

2. Restrictions on Use. Unless you have received prior written authorization from Google (or, as applicable, from the provider of particular Content), you must not: (a) access or use the Products or any Content through any technology or means other than those provided in the Products, or through other explicitly authorized means Google may designate...

Not to rain on anyone's parade, just asking.


This has me pondering whether there should be some kind of robots.txt check involved. There are implications to arbitrarily requesting data from someone else's server, mainly around not breaking EULAs and TOSs. Any thoughts on this?


Has a TOS-like set of rules ever been proven valid / enforceable? It seems to me like they are just a wish list / guidelines for unregistered users. It's a different thing when you actually make users accept those terms during registration, of course - then both parties know the rules. But for an anonymous visitor? The service provides the content without / before showing the rules. IMHO that means any "don't use robots" rules are not enforceable at all.

I don't mean to imply that robots.txt should not be respected though (it should if you're a whitehat) - just that it's legal to disregard it if you don't explicitly accept the terms.


I don't know if it's yet been tested in court, but the robots.txt conventions have over a decade of widespread custom behind them. It would be hard for anyone competent enough to do sophisticated crawling to plead ignorance of them. Custom counts in law:

http://en.wikipedia.org/wiki/Custom_%28law%29


Absolutely. If it's automated and can be used on many different pages (basically a bot script), it should respect robots.txt.
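
A rough sketch of such a check, assuming a Node-style global fetch and handling only the "User-agent: *" group of robots.txt (the function name and the choice to allow on a missing file are my own assumptions):

    // Naive robots.txt check before scraping a URL (Node 18+ for global fetch).
    // Only honors the "User-agent: *" group; a real crawler should do more.
    async function allowedByRobots(targetUrl) {
      const url = new URL(targetUrl);
      const res = await fetch(url.origin + '/robots.txt');
      if (!res.ok) return true;                      // no robots.txt: assume allowed
      let applies = false, allowed = true;
      for (const raw of (await res.text()).split('\n')) {
        const line = raw.split('#')[0].trim();       // strip comments
        const [field, ...rest] = line.split(':');
        const value = rest.join(':').trim();
        if (/^user-agent$/i.test(field)) {
          applies = (value === '*');
        } else if (applies && /^disallow$/i.test(field) &&
                   value && url.pathname.startsWith(value)) {
          allowed = false;                           // path falls under a Disallow rule
        }
      }
      return allowed;
    }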


I like this. Nice and flexible. Reminds me of the project I did last year during the NYC Big Apps competition - http://elev.at


They'd better get rid of the JavaScript error affecting IE8 that cripples the home page. (That explains the BETA stamp.)


Very interesting idea. A little scary that the implementation is writing web scrapers in JavaScript (which are prone to breaking), but interesting nonetheless.


Why are they prone to breaking?


Web scrapers in general are prone to breaking because the HTML design can change. A div could become a span class='blocky'. Now your scrape doesn't work anymore.


One way to handle such breakage would be to define an extraction template on the fly, based on some content that doesn't change on the page. If, for example, you know that the stock quote for Google has GOOG and some number somewhere inside the body, you could search for that pattern, save the XPath or something, and reuse it on subsequent scrapes.

scRUBYt (http://scrubyt.org/) has a nice way of specifying such templates.
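
A rough browser-side sketch of that idea (not scRUBYt's actual mechanism): find the element containing a known token once, remember its positional path, and evaluate that path on later scrapes. The GOOG token and helper names are made up for illustration.

    // "Training" pass: locate the leaf element that mentions the known token (GOOG),
    // record an absolute XPath to it, then reuse that path on later scrapes.
    function pathTo(el) {
      const parts = [];
      for (; el && el.nodeType === 1; el = el.parentNode) {
        let i = 1, sib = el;
        while ((sib = sib.previousElementSibling)) {
          if (sib.tagName === el.tagName) i++;       // position among same-tag siblings
        }
        parts.unshift(el.tagName.toLowerCase() + '[' + i + ']');
      }
      return '/' + parts.join('/');
    }

    const hit = Array.from(document.querySelectorAll('body *'))
      .find(el => el.children.length === 0 && /GOOG/.test(el.textContent));
    const savedXPath = hit ? pathTo(hit) : null;     // persist this between runs

    // Later scrape: evaluate the saved XPath instead of hard-coding the markup.
    const result = savedXPath && document.evaluate(
      savedXPath, document, null,
      XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;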


Also you get a poor user experience when the service fails because their scraper domain gets explicitly blocked.

Scraping in general is not a system you can base any service on that anyone might ever want to "rely" on.


Because they'll break whenever the source tweaks their formatting in the slightest.


Yeah, that's a pain, but the concept - turning everything into a JSON feed - is great. There has to be a better way of doing it than JS scrapers, though.


I think any other solution (for example, auto-reconfiguring smart scripts that adapt to HTML changes) would quickly become so complex that it would be far more time-consuming to maintain than simply going and tweaking the individual JavaScript when an HTML change actually does occur.


That is an awesome favicon they have there. Also a really cool service :)



