Ever wanted arc90's Readability as an API? (preona.net)
91 points by Swizec on Nov 15, 2010 | 26 comments



Readability has been ported to Python and PHP: http://blog.arc90.com/2009/06/20/readability-now-available-i... There is also a C# port here: http://code.google.com/p/nreadability/


There is also a Ruby port here: https://github.com/iterationlabs/ruby-readability

The beauty of our approach is that we didn't have to port it and we can adapt much quicker when new versions come out.


Ruby-readability is one of my projects. Let me know if you have any questions.


Great, I searched GitHub for a Ruby port of Readability, but the closest I could find was Pismo: https://github.com/peterc/pismo


Pismo used to use ruby-readability (linked above) but I ended up writing my own system. It works similarly to Readability but is better on certain types of poorly formatted content (and worse on others, so YMMV). Pismo is more of a general-purpose content extraction library than Readability and is better suited for machine processing and summarization (which is what I use it for).

Pismo also comes with a command line client built-in, so you can do stuff like this:

  $ pismo http://preona.net/2010/11/ever-wanted-arc90s-readability-as-an-api/ title sentences

  :title: "Ever wanted arc90's Readability as an API?"
  :sentences: Over at Preona we have been wanting something just like that for a while now. So we built it! Some time ago, while developing LazyReadr, we were faced with the fact that RSS feeds simply aren't all that lovely anymore.
Note: "sentences" picks the first few sentences by default, but this is ideal for a summary by an automated system or for a news page :-)
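
For readers who want the same "first few sentences" behaviour outside Ruby, here is a minimal Python sketch. It is not how Pismo implements it; the function name `first_sentences`, the `limit` parameter, and the naive punctuation-based split are all assumptions made for illustration.

```python
import re


def first_sentences(text: str, limit: int = 3) -> str:
    """Return the first `limit` sentences of `text` (hypothetical helper).

    Sentences are split naively after '.', '!' or '?' followed by
    whitespace, so abbreviations such as "e.g." will be split too.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(part for part in parts[:limit] if part)


if __name__ == "__main__":
    sample = (
        "Over at Preona we have been wanting something just like that "
        "for a while now. So we built it! Some time ago we hit a problem."
    )
    print(first_sentences(sample, limit=2))
```

A real summarizer would likely want a proper sentence tokenizer, but for feed previews a split like this may be good enough.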


You should check out topicmarks. It does summaries in a very smart way from what I've seen. We'll likely start using it to do automated summaries for RSS content in our iPad app.


Sadly, though, it "takes minutes" (and they even seem to make a big point of that). It might be useful for slightly better summaries, though I've had great luck so far just using the first paragraph of an article (or certain other metadata if it scores better, like the <meta> description).
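
The "first paragraph, or <meta> description if it looks better" approach can be sketched in Python with only the standard library. This is an illustration, not the commenter's code: the class and function names are hypothetical, and the "scoring" is reduced to an assumed minimum length (`min_len`) for the description.

```python
from html.parser import HTMLParser


class SummaryParser(HTMLParser):
    """Collect the <meta name="description"> content and the first non-empty <p>."""

    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.first_paragraph = None
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = (attrs.get("content") or "").strip() or None
        elif tag == "p" and self.first_paragraph is None:
            # Start (or restart) buffering text for a candidate paragraph.
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            text = " ".join("".join(self._buf).split())
            if text:
                self.first_paragraph = text

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def summarize(html: str, min_len: int = 40):
    """Prefer the meta description if it is long enough (assumed rule),
    otherwise fall back to the first paragraph."""
    parser = SummaryParser()
    parser.feed(html)
    parser.close()
    desc = parser.meta_description
    if desc and len(desc) >= min_len:
        return desc
    return parser.first_paragraph or desc
```

The length cutoff is only a stand-in for whatever scoring a real system uses to decide between the two candidates.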


I recently converted the Python port to use lxml instead of BeautifulSoup.

If you need speed (BeautifulSoup is slow) and not portability (lxml won't work on Google App Engine), check out http://blog.interstellr.com/readability-in-python-using-lxml


I ported it to Java a while back and was amazed at how far a simple heuristic (looking for blocks of text with lots of commas) could go. Unfortunately, it has several drawbacks that have kept me from using it fully, so I came up with an algorithm that turned out to be similar to boilerpipe's (using decision trees) and worked better in some cases. I haven't gone back to this problem, but I have some ideas on ways that it can be improved.
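
To make the comma heuristic concrete, here is a rough Python sketch. It is only loosely modeled on the idea described above and is not the Readability source or the Java port; the function names, the 25-character cutoff, and the length bonus are assumptions chosen for illustration.

```python
def score_block(text: str) -> float:
    """Score a block of text: commas and length hint at article prose."""
    text = text.strip()
    if len(text) < 25:  # assumed cutoff: ignore nav links and other tiny blocks
        return 0.0
    score = 1.0                         # base point for any real block
    score += text.count(",")            # one point per comma
    score += min(len(text) // 100, 3)   # small bonus for length, capped at 3
    return score


def best_block(blocks):
    """Return the highest-scoring block from a list of text blocks."""
    return max(blocks, key=score_block)


if __name__ == "__main__":
    candidates = [
        "Home | About | Contact",
        "Over at Preona, we have been wanting something like that for a "
        "while, so we built it, and it works.",
    ]
    print(best_block(candidates))
```

Navigation bars and link lists rarely contain commas, which is why such a crude signal separates them from running prose surprisingly well.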

I actually suggested a <usercontent> tag to one of the people on the HTML5 committee because, among other things, comment sections throw off many of these web scraping algorithms. After three emails he still didn't buy my argument, so I gave up. So, all web designers, please, oh pretty please, use the up-and-coming <article> tag correctly. Many data/text miners will thank you dearly!
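
As an illustration of why a correctly used <article> tag helps scrapers, here is a small standard-library Python sketch that keeps only the text inside <article> elements and ignores everything else, such as a comments section. The class and function names are hypothetical, and a real extractor would also need a fallback for pages without the tag.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect text that appears inside <article> elements only."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # nesting depth of open <article> tags
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the collected chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def article_text(html: str) -> str:
    """Return the text inside <article>, or "" if the page has none."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With markup like this, no scoring heuristic is needed at all, since the page itself says which part is the content.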


I wrote something similar a while ago as a bit of an experiment - again, it uses Readability and jsdom.

It's slow because there's a lot of unoptimised code in jsdom which makes certain DOM manipulations pretty slow. It's early days for this stuff.

So in lieu of them open sourcing their code, which is almost certainly more robust, here it is:

https://github.com/tomtaylor/thelma


Having the benefit of hacking around jsdom for a while to fix certain bugs, I could tweak Readability to work around the slowness.

Also node-htmlparser now contains a patch of mine so it's a bit more robust.

I think you'll find your code might be a bit more robust than you left it, in part due to my wanting to make our API robust :)


We, too, have ported Readability to a web-based API; ours, however, is synchronous. It's lightning fast under the hood (thank you, Cython) and you can hit it as much as you like, as quickly as you like. We currently process about 5 million calls to clean-html.json a month, which is odd, because we added it to our set of API calls on a whim. It's turned out to be our most popular API call, by a large margin.

We're also going to be issuing an update soon to include multi-page reading. So you can grab those long, paginated New Yorker articles all in one API call.

http://www.repustate.com


Thanks - that looks quite interesting. Do you offer anything that cleans up the HTML but preserves formatting and images in the article?


Internally, yes. Give it a day or two and I'll add a flag to let you preserve some of the formatting.


That's fantastic, any plans on releasing the source?

EDIT: Oh, wait, that's nothing like readability, is it? It just cleans HTML up into text...


We do have the ability to retain some of the formatting (simple HTML tags) but haven't exposed that over the API yet. We'll update that this week, so come on by a bit later in the week.


UPDATE at midnight: the view-scraped-page part of the architecture was meant as a helper function. Being frontpaged on Hacker News obviously crashed it. The two examples have been moved to static files served by nginx. Sorry for any inconvenience.

If anyone didn't manage to view the two examples, please go check them again :)

--> http://plateboiler.lazyreadr.com/static/example1.html and http://plateboiler.lazyreadr.com/static/example2.html


Boilerpipe has been amazing for me as a Java library: http://code.google.com/p/boilerpipe/


And there's ViewText as well - http://viewtext.org/help/api


Thanks, I created viewtext. It also populates RSS feeds to contain the full content and extracts text from PDFs.


You guys are awesome. I literally just embarked on building an RSS reader that fits my personal taste and was wondering to myself how to get page content filtered through Readability. I was envisioning weird hacks, but this is perfect.


I have two suggestions about making an RSS reader:

1. Help us instead :)

2. Use Superfeedr, it will make your life easier.

Also, thanks for calling us awesome. We're really not that awesome, we just have too much time on our hands :P


I had a similar idea two months ago, but instead of a web API (with all the scalability and privacy issues that ensue), I made it a browser extension where people can add parameters to embedded iframes in order to display them nicely formatted.

https://chrome.google.com/extensions/detail/nahmdndkmncjhppb...


Does Safari really use Readability? Any sources on that?


http://www.downloadsquad.com/2010/06/08/think-safari-reader-...

First hit on Google for "Safari Readability"


Help > Acknowledgements



