Pismo used to use ruby-readability (linked above), but I ended up writing my own system. It works similarly to Readability but does better on certain types of poorly formatted content (and worse on others, so YMMV). Pismo is more of a general-purpose content extraction library than Readability and is better suited to machine processing and summarization (which is what I use it for).
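If you'd rather use it as a library, the Ruby looks roughly like this (a sketch assuming Pismo's Document class accepts a URL and exposes the same attributes the CLI prints; check the gem's README for the exact interface):

    require 'pismo'

    # Fetch and parse a page; Document should also accept a raw HTML string.
    doc = Pismo::Document.new('http://preona.net/2010/11/ever-wanted-arc90s-readability-as-an-api/')

    puts doc.title        # => "Ever wanted arc90's Readability as an API?"
    puts doc.sentences(3) # the first few sentences, handy as an automatic summary
    puts doc.body         # the extracted article text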
Pismo also comes with a command line client built-in, so you can do stuff like this:
$ pismo http://preona.net/2010/11/ever-wanted-arc90s-readability-as-an-api/ title sentences
:title: "Ever wanted arc90's Readability as an API?"
:sentences: Over at Preona we have been wanting something just like that for a while now. So we built it! Some time ago, while developing LazyReadr, we were faced with the fact that RSS feeds simply aren't all that lovely anymore.
Note: "sentences" picks the first few sentences by default, but this is ideal for a summary by an automated system or for a news page :-)
You should check out topicmarks. It does summaries in a very smart way from what I've seen. We'll likely start using it to do automated summaries for RSS content in our iPad app.
Sadly, though, it "takes minutes" (and they even seem to make a big point of that…). It might be useful for slightly better summaries, though so far I've had great luck with just taking the first paragraph of an article (or other metadata if it scores better, like the <meta> description).
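That first-paragraph-or-meta fallback is easy to sketch. Here's a rough Ruby version with Nokogiri (a hypothetical helper, not anyone's production code), where "scores better" is stood in for by a simple length comparison:

    require 'nokogiri'
    require 'open-uri'

    # Hypothetical helper: take the first paragraph of a page, but fall back to
    # the <meta name="description"> content if it "scores better" -- here the
    # scoring is just length, a stand-in for a smarter heuristic.
    def quick_summary(url)
      doc  = Nokogiri::HTML(URI.open(url))
      para = doc.at('p')&.text.to_s.strip
      meta = doc.at('meta[name="description"]')&.[]('content').to_s.strip
      meta.length > para.length ? meta : para
    end

    puts quick_summary('http://preona.net/2010/11/ever-wanted-arc90s-readability-as-an-api/')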
I ported it to Java a while back and was amazed at how far a simple heuristic (looking for blocks of text with lots of commas) went. Unfortunately, it has several drawbacks that kept me from relying on it fully, so I came up with an algorithm of my own that turned out to be similar to boilerpipe's (using decision trees) and worked better in some cases. I haven't gone back to this problem, but I have some ideas for ways it could be improved.
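For the curious, the comma heuristic itself is only a few lines. A rough Ruby rendition (not the Java port mentioned above) might look like this:

    require 'nokogiri'

    # Crude content extraction: blocks of prose tend to contain lots of commas,
    # while navigation, menus, and boilerplate contain almost none.
    def commaful_blocks(html, min_commas: 3)
      Nokogiri::HTML(html).css('p')
                          .map { |node| node.text.strip }
                          .select { |text| text.count(',') >= min_commas }
    end

    puts commaful_blocks(File.read('article.html')).join("\n\n")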
I actually suggested a <usercontent> tag to one of the people on the HTML5 committee, because, among other things, comment sections throw off many of these web scraping algorithms. After three emails he didn't buy my argument, so I gave up. So, all web designers: please, oh pretty please, use the up-and-coming <article> tag correctly. Many data/text miners will thank you dearly!
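To illustrate why correct <article> markup helps (a hypothetical sketch, not part of any library mentioned in this thread): an extractor can trust the tag when it's present and only fall back to heuristics when it isn't.

    require 'nokogiri'

    # Prefer properly marked-up content: if the page uses <article>, trust it
    # and skip the heuristics; otherwise fall back to whatever extraction
    # you'd normally run (stubbed out below).
    def extract_text(html)
      doc = Nokogiri::HTML(html)
      article = doc.at('article')
      return article.text.strip if article

      heuristic_extract(doc)
    end

    # Placeholder for a comma-count / decision-tree style fallback.
    def heuristic_extract(doc)
      doc.css('p').map(&:text).join("\n")
    end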
We, too, have ported Readability to a web-based API, though ours is synchronous. It's lightning fast under the hood (thank you, Cython), and you can hit it as much as you like, as quickly as you like. We currently process about 5 million calls to clean-html.json a month, which is odd, because we added it to our set of API calls on a whim. It has turned out to be our most popular API call by a large margin.
We're also going to be issuing an update soon to add multi-page reading, so you can grab those long, paginated New Yorker articles in a single API call.
We do have the ability to retain some of the formatting (simple HTML tags) but haven't exposed that over the API yet. We'll update that this week, so come on by a bit later in the week.
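Purely for illustration, a synchronous call to an endpoint like clean-html.json might look like the sketch below; the host, parameter names, and response shape are guesses on my part, not the actual API.

    require 'net/http'
    require 'json'
    require 'uri'

    # Hypothetical example only -- the real base URL, parameter names, and auth
    # requirements for clean-html.json are almost certainly different.
    uri = URI('https://api.example.com/v2/clean-html.json')
    uri.query = URI.encode_www_form(url: 'http://www.newyorker.com/some-long-article')

    response = Net::HTTP.get_response(uri)  # synchronous: blocks until the cleaned content comes back
    p JSON.parse(response.body)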
UPDATE at midnight: the view-scraped-page part of the architecture was meant as a helper function, and being front-paged on Hacker News obviously crashed it. The two examples have been moved to static files served by nginx. Sorry for any inconvenience.
If anyone didn't manage to view the two examples, please go check them again :)
You guys are awesome. I literally just embarked on building an RSS reader that fits my personal taste and was wondering to myself how to get page content filtered through Readability. I was envisioning weird hacks, but this is perfect.
I had a similar idea two months ago, but instead of a web API (with all the scalability and privacy issues that ensue), I made it a browser extension, where people could add parameters to embedded iframes in order to display them nicely formatted.