Pismo used to use ruby-readability (linked above), but I ended up writing my own system. It works similarly to Readability but does better on certain types of poorly formatted content (and worse on others, so YMMV). Pismo is more of a general-purpose content extraction library than Readability and is better suited to machine processing and summarization (which is what I use it for).
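If you'd rather use it as a library, the Ruby looks roughly like this (a sketch assuming Pismo's Document class accepts a URL and exposes the same attributes the CLI prints; check the gem's README for the exact interface):

    require 'pismo'

    # Fetch and parse a page; Document should also accept a raw HTML string.
    doc = Pismo::Document.new('http://preona.net/2010/11/ever-wanted-arc90s-readability-as-an-api/')

    puts doc.title        # => "Ever wanted arc90's Readability as an API?"
    puts doc.sentences(3) # the first few sentences, handy as an automatic summary
    puts doc.body         # the extracted article text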
Pismo also comes with a command line client built-in, so you can do stuff like this:
$ pismo http://preona.net/2010/11/ever-wanted-arc90s-readability-as-an-api/ title sentences
:title: "Ever wanted arc90's Readability as an API?"
:sentences: Over at Preona we have been wanting something just like that for a while now. So we built it! Some time ago, while developing LazyReadr, we were faced with the fact that RSS feeds simply aren't all that lovely anymore.
Note: "sentences" picks the first few sentences by default, but this is ideal for a summary by an automated system or for a news page :-)
You should check out topicmarks. It does summaries in a very smart way from what I've seen. We'll likely start using it to do automated summaries for RSS content in our iPad app.
Sadly, though, it "takes minutes" (and they even seem to make a big point of that…). It might be useful for slightly better summaries, though so far I've had great luck with just taking the first paragraph of an article (or other metadata if it scores better, like the <meta> description).
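That first-paragraph-or-meta fallback is easy to sketch. Here's a rough Ruby version with Nokogiri (a hypothetical helper, not anyone's production code), where "scores better" is stood in for by a simple length comparison:

    require 'nokogiri'
    require 'open-uri'

    # Hypothetical helper: take the first paragraph of a page, but fall back to
    # the <meta name="description"> content if it "scores better" -- here the
    # scoring is just length, a stand-in for a smarter heuristic.
    def quick_summary(url)
      doc  = Nokogiri::HTML(URI.open(url))
      para = doc.at('p')&.text.to_s.strip
      meta = doc.at('meta[name="description"]')&.[]('content').to_s.strip
      meta.length > para.length ? meta : para
    end

    puts quick_summary('http://preona.net/2010/11/ever-wanted-arc90s-readability-as-an-api/')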
I ported it to Java a while back and was amazed at how far a simple heuristic (looking for blocks of text with lots of commas) went. Unfortunately, it has several drawbacks that kept me from relying on it fully, so I came up with an algorithm of my own that turned out to be similar to boilerpipe's (using decision trees) and worked better in some cases. I haven't gone back to this problem, but I have some ideas for ways it could be improved.
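For the curious, the comma heuristic itself is only a few lines. A rough Ruby rendition (not the Java port mentioned above) might look like this:

    require 'nokogiri'

    # Crude content extraction: blocks of prose tend to contain lots of commas,
    # while navigation, menus, and boilerplate contain almost none.
    def commaful_blocks(html, min_commas: 3)
      Nokogiri::HTML(html).css('p')
                          .map { |node| node.text.strip }
                          .select { |text| text.count(',') >= min_commas }
    end

    puts commaful_blocks(File.read('article.html')).join("\n\n")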
I actually suggested a <usercontent> tag to one of the people on the HTML5 committee, because, among other things, comment sections throw off many of these web scraping algorithms. After three emails he didn't buy my argument, so I gave up. So, all web designers: please, oh pretty please, use the up-and-coming <article> tag correctly. Many data/text miners will thank you dearly!
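To illustrate why correct <article> markup helps (a hypothetical sketch, not part of any library mentioned in this thread): an extractor can trust the tag when it's present and only fall back to heuristics when it isn't.

    require 'nokogiri'

    # Prefer properly marked-up content: if the page uses <article>, trust it
    # and skip the heuristics; otherwise fall back to whatever extraction
    # you'd normally run (stubbed out below).
    def extract_text(html)
      doc = Nokogiri::HTML(html)
      article = doc.at('article')
      return article.text.strip if article

      heuristic_extract(doc)
    end

    # Placeholder for a comma-count / decision-tree style fallback.
    def heuristic_extract(doc)
      doc.css('p').map(&:text).join("\n")
    end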
We, too, have ported Readability to a web-based API, though ours is synchronous. It's lightning fast under the hood (thank you, Cython), and you can hit it as much as you like, as quickly as you like. We currently process about 5 million calls to clean-html.json a month, which is odd, because we added it to our set of API calls on a whim. It has turned out to be our most popular API call by a large margin.
We're also going to be issuing an update soon to add multi-page reading, so you can grab those long, paginated New Yorker articles in a single API call.
We do have the ability to retain some of the formatting (simple HTML tags) but haven't exposed that over the API yet. We'll update that this week, so come on by a bit later in the week.
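Purely for illustration, a synchronous call to an endpoint like clean-html.json might look like the sketch below; the host, parameter names, and response shape are guesses on my part, not the actual API.

    require 'net/http'
    require 'json'
    require 'uri'

    # Hypothetical example only -- the real base URL, parameter names, and auth
    # requirements for clean-html.json are almost certainly different.
    uri = URI('https://api.example.com/v2/clean-html.json')
    uri.query = URI.encode_www_form(url: 'http://www.newyorker.com/some-long-article')

    response = Net::HTTP.get_response(uri)  # synchronous: blocks until the cleaned content comes back
    p JSON.parse(response.body)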
UPDATE at midnight: the view-scraped-page part of the architecture was meant as a helper function, and being front-paged on Hacker News obviously crashed it. The two examples have been moved to static files served by nginx. Sorry for any inconvenience.
If anyone didn't manage to view the two examples, please go check them again :)
You guys are awesome. I literally just embarked on building an RSS reader that fits my personal taste and was wondering to myself how to get page content filtered through Readability. I was envisioning weird hacks, but this is perfect.
I had a similar idea two months ago, but instead of a web API (with all the scalability and privacy issues that ensue), I made it a browser extension, where people could add parameters to embedded iframes in order to display them nicely formatted.