February 12 2011

Feedability: NodeJS Feed Proxy With Readability


This is something I’ve planned to do for a really long time, so I’m really happy that I was finally able to realize it.

Major commercial news sites have a common habit of publishing only short article excerpts (if anything at all) in their feeds. Although I can totally understand why they do this (to increase page impressions and ad revenue), it is really distracting when you read with a feed reader and need to open a web browser every time you want to read the complete article. This gets even worse in situations where you have no Internet connection available. Another possible scenario is that you’ve waited too long and the article has already been unpublished. And yes, the German public broadcasting stations are actually forced (by the German publisher lobby) to unpublish (“depublizieren”) their content after some days. These are (for me at least) totally valid points that justify the effort to do something about it.

It is of course not a new problem; as I said, I’ve planned to do something about it for quite some time now. One of the first solutions I came up with was to use site-specific regular expressions to extract the content of articles and build a new feed (including the extracted full article text). I’m not the only one who thought of this solution: there are, for example, some Snownews/Liferea/Newsbeuter filter scripts available that do exactly that. This works, but it would be better to have a more generic solution than specifying and maintaining regular expressions (or XPath expressions, for that matter) for all the news sites and (“commercial”) blogs I read.
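To make the regular expression approach concrete, here is a minimal sketch of what such a filter boils down to. The host names, the patterns and the function name `extractWithRule` are made up for illustration; they are not taken from any of the filter scripts mentioned above.

```javascript
// Hypothetical per-site rules: host names and patterns are invented examples.
// Every site needs its own entry, and every redesign of a site breaks its rule.
const siteRules = new Map([
  ['news.example.com', /<div class="article-body">([\s\S]*?)<\/div>/],
  ['blog.example.org', /<article>([\s\S]*?)<\/article>/],
]);

// Returns the extracted article HTML, or null when no rule exists
// for the host or the rule does not match the page.
function extractWithRule(host, html) {
  const rule = siteRules.get(host);
  if (!rule) {
    return null;
  }
  const match = rule.exec(html);
  return match ? match[1].trim() : null;
}
```

Note that the first pattern stops at the first closing `</div>`, so a nested `<div>` inside the article body already truncates the result. That kind of fragility is exactly the maintenance burden described above.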

The more powerful approach to extracting the articles would be to use content extraction or template detection algorithms. I wrote an article (in German) about that when I played with some of these algorithms a while back. But I couldn’t find a suitable implementation that was developed and stable enough to do this, and I wasn’t really crazy about writing one of my own either.

Then in 2009 came arc90’s Readability, which implements a mature content extraction algorithm in client-side JavaScript. It is not perfect, but I guess it is by far the best open source solution available for this right now. One problem in particular that I’ve noticed is comment sections below articles: sometimes the comments contain more text than the actual article, which can confuse Readability into thinking that a comment is the actual main content. So although it works most of the time, you should expect problems like this. The first application I’m aware of that uses Readability for feeds is the feed reader “Reeder” for Apple platforms, which can fetch the full text for selected articles.

There have been some attempts to port Readability to other languages, but I’ve never seen a complete reimplementation. A few days ago I stumbled upon a Readability NodeJS library written by Arrix Zhou that uses just a slightly modified version of the original. Since I’ve planned to learn NodeJS anyway (like most people, I’ve only written client-side JavaScript before), I used the opportunity to write Feedability:

Feedability is written in JavaScript using Google’s V8 engine and NodeJS (written and tested with v0.3.8). It requires the node-readability and node-expat libraries, which can be installed using npm. Feedability implements a small HTTP server; you pass the feed you want to read in the request URL, for example: http://127.0.0.1:1912/http://example.com/atom.xml The NodeJS server will download the feed and parse it for item links (article URLs); it will also remove any existing content excerpts. It will then scrape all articles it found (or use a cache file) and run Readability over the received pages to extract the article text (or use a cache file). The original feed is then extended with the full articles and sent to the user (the feed reader software).
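The request handling and the caching step described above can be sketched roughly as follows. This is a simplified illustration, not Feedability’s actual code: the names `feedUrlFromRequestPath` and `expandItems` are hypothetical, an in-memory `Map` stands in for the cache files, the download-and-Readability step is passed in as a plain synchronous function (the real thing would be asynchronous), and the sketch assumes a Node version that provides the global `URL` class.

```javascript
// Hypothetical helpers, not taken from Feedability's source.

// Turns a request path such as "/http://example.com/atom.xml"
// into the feed URL that the proxy should download.
function feedUrlFromRequestPath(requestPath) {
  // Drop the single leading slash that every request path starts with.
  const candidate = requestPath.replace(/^\//, '');
  let parsed;
  try {
    parsed = new URL(candidate);
  } catch (err) {
    return null; // not an absolute URL, e.g. "/favicon.ico"
  }
  // Only plain web feeds are proxied in this sketch.
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return null;
  }
  return parsed.href;
}

// items: array of { link, content } objects parsed from the feed.
// cache: Map from article URL to extracted article HTML.
// fetchAndExtract: function (url) that downloads the page and returns the
//   extracted article HTML (assumed to wrap the download and Readability).
function expandItems(items, cache, fetchAndExtract) {
  return items.map((item) => {
    // Only scrape articles that are not cached yet.
    if (!cache.has(item.link)) {
      cache.set(item.link, fetchAndExtract(item.link));
    }
    // Replace the excerpt with the full article text.
    return { ...item, content: cache.get(item.link) };
  });
}
```

With this shape, requesting the same feed twice only scrapes each article once, which matters because feed readers poll the proxy over and over again.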

I’ve tested it with Atom, RSS 1.0 and RSS 2.0 feeds, but there are some known bugs; for instance, the character encoding sometimes breaks. As I said, this is my first NodeJS application, and there are some parts that I’m particularly unhappy with, for example the current feed parser/generator based on expat (lib/feed.js). Maybe I’m going to rewrite that sometime.