Sunday, January 23, 2011

Using Clojure to work with RSS feeds

As I mentioned in my last blog entry, I am trying to learn Clojure. I think it is an interesting challenge to internalize a new way of thinking, so the following entry and code are an attempt to learn the paradigms of the language. Drop a comment if you have any suggestions.

I have found that, when learning a new language, one of the challenges is finding an interesting practice problem to solve: one that helps you understand the strengths of the language. I think parsing and data crunching is one problem space where applying Clojure would be interesting.

Example Problem
So the problem I picked is a simple one. I use Google Reader heavily; in fact, most of my web reading happens through subscribed RSS feeds. I find myself manually filtering the feed list to suit my interests at the time. This exercise creates a parser for RSS feeds that builds a data structure out of Clojure records. Once all the feeds are represented as records, you can write a variety of functions to filter them, or even apply intelligent web algorithms.

Organization of Code
I am looking for a good way to organize code in Clojure. Just as Java has packages, Clojure has namespaces, but I have not yet found a good way to organize Clojure functions into files and namespaces. For now, I chose to create one namespace per domain.
  • rss.data.clj - Defines the record data structures and the parser
  • rss.feed.clj - Defines functions to get feeds
  • rss.stat.clj - Defines functions to filter feeds (and maybe some statistical algorithms)
data.clj

(ns rss.data
  (:require [clojure.xml :as xml])
  (:import [java.io ByteArrayInputStream]))

(defrecord rss-channel [title description link lastbuilddate pubdate items])
(defrecord rss-item [title description link guid pubdate])

(defn create-rss-channel [map items]
  (let [title (:title map)
        pubDate (:pubDate map)
        description (:description map)
        link (:link map)
        lastBuildDate (:lastBuildDate map)]
    (rss-channel. title description link lastBuildDate pubDate items)))

(defn create-item [item-data]
  (let [description (:description item-data)
        pubDate (:pubDate item-data)
        guid (:guid item-data)
        link (:link item-data)
        title (:title item-data)]
    (rss-item. title description link guid pubDate)))

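;; item-data is the list of child elements of one <item>;
;; build a {tag content} map from them and turn it into a record.
(defn parse-item [item-data]
  (create-item (zipmap (map :tag item-data) (map :content item-data))))

(defn parse-items [items-data]
  (map parse-item items-data))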

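(defn parse-channel [channel-content]
  ;; `let` rather than `loop`: there is no `recur`, so nothing iterates here.
  (let [x (:content channel-content)]
    (create-rss-channel
      ;; channel-level fields: every child tag except :item
      (dissoc (zipmap (map :tag x) (map :content x)) :item)
      ;; items: the children of each :item element
      (parse-items (map :content (filter #(= :item (:tag %)) x))))))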

(defn parse-rss [rss]
  (let [rss-data (ByteArrayInputStream. (.getBytes rss "UTF-8"))]
    (for [x (xml-seq (xml/parse rss-data))
          :when (= :rss (:tag x))]
      (map parse-channel (:content x)))))
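
To see what these functions work with, it helps to look at the raw structure that clojure.xml/parse returns. The snippet below is only a sketch: the feed text and the names sample-rss, parsed, channel, items, and tag->content are made up for illustration and are not part of the project. It shows that every element becomes a map with :tag, :attrs, and :content keys, and that :content is a vector, which is why the record fields end up holding values like ["Some title"] rather than plain strings.

```clojure
(require '[clojure.xml :as xml])
(import '[java.io ByteArrayInputStream])

;; Hypothetical sample feed, for illustration only.
(def sample-rss
  (str "<rss version=\"2.0\"><channel>"
       "<title>Sample feed</title>"
       "<link>http://example.com/</link>"
       "<item><title>Learning Clojure</title><link>http://example.com/1</link></item>"
       "<item><title>Python tips</title><link>http://example.com/2</link></item>"
       "</channel></rss>"))

;; Parse the string the same way parse-rss does: through a byte stream.
(def parsed
  (xml/parse (ByteArrayInputStream. (.getBytes sample-rss "UTF-8"))))

;; The root is the <rss> element; its first child is <channel>.
(def channel (first (:content parsed)))

;; Map each child tag of an element to that child's :content vector.
(defn tag->content [element]
  (zipmap (map :tag (:content element))
          (map :content (:content element))))

;; The <item> children of the channel.
(def items (filter #(= :item (:tag %)) (:content channel)))

(:tag parsed)                          ; => :rss
(:title (tag->content channel))        ; => ["Sample feed"]
(:title (tag->content (first items)))  ; => ["Learning Clojure"]
```

Note that zipmap keeps only one entry per tag, so the channel-level map is useful for fields like :title and :link, while the repeated :item elements have to be collected separately, as parse-channel does.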

feed.clj

(ns rss.feed
  (:require [clj-http.client :as client]))

(defn get-feed [url]
  (:body (client/get url)))

(defn get-feeds [urllist]
  (pmap get-feed urllist))
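
One thing worth knowing about get-feeds is that pmap returns a lazy sequence, so the requests only start once the result is consumed. The sketch below illustrates this with a stand-in function, fake-fetch, in place of the real HTTP call so that it runs without network access; fake-fetch, calls, and fetch-all are hypothetical names used only for this example.

```clojure
;; Hypothetical stand-in for get-feed: records each call and returns a fake body.
(def calls (atom []))

(defn fake-fetch [url]
  (swap! calls conj url)
  (str "<rss>body of " url "</rss>"))

(defn fetch-all
  "Fetches every URL in parallel. doall forces the lazy pmap result, so
  all fetches have completed by the time this function returns."
  [urls]
  (doall (pmap fake-fetch urls)))

(def bodies (fetch-all ["http://example.com/a" "http://example.com/b"]))

;; pmap preserves input order in its result, even though the calls
;; themselves may run in any order.
(first bodies)   ; => "<rss>body of http://example.com/a</rss>"
(count @calls)   ; => 2
```

Whether to force the sequence eagerly like this is a design choice; leaving it lazy, as get-feeds does, is fine when the caller consumes every feed anyway.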

stat.clj

(ns rss.stat)

(def keywords ["Clojure" "Groovy" "Python"])

(defn find-items [rss-item-list keyword]
  (filter #(.contains (first (:title %)) keyword) (flatten rss-item-list)))

(defn find-items-keywords [rss-item-list keyword-list]
  (flatten (map #(find-items rss-item-list %) keyword-list)))
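
Because each record field holds the element's content vector, find-items uses first to pull the string out of :title before calling .contains. Here is a small sketch with hand-built items that can be run on its own. The Item record and the sample data are illustrative stand-ins for rss-item and a real feed, and this variant uses clojure.string/includes? (Clojure 1.8 or later) and skips items without a title, which differs slightly from the code above. It also shows one way to get only the titles out of the lazy sequence that the filter returns.

```clojure
(require '[clojure.string :as str])

;; Illustrative stand-in for the rss-item record in rss.data.
(defrecord Item [title link])

(def items
  [(->Item ["Hidden features of Python"] ["http://example.com/python"])
   (->Item ["Clojure for the brave"]     ["http://example.com/clojure"])
   (->Item ["Gardening weekly"]          ["http://example.com/garden"])])

(defn find-items
  "Returns the items whose title contains keyword. (:title item) is a
  vector holding one string, so first extracts that string; items with
  no title are skipped instead of throwing."
  [items keyword]
  (filter #(when-let [title (first (:title %))]
             (str/includes? title keyword))
          items))

(defn find-items-keywords
  "Concatenates the matches for each keyword, in keyword order."
  [items keywords]
  (mapcat #(find-items items %) keywords))

;; Records support keyword lookup, so the titles can be pulled out with map.
(def matching-titles
  (map (comp first :title)
       (find-items-keywords items ["Clojure" "Python"])))
;; matching-titles => ("Clojure for the brave" "Hidden features of Python")
```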

Trying Delicious RSS
Let's use an RSS feed published by delicious.com, which offers a number of them. We will start with the popular feed for the tag "programming". (http://feeds.delicious.com/v2/rss/popular/programming)

user=> (rss.stat/find-items-keywords (map :items (flatten (rss.data/parse-rss (rss.feed/get-feed "http://feeds.delicious.com/v2/rss/popular/programming")))) ["Clojure" "Python" "Groovy"])
(#:rss.data.rss-item{:title ["Hidden features of Python - Stack Overflow"], :description nil, :link ["http://stackoverflow.com/questions/101268/hidden-features-of-python/102062"], :guid ["http://www.delicious.com/url/16bbee33bbf4f6a03448058c5fd9f461#thaiyoshi"], :pubdate ["Sun, 23 Jan 2011 16:53:58 +0000"]}
#:rss.data.rss-item{:title ["Writing Forwards Compatible Python Code | Armin Ronacher's Thoughts and Writings"], :description nil, :link ["http://lucumr.pocoo.org/2011/1/22/forwards-compatible-python/"], :guid ["http://www.delicious.com/url/c6fb2679bca6c193a41d92513f7edca1#papaeye"], :pubdate ["Sat, 22 Jan 2011 16:09:10 +0000"]})
This one had only a few matching links. Let's try one that has lots of items. (http://feeds.delicious.com/v2/rss/tag/programming?count=100)

user=> (rss.stat/find-items-keywords (map :items (flatten (rss.data/parse-rss (rss.feed/get-feed "http://feeds.delicious.com/v2/rss/tag/programming?count=100")))) ["Clojure" "Python" "Groovy"])

(#:rss.data.rss-item{:title ["The Evolution of a Python Programmer : Aleks' Domain"], :description nil, :link ["http://metaleks.net/programming/the-evolution-of-a-python-programmer"], :guid ["http://www.delicious.com/url/15311aafa2207da0ffc770903091ccea#tsung"], :pubdate ["Mon, 24 Jan 2011 02:44:19 +0000"]}
#:rss.data.rss-item{:title ["Hidden features of Python - Stack Overflow"], :description nil, :link ["http://stackoverflow.com/questions/101268/hidden-features-of-python/102062"], :guid ["http://www.delicious.com/url/16bbee33bbf4f6a03448058c5fd9f461#adharmad"], :pubdate ["Mon, 24 Jan 2011 02:43:07 +0000"]}
#:rss.data.rss-item{:title ["PySCeS: the Python Simulator for Cellular Systems"], :description nil, :link ["http://pysces.sourceforge.net/index.html"], :guid ["http://www.delicious.com/url/40e012b894d62cd8cb9ba6b95855b868#tiguco"], :pubdate ["Mon, 24 Jan 2011 02:39:24 +0000"]}
#:rss.data.rss-item{:title ["Hidden features of Python - Stack Overflow"], :description nil, :link ["http://stackoverflow.com/questions/101268/hidden-features-of-python/102062"], :guid ["http://www.delicious.com/url/16bbee33bbf4f6a03448058c5fd9f461#lonstile"], :pubdate ["Mon, 24 Jan 2011 02:24:25 +0000"]}
The source code is available on GitHub (https://github.com/kartikshah/rss-feeder)

5 comments:

Anonymous said...

nice.!

Unknown said...

Hi Kartik, cool! You may consider leveraging Clojure's map destructuring capabilities, which would reduce some duplication in your let bindings. For instance, the function:

(defn create-rss-channel [map items]
  (let [title (:title map)
        pubDate (:pubDate map)
        description (:description map)
        link (:link map)
        lastBuildDate (:lastBuildDate map)]
    (rss-channel. title description link lastBuildDate pubDate items)))

could become:

(defn create-rss-channel [map items]
  (let [{:keys [title pubDate description link lastBuildDate]} map]
    (rss-channel. title description link lastBuildDate pubDate items)))

Kartik Shah said...

Thanks for the suggestion.

Updated @ GitHub https://github.com/kartikshah/rss-feeder

Anonymous said...

I am trying to learn Clojure and came across your RSS feed reader. I would like to know what each function does with respect to the XML RSS output.

Anonymous said...

Hi, I'm just beginning with Clojure and I have a couple of questions.

1) In:
(filter #(.contains (first (:title %)) keyword) (flatten rss-item-list)))

why use "first"?

2) Using find-items-keywords I get a lazy seq instead of a hash. How can I get the title and show only the title? Thanks.