2010-02-06

Fetching a web page in Clojure

To automate part of my daily reporting routine I experimented with Clojure.
Here's a prototype script that fetches a web page and extracts its HTML title. Its limitation stems from the fact that it parses the HTML into a DOM with an XML parser, so the page must at least be well-formed (blogger.com's page is not).
Dependencies are Apache HttpClient, Apache Commons IO and the XML libraries from the JDK.

The part I especially like is:
(map #(.getNodeValue (.item nodes %)) (range (.getLength nodes)))

It turns the implicit collection of DOM nodes, which is exposed only through the two methods "item(index)" and "getLength", into a convenient lazy sequence.
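The same idiom can be tried in isolation with nothing but the JDK. The following is only a minimal sketch: the helper name node-seq and the sample XML are made up for illustration and are not part of the script.

```clojure
(import '(java.io ByteArrayInputStream)
        '(javax.xml.parsers DocumentBuilderFactory)
        '(org.w3c.dom NodeList))

; Hypothetical helper (not in the original script): wraps the
; item/getLength pair of a NodeList in a lazy sequence of nodes.
(defn node-seq [^NodeList nodes]
  (map #(.item nodes %) (range (.getLength nodes))))

; Hypothetical sample document, used only for this illustration.
(def sample-xml "<list><li>one</li><li>two</li><li>three</li></list>")

(def sample-doc
  (.parse (.newDocumentBuilder (DocumentBuilderFactory/newInstance))
          (ByteArrayInputStream. (.getBytes sample-xml "UTF-8"))))

; Once the NodeList is a sequence, ordinary sequence functions apply.
(def item-texts
  (map #(.getTextContent %)
       (node-seq (.getElementsByTagName sample-doc "li"))))

(println item-texts) ; prints (one two three)
```

Because the result is lazy, the nodes are only read from the NodeList when the sequence is consumed.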

The script is below:

(ns getReport
 (:import
 (java.net URL)
 (java.io ByteArrayInputStream InputStream)
 (org.apache.http.client ResponseHandler HttpClient)
 (org.apache.http.client.methods HttpGet)
 (org.apache.http.impl.client BasicResponseHandler DefaultHttpClient)
 (java.util.regex Pattern Matcher)
 (javax.xml.parsers DocumentBuilderFactory)
 (org.w3c.dom Document)
 (org.apache.commons.io IOUtils)
 (javax.xml.xpath XPathFactory XPathConstants)))

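(defn getResource [url]
 (let [client (DefaultHttpClient.)]
  (try
   ; BasicResponseHandler returns the response body as a String
   (.execute client (HttpGet. url) (BasicResponseHandler.))
   (finally
    ; Release the connections even when the request fails
    (.. client getConnectionManager shutdown)))))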

(defn makeDocFactory []
 (let [factory (DocumentBuilderFactory/newInstance)]
  (doto factory
   (.setValidating false)
   (.setExpandEntityReferences false)
   (.setXIncludeAware false)
   (.setSchema nil))
  (let [dBuilder (.newDocumentBuilder factory)]
   ; This entity resolver disables fetching DTD (w3c will be happy)
   (.setEntityResolver dBuilder
    (proxy [org.xml.sax.EntityResolver] []
     (resolveEntity [publicId systemId]
      (new org.xml.sax.InputSource (new java.io.StringReader "")))))
   dBuilder)))

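(defn parseDom [html]
 (let [builder (makeDocFactory)]
  ; Parse from a character stream so that no byte encoding is involved
  ; (.getBytes without a charset would use the platform default)
  (.parse builder (org.xml.sax.InputSource. (java.io.StringReader. html)))))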

(defn queryDoc [dom query]
 (let [factory (XPathFactory/newInstance)
   xpath (.newXPath factory)
   expr (.compile xpath query)
   nodes (.evaluate expr dom XPathConstants/NODESET)]
  (map #(.getNodeValue (.item nodes %)) (range (.getLength nodes)))))

(defn parseReport [htmlText]
 (queryDoc (parseDom htmlText) "/html/head/title/text()"))

(println (parseReport (getResource "http://twitter.com/")))
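The well-formedness limitation mentioned at the top can be reproduced with the JDK parser alone. This is just a sketch under my own assumptions: the helper name well-formed? and both sample strings are hypothetical and not part of the script above.

```clojure
(import '(java.io ByteArrayInputStream)
        '(javax.xml.parsers DocumentBuilderFactory)
        '(org.xml.sax SAXParseException))

; Hypothetical helper (not in the original script): returns true when
; the JDK XML parser accepts the markup, false when it rejects it.
(defn well-formed? [markup]
  (try
    (.parse (.newDocumentBuilder (DocumentBuilderFactory/newInstance))
            (ByteArrayInputStream. (.getBytes markup "UTF-8")))
    true
    (catch SAXParseException _ false)))

; Every element is closed, so the XML parser accepts it.
(println (well-formed? "<html><head><title>Demo</title></head><body><br/></body></html>"))

; An unclosed <br> is common in real HTML but is not well-formed XML.
(println (well-formed? "<html><head><title>Demo</title></head><body><br></body></html>"))
```

The first call prints true and the second prints false. The default error handler of the parser also writes a "[Fatal Error]" line to stderr for the second input.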
