Here's a prototype script that fetches a web page and extracts its HTML title. Its main limitation stems from the fact that it parses the HTML into a DOM with an XML parser, so the page must at least be well-formed (blogger.com's page is not).
The dependencies are Apache HttpClient, Apache Commons IO, and the XML libraries from the JDK.
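To make the well-formedness limitation concrete, here is a minimal sketch that uses only the JDK parser. The namespace `wellformed-sketch` and the helper `well-formed?` are made up for this illustration and are not part of the script. The second input has a mismatched end tag and an unclosed `<br>`, which is typical of real-world HTML.

```clojure
(ns wellformed-sketch
  (:import (javax.xml.parsers DocumentBuilderFactory)
           (org.xml.sax InputSource SAXParseException)
           (java.io StringReader)))

;; Hypothetical helper (not in the original script): true if the JDK XML
;; parser accepts the markup, false if it rejects it as not well-formed.
(defn well-formed? [markup]
  (let [builder (.newDocumentBuilder (DocumentBuilderFactory/newInstance))]
    (try
      (.parse builder (InputSource. (StringReader. markup)))
      true
      (catch SAXParseException _
        false))))

(println (well-formed? "<html><head><title>ok</title></head></html>"))
(println (well-formed? "<html><head><title>broken</head><br></html>"))
```

The JDK parser may also print its own "[Fatal Error]" line to stderr for the second input; that comes from its default error handler, not from this code.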
The part I especially like is:
(map #(.getNodeValue (.item nodes %)) (range (.getLength nodes)))
It turns the implicit collection of DOM nodes, exposed only through the two methods "item(index)" and "getLength", into a convenient sequence of their values.
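The same idiom can be pulled out into a small reusable function. The sketch below is one way to do it, assuming only JDK classes; `node-seq` and `li-texts` are hypothetical names chosen for this example, and the type hints are there to avoid reflective calls on the JDK's internal DOM classes.

```clojure
(ns nodelist-sketch
  (:import (javax.xml.parsers DocumentBuilderFactory)
           (org.w3c.dom Node NodeList)
           (org.xml.sax InputSource)
           (java.io StringReader)))

;; Hypothetical helper: wraps a NodeList, which only offers item(index) and
;; getLength, in a lazy seq of its nodes.
(defn node-seq [^NodeList nodes]
  (map (fn [i] (.item nodes (int i)))
       (range (.getLength nodes))))

;; Hypothetical usage: text content of every <li> element in an XML string.
(defn li-texts [xml]
  (let [doc   (-> (DocumentBuilderFactory/newInstance)
                  (.newDocumentBuilder)
                  (.parse (InputSource. (StringReader. xml))))
        nodes (.getElementsByTagName doc "li")]
    (map (fn [^Node n] (.getTextContent n)) (node-seq nodes))))

(println (li-texts "<ul><li>a</li><li>b</li><li>c</li></ul>"))
```

Once the NodeList is a seq, the usual `map`, `filter` and `first` work on it directly. Note that the seq is lazy, so it should be consumed while the document is still around.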
The script is below:
(ns getReport
  (:import
    (java.net URL)
    (java.io ByteArrayInputStream InputStream)
    (org.apache.http.client ResponseHandler HttpClient)
    (org.apache.http.client.methods HttpGet)
    (org.apache.http.impl.client BasicResponseHandler DefaultHttpClient)
    (java.util.regex Pattern Matcher)
    (javax.xml.parsers DocumentBuilderFactory)
    (org.w3c.dom Document)
    (org.apache.commons.io IOUtils)
    (javax.xml.xpath XPathFactory XPathConstants)))
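(defn getResource [url]
  (let [client (DefaultHttpClient.)
        request (HttpGet. url)]
    (try
      ; BasicResponseHandler returns the response body as a String
      (.execute client request (BasicResponseHandler.))
      (finally
        ; release the connections even if the request throws
        (.. client getConnectionManager shutdown)))))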
(defn makeDocFactory []
  (let [factory (DocumentBuilderFactory/newInstance)]
    (doto factory
      (.setValidating false)
      (.setExpandEntityReferences false)
      (.setXIncludeAware false)
      (.setSchema nil))
    (let [dBuilder (.newDocumentBuilder factory)]
      ; This entity resolver disables fetching DTD (w3c will be happy)
      (.setEntityResolver dBuilder
        (proxy [org.xml.sax.EntityResolver] []
          (resolveEntity [publicId systemId]
            (new org.xml.sax.InputSource (new java.io.StringReader "")))))
      dBuilder)))
(defn parseDom [html]
  (let [builder (makeDocFactory)]
    ; Parse from a character stream so the platform default charset is never involved
    (.parse builder (org.xml.sax.InputSource. (java.io.StringReader. html)))))
(defn queryDoc [dom query]
  (let [factory (XPathFactory/newInstance)
        xpath (.newXPath factory)
        expr (.compile xpath query)
        nodes (.evaluate expr dom XPathConstants/NODESET)]
    (map #(.getNodeValue (.item nodes %)) (range (.getLength nodes)))))
(defn parseReport [htmlText]
  (queryDoc (parseDom htmlText) "/html/head/title/text()"))
(println (parseReport (getResource "http://twitter.com/")))
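The entity resolver trick in makeDocFactory can be tried in isolation, without any network access. The sketch below assumes only JDK classes. The namespace `resolver-sketch`, the sample document `xhtml` and the function `title-without-dtd-fetch` are made up for illustration, and it uses `reify` where the script uses `proxy`. It records which system ids the parser asks for and answers each with an empty input source, so the DTD is never downloaded.

```clojure
(ns resolver-sketch
  (:import (javax.xml.parsers DocumentBuilderFactory)
           (org.xml.sax EntityResolver InputSource)
           (java.io StringReader)))

;; Hypothetical sample page with a DOCTYPE that points at an external DTD.
(def xhtml
  (str "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
       "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
       "<html><head><title>Hi</title></head><body/></html>"))

;; Hypothetical function: parses the markup with a resolver that serves an
;; empty document for every external entity, and returns the title together
;; with the system ids the parser asked for.
(defn title-without-dtd-fetch [markup]
  (let [requested (atom [])
        builder   (.newDocumentBuilder (DocumentBuilderFactory/newInstance))]
    (.setEntityResolver builder
      (reify EntityResolver
        (resolveEntity [_ public-id system-id]
          (swap! requested conj system-id)
          (InputSource. (StringReader. "")))))
    (let [doc   (.parse builder (InputSource. (StringReader. markup)))
          title (.item (.getElementsByTagName doc "title") 0)]
      {:title (.getTextContent title)
       :requested @requested})))

(println (title-without-dtd-fetch xhtml))
```

One trade-off of this approach: with the DTD replaced by an empty document, named entities that the DTD would have declared (such as `&nbsp;`) are unknown to the parser, so pages that use them will likely fail to parse.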