Here's a prototype script that fetches a web page and extracts its HTML title. Its main limitation stems from the fact that it parses the HTML into a DOM, so the page must at least be well-formed (blogger.com's page is not).
Dependencies are Apache HttpClient, Apache Commons IO and the XML libraries from the JDK.
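To make the well-formedness limitation concrete, here is a minimal sketch that uses only the JDK's XML parser. The helper name `well-formed?` and both sample strings are made up for illustration and are not part of the script. The parser rejects markup with an unclosed tag, which is the kind of failure a real-world page can trigger.

```clojure
(ns wellformed-example
  (:import (java.io ByteArrayInputStream)
           (javax.xml.parsers DocumentBuilderFactory)
           (org.xml.sax SAXParseException)))

;; Hypothetical helper: returns true when the markup parses as XML,
;; false when the parser reports a well-formedness error.
(defn well-formed? [^String markup]
  (let [builder (.newDocumentBuilder (DocumentBuilderFactory/newInstance))]
    (try
      (.parse builder (ByteArrayInputStream. (.getBytes markup "UTF-8")))
      true
      (catch SAXParseException _ false))))

;; Every tag is closed, so this parses.
(println (well-formed? "<html><head><title>ok</title></head></html>"))

;; The <title> tag is never closed, so the parser gives up.
(println (well-formed? "<html><head><title>broken</head></html>"))
```

The JDK parser may also print its own error message to stderr for the second document; that is its default error handler and not part of the sketch.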
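The part I especially like is:
(map #(.getNodeValue (.item nodes %)) (range (.getLength nodes)))
A DOM NodeList is not a real collection; it exposes its contents only through the two methods "item(index)" and "getLength". This expression walks the indices from 0 to getLength - 1 and turns that implicit collection into a convenient (lazy) sequence of node values.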
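The same idea can be pulled out into a reusable helper. The sketch below is an illustration and not part of the original script: the name `node-seq` and the sample XML are assumptions, and it needs only classes that ship with the JDK.

```clojure
(ns nodelist-example
  (:import (java.io ByteArrayInputStream)
           (javax.xml.parsers DocumentBuilderFactory)
           (org.w3c.dom Document Node NodeList)))

;; Hypothetical helper: wraps any NodeList in a lazy sequence of its nodes
;; by indexing from 0 up to (but not including) getLength.
(defn node-seq [^NodeList nodes]
  (map (fn [i] (.item nodes (int i)))
       (range (.getLength nodes))))

;; A tiny well-formed sample document, parsed from memory.
(def ^Document sample-doc
  (let [xml "<ul><li>one</li><li>two</li></ul>"
        builder (.newDocumentBuilder (DocumentBuilderFactory/newInstance))]
    (.parse builder (ByteArrayInputStream. (.getBytes xml "UTF-8")))))

;; getElementsByTagName returns a NodeList, which map cannot consume
;; directly; node-seq bridges the gap.
(def items
  (map (fn [^Node n] (.getTextContent n))
       (node-seq (.getElementsByTagName sample-doc "li"))))

(println items) ; prints (one two)
```

The type hints are optional; they only spare Clojure from resolving the Java methods by reflection at run time.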
The script is below:
(ns getReport
  (:import (java.net URL)
           (java.io ByteArrayInputStream InputStream)
           (org.apache.http.client ResponseHandler HttpClient)
           (org.apache.http.client.methods HttpGet)
           (org.apache.http.impl.client BasicResponseHandler DefaultHttpClient)
           (java.util.regex Pattern Matcher)
           (javax.xml.parsers DocumentBuilderFactory)
           (org.w3c.dom Document)
           (org.apache.commons.io IOUtils)
           (javax.xml.xpath XPathFactory XPathConstants)))

(defn getResource [url]
  (let [client (DefaultHttpClient.)
        request (HttpGet. url)
        body (.execute client request (BasicResponseHandler.))]
    (.. client getConnectionManager shutdown)
    body))

(defn makeDocFactory []
  (let [factory (DocumentBuilderFactory/newInstance)]
    (doto factory
      (.setValidating false)
      (.setExpandEntityReferences false)
      (.setXIncludeAware false)
      (.setSchema nil))
    (let [dBuilder (.newDocumentBuilder factory)]
      ; This entity resolver disables fetching DTD (w3c will be happy)
      (.setEntityResolver dBuilder
        (proxy [org.xml.sax.EntityResolver] []
          (resolveEntity [publicId systemId]
            (new org.xml.sax.InputSource (new java.io.StringReader "")))))
      dBuilder)))

(defn parseDom [html]
  (let [builder (makeDocFactory)]
    (.parse builder (ByteArrayInputStream. (.getBytes html)))))

(defn queryDoc [dom query]
  (let [factory (XPathFactory/newInstance)
        xpath (.newXPath factory)
        expr (.compile xpath query)
        nodes (.evaluate expr dom XPathConstants/NODESET)]
    (map #(.getNodeValue (.item nodes %)) (range (.getLength nodes)))))

(defn parseReport [htmlText]
  (queryDoc (parseDom htmlText) "/html/head/title/text()"))

(println (parseReport (getResource "http://twitter.com/")))