Two Guys Arguing

CSS Selectors, Java Interop and Scraping

Posted in clojure by youngnh on 11.03.10

Building a DOM

Parsing HTML can be tricky; most of my naive attempts to parse real-world pages produced a lot of stack traces. The Validator.nu HTML parser has so far cleared those low hurdles. It's implemented in Java and has a Maven artifact, which makes it easy to include in a Leiningen project, so it's my current weapon of choice.

:dependencies [[org.clojure/clojure "1.2.0"]
               [org.clojure/clojure-contrib "1.2.0"]
               [nu.validator.htmlparser/htmlparser "1.2.1"]]

It's easy to get a DOM from a webpage using Validator.nu (API docs here): feed HtmlDocumentBuilder an InputSource, which in turn wraps a java.io.Reader, easily created via the reader fn from clojure.java.io:

;; assumes imports of nu.validator.htmlparser.dom.HtmlDocumentBuilder and
;; org.xml.sax.InputSource, plus the reader fn from clojure.java.io
(defn build-document [file-name]
  (.parse (HtmlDocumentBuilder.) (InputSource. (reader file-name))))

Converting the DOM to a seq

Clojure comes with a few very nice tree-walking facilities. We can't use them, though, until we convert a DOM, whose nodes are of type, well, Node and whose branches are NodeLists, into the seqs Clojure is more adept at manipulating.

NodeList has two methods on it, getLength() and item(int index). One approach is to close over an index binding and recursively create the seq:

(defn nodelist-seq [node-list]
  (letfn [(internal [i]
            (lazy-seq
             (when (< i (.getLength node-list))
               (cons (.item node-list i) (internal (inc i))))))]
    (internal 0)))

Another is to keep the current index in an atom, and implement Iterator with it, which Clojure can make into a seq for you:

;; Iterator here is java.util.Iterator
(defn nodelist-seq [node-list]
  (iterator-seq
   (let [i (atom 0)]
     (reify Iterator
       (hasNext [_]
         (< @i (.getLength node-list)))
       (next [_]
         (try
           (.item node-list @i)
           (finally
             (swap! i inc))))))))

Here I'm using try/finally as a stand-in for Common Lisp's prog1: return the current item, then bump the index.
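Either version is used the same way; for example, to seq over a node's children (some-node here is just a stand-in for any DOM node you already have):

;; some-node stands in for any org.w3c.dom.Node you already hold
(nodelist-seq (.getChildNodes some-node))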

With that in place, it's not hard to turn a DOM into a nested seq, which either the zippers found in clojure.zip or Stuart Sierra's clojure.walk should be able to navigate for you quite adeptly.
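The selectors further down also lean on a dom-seq that flattens a node's subtree into a plain depth-first seq of nodes; a minimal version built from tree-seq and the nodelist-seq above looks like this:

;; flatten a node and all of its descendants into a depth-first seq
(defn dom-seq [node]
  (tree-seq #(.hasChildNodes %)
            #(nodelist-seq (.getChildNodes %))
            node))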

Selectors

I’d like to be able to select a node by:

  • id: #statTable1
  • tag name: table
  • class attribute: .class

And I'd like selectors to work from any node I give them. This way I can write a selector that will work at multiple places in a tree, making it more reusable. Being able to turn a DOM into a seq suggests that filtering it with a predicate would be a quick way to write the above selectors. Here are the supporting functions that inspect the nodes themselves:

;; Node here is org.w3c.dom.Node; split comes from clojure.string, .?. from clojure.contrib.core
(defn element-tagname [elt]
  (when (= Node/ELEMENT_NODE (.getNodeType elt))
    (.getNodeName elt)))

(defn get-attribute [elt attr]
  (.?. elt getAttributes (getNamedItem attr) getValue))

(defn hasclass? [elt class]
  (when-let [class-attr (get-attribute elt "class")]
    (some #(= class %) (split class-attr #" "))))

The .?. macro used in get-attribute is remarkably useful. It's analogous to the .. operator in clojure.core for chaining method invocations on objects. Not every Node has attributes, and not every attribute map has the one we're looking for; in both cases the method we invoke returns null, and calling any further method on that null would throw a NullPointerException. .?. does the grunt work of checking for that and short-circuiting to return nil.

The Document object has two methods on it that are just too good to pass up, though. getElementById and getElementsByTagName might give better performance than scanning the entire tree, so if we’re selecting from the root, then we’d like to use them. Multimethods solve our dilemma nicely.

(defn doc-or-node [node & _]
  (if (instance? Document node)
    Document
    Node))

(defmulti id-sel doc-or-node)

(defmulti element-sel doc-or-node)

(defmethod id-sel Document [document id]
  (.getElementById document (.substring id 1)))

(defmethod id-sel Node [node id]
  (filter #(= (.substring id 1) (get-attribute % "id")) (dom-seq node)))

(defmethod element-sel Document [document elt-name]
  ;; wrap the returned NodeList so it chains with mapcat like the Node version
  (nodelist-seq (.getElementsByTagName document elt-name)))

(defmethod element-sel Node [node elt-name]
  (filter #(= elt-name (element-tagname %)) (dom-seq node)))
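The chained example below also uses a class selector, class-sel, which works just like the Node version of element-sel; a minimal sketch built on hasclass?, stripping the leading "." before matching, might be:

;; hypothetical class-sel: select descendants whose class attribute
;; contains the given class, e.g. (class-sel node ".odd")
(defn class-sel [node class]
  (filter #(hasclass? % (.substring class 1)) (dom-seq node)))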

Uniformity

Finally, if each selector takes a single Node and returns a list of Nodes, then I’ll note that you can “chain” selectors together with mapcat.

(->> (element-sel document "body")
     (mapcat #(element-sel % "table"))
     (mapcat #(element-sel % "tr"))
     (mapcat #(class-sel % ".odd")))

With this property, we'd need to make sure that the Document version of id-sel above wraps its single Node in a list. This sort of chaining, taking a bunch of things and applying selectors to them in sequence to get a single result, raises the "use reduce" flag in my head. My first attempt nearly works out of the gate:

(defn $ [node & selectors]
  (reduce mapcat node selectors))

The problems with it are that mapcat takes its function argument first, while reduce passes our selector functions second, and that mapcat expects a collection, not a single item. Here's how I fixed it:

(defn flip [f]
  (fn [& args]
    (apply f (reverse args))))

(defn $ [node & selectors]
  (reduce (flip mapcat) [node] selectors))
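With that, the earlier chain collapses to a single call (this assumes the dom-seq and class-sel sketches from above):

;; document is the DOM returned by build-document
($ document
   #(element-sel % "body")
   #(element-sel % "table")
   #(element-sel % "tr")
   #(class-sel % ".odd"))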

So now we have a new selector that composes the behavior of a bunch of selectors.

The ‘M’ Word

By now, you may have realized that this approach is the same as that suddenly ubiquitous and hip mathematical notion, the List monad. I won’t expound any further on the point, you’re either interested in monads or you’re not. I’m of the mind that they’re a remarkably useful construct, but a bit obtuse when approached from the narrow description of only their mathematical properties.

You can find a larger working example, expanding on all the code in this post, on my GitHub.


One Infinite Loop

Posted in common lisp by youngnh on 08.03.09

I wanted to write a small bot to scrape my Netflix queue from their website, but a simple request to their homepage got stuck in an endless redirect loop.

I’ve used Edi Weitz’s Drakma library for this sort of thing before and I really like it. Here was the code that I wrote:

(let ((cookies (make-instance 'cookie-jar)))
  (http-request "https://www.netflix.com/"
                :method :get
                :cookie-jar cookies))

Requesting http://www.netflix.com would result in being redirected to http://www.netflix.com/Default?tcw=1&cqs=, which would result in being redirected back to netflix.com and so on forever. Drakma actually handled the situation very gracefully. After 6 redirect attempts, it threw an exception saying it had exceeded its redirection limit. I just couldn’t figure out why.

I tried messing with my user agent. Writing scrapers has always made me a little paranoid that the developers on the other end know what I'm up to, are locked in a struggle to the death to prevent my unauthorized use, and of course reach for the most sophisticated and obvious option available to them: my User-Agent header. Drakma has a remarkably convenient built-in facility for spoofing major browsers. It's literally a 20-character change to the function call. I tried that. No dice. Apparently the Netflix developers are as overworked as everyone else, and bots just aren't a big enough problem to justify snuffing them out by filtering User-Agent headers.
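If memory serves, the spoofing was along these lines; Drakma's :user-agent argument accepts a keyword like :firefox (or a literal string) in place of its default identifier:

(let ((cookies (make-instance 'cookie-jar)))
  (http-request "https://www.netflix.com/"
                :method :get
                :user-agent :firefox   ; claim to be Firefox instead of Drakma
                :cookie-jar cookies))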

The redirection eventually terminated in Firefox; otherwise I and millions of other users wouldn't be able to see the page at all. Maybe Firefox has a built-in limit on redirection, at which point it just says "fuck it, I'm just gonna go with this page". But that doesn't make any sense either, because 302s don't come with any HTML to display just in case the browser decides it's tired of bouncing from page to page.

Firebug's net panel showed that with cookies on, I didn't redirect at all; netflix.com returned a 200 OK. With cookies off, I redirected once to /Default, then back to the homepage, and that time a 200 came back with gloriously renderable HTML.

I sent Drakma’s HTTP headers to *standard-output* with a handy little one-liner:

(setq *header-stream* *standard-output*)

Here’s what I saw:

GET / HTTP/1.1
Host: www.netflix.com
User-Agent: Drakma/1.0.0 (SBCL 1.0.30; Linux; 2.6.30-ARCH; http://weitz.de/drakma/)
Accept: */*
Connection: close

HTTP/1.1 302 Moved Temporarily
Date: Mon, 03 Aug 2009 12:47:31 GMT
Server: Apache-Coyote/1.1
P3P: CP="CAO DSP COR DEVa TAIa OUR BUS UNI STA"
Location: http://www.netflix.com/Default?tcw=1&cqs=
Content-Length: 0
Set-Cookie: VisitorId=XXX; Domain=.netflix.com; Path=/
Set-Cookie: country=X; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:32 GMT; Path=/
Set-Cookie: nflxsid=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixSession=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixCookies=XXX; Domain=.netflix.com; Expires=Wed, 02-Sep-2009 12:47:32 GMT; Path=/
Set-Cookie: asearch=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:32 GMT; Path=/
Vary: Accept-Encoding
Cache-Control: private
Keep-Alive: timeout=15, max=93
Connection: Keep-Alive
Content-Type: text/plain
Set-Cookie: NSC_xxx.ofugmjy.dpn=XXX;path=/

GET /Default?tcw=1&cqs= HTTP/1.1
Host: www.netflix.com
User-Agent: Drakma/1.0.0 (SBCL 1.0.30; Linux; 2.6.30-ARCH; http://weitz.de/drakma/)
Accept: */*
Cookie: NSC_xxx.ofugmjy.dpn=XXX; asearch=XXX; NetflixCookies=XXX; NetflixSession=XXX; nflxsid=XXX; country=XXX; VisitorId=XXX
Connection: close

HTTP/1.1 302 Moved Temporarily
Date: Mon, 03 Aug 2009 12:47:32 GMT
Server: Apache-Coyote/1.1
P3P: CP="CAO DSP COR DEVa TAIa OUR BUS UNI STA"
Location: http://www.netflix.com/
Content-Length: 0
Set-Cookie: lastHitTime=XXX; Domain=.netflix.com; Path=/
Set-Cookie: VisitorId=XXX; Domain=.netflix.com; Path=/
Set-Cookie: country=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:32 GMT; Path=/
Set-Cookie: nflxsid=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixCookies=XXX; Domain=.netflix.com; Expires=Wed, 02-Sep-2009 12:47:32 GMT; Path=/
Vary: Accept-Encoding
Cache-Control: private
Keep-Alive: timeout=15, max=84
Connection: Keep-Alive
Content-Type: text/plain

GET / HTTP/1.1
Host: www.netflix.com
User-Agent: Drakma/1.0.0 (SBCL 1.0.30; Linux; 2.6.30-ARCH; http://weitz.de/drakma/)
Accept: */*
Connection: close

HTTP/1.1 302 Moved Temporarily
Date: Mon, 03 Aug 2009 12:47:32 GMT
Server: Apache-Coyote/1.1
P3P: CP="CAO DSP COR DEVa TAIa OUR BUS UNI STA"
Location: http://www.netflix.com/Default?tcw=1&cqs=
Content-Length: 0
Set-Cookie: VisitorId=XXX; Domain=.netflix.com; Path=/
Set-Cookie: country=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:32 GMT; Path=/
Set-Cookie: nflxsid=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixSession=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixCookies=XXX; Domain=.netflix.com; Expires=Wed, 02-Sep-2009 12:47:32 GMT; Path=/
Set-Cookie: asearch=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:32 GMT; Path=/
Vary: Accept-Encoding
Cache-Control: private
Keep-Alive: timeout=15, max=69
Connection: Keep-Alive
Content-Type: text/plain
Set-Cookie: NSC_xxx.ofugmjy.dpn=XXX;path=/

GET /Default?tcw=1&cqs= HTTP/1.1
Host: www.netflix.com
User-Agent: Drakma/1.0.0 (SBCL 1.0.30; Linux; 2.6.30-ARCH; http://weitz.de/drakma/)
Accept: */*
Cookie: lastHitTime=XXX; NSC_xxx.ofugmjy.dpn=XXX; asearch=XXX; NetflixCookies=XXX; NetflixSession=XXX; nflxsid=XXX; country=XXX; VisitorId=XXX
Connection: close

HTTP/1.1 302 Moved Temporarily
Date: Mon, 03 Aug 2009 12:47:32 GMT
Server: Apache-Coyote/1.1
P3P: CP="CAO DSP COR DEVa TAIa OUR BUS UNI STA"
Location: http://www.netflix.com/
Content-Length: 0
Set-Cookie: lastHitTime=XXX; Domain=.netflix.com; Path=/
Set-Cookie: VisitorId=XXX; Domain=.netflix.com; Path=/
Set-Cookie: country=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:33 GMT; Path=/
Set-Cookie: nflxsid=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixCookies=XXX; Domain=.netflix.com; Expires=Wed, 02-Sep-2009 12:47:33 GMT; Path=/
Vary: Accept-Encoding
Cache-Control: private
Keep-Alive: timeout=15, max=92
Connection: Keep-Alive
Content-Type: text/plain

GET / HTTP/1.1
Host: www.netflix.com
User-Agent: Drakma/1.0.0 (SBCL 1.0.30; Linux; 2.6.30-ARCH; http://weitz.de/drakma/)
Accept: */*
Connection: close

HTTP/1.1 302 Moved Temporarily
Date: Mon, 03 Aug 2009 12:47:32 GMT
Server: Apache-Coyote/1.1
P3P: CP="CAO DSP COR DEVa TAIa OUR BUS UNI STA"
Location: http://www.netflix.com/Default?tcw=1&cqs=
Content-Length: 0
Set-Cookie: VisitorId=XXX; Domain=.netflix.com; Path=/
Set-Cookie: country=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:33 GMT; Path=/
Set-Cookie: nflxsid=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixSession=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixCookies=XXX; Domain=.netflix.com; Expires=Wed, 02-Sep-2009 12:47:33 GMT; Path=/
Set-Cookie: asearch=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:33 GMT; Path=/
Vary: Accept-Encoding
Cache-Control: private
Keep-Alive: timeout=15, max=93
Connection: Keep-Alive
Content-Type: text/plain
Set-Cookie: NSC_xxx.ofugmjy.dpn=XXX;path=/

GET /Default?tcw=1&cqs= HTTP/1.1
Host: www.netflix.com
User-Agent: Drakma/1.0.0 (SBCL 1.0.30; Linux; 2.6.30-ARCH; http://weitz.de/drakma/)
Accept: */*
Cookie: lastHitTime=XXX; NSC_xxx.ofugmjy.dpn=XXX; asearch=XXX; NetflixCookies=XXX; NetflixSession=XXX; nflxsid=XXX; country=XXX; VisitorId=XXX
Connection: close

HTTP/1.1 302 Moved Temporarily
Date: Mon, 03 Aug 2009 12:47:32 GMT
Server: Apache-Coyote/1.1
P3P: CP="CAO DSP COR DEVa TAIa OUR BUS UNI STA"
Location: http://www.netflix.com/
Content-Length: 0
Set-Cookie: lastHitTime=XXX; Domain=.netflix.com; Path=/
Set-Cookie: VisitorId=XXX; Domain=.netflix.com; Path=/
Set-Cookie: country=XXX; Domain=.netflix.com; Expires=Tue, 03-Aug-2010 12:47:33 GMT; Path=/
Set-Cookie: nflxsid=XXX; Domain=.netflix.com; Path=/
Set-Cookie: NetflixCookies=XXX; Domain=.netflix.com; Expires=Wed, 02-Sep-2009 12:47:33 GMT; Path=/
Vary: Accept-Encoding
Cache-Control: private
Keep-Alive: timeout=15, max=15
Connection: Keep-Alive
Content-Type: text/plain

Huh, the issue was in Drakma. I would request their homepage with no cookies, and Netflix would respond with a redirect to a different URL and a bunch of Set-Cookie headers. Drakma would follow, dutifully passing a Cookie header, and be redirected back to the original URL, also with Set-Cookie headers. Drakma would again follow, but this time not include cookies in the request.

Netflix is doing the redirects, I think, to "prime the pump". If you don't send cookies to the second URL, you get redirected to a "you need to turn cookies on" page. If you do, Netflix figures you'll end up back at their homepage with a bunch of initialized cookies. It's the weird cookie behavior on Drakma's part that's breaking the scheme.

I wonder who to talk to about this? Edi Weitz? #lisp?
