Two Guys Arguing

Competing Parsers

Posted in clojure, haskell by youngnh on 09.28.11

So, I’m writing a parser for the Turtle serialization format for RDF. In addition to being a format we use all the time at Revelytix, it’s a decently compact grammar, giving me a good chance to implement it using The Parsatron and suss out some of the library’s rough edges.

I hit my first rough edge with the longString production:

longString ::= """ lcharacter* """

but, having already implemented the lcharacter parser, I didn’t see the subtleties in this production and plowed ahead with this straightforward definition:

(defparser long-string []
  (between (times 3 (char \")) (times 3 (char \"))
           (many (lcharacter))))

Which looks great and compiles without complaint, but when you feed it input, it immediately complains:

> (run (long-string) "\"\"\"roughshod\"\"\"")

Unexpected end of input at line: 1 column: 16
[Thrown class java.lang.RuntimeException]

The message here could be better, and I’ll work on that. I would want it to say Unexpected end of input, expected '"""', because what happened was the (many (lcharacter)) parser consumed too much.

Turns out, lcharacter is defined in the grammar to include double quotes, so (many (lcharacter)) ate as many as it could until it literally ran out of input.

A good regex can handle this:

> (re-find #"\"\"\".*\"\"\"" "\"\"\"roughshod\"\"\"")
"\"\"\"roughshod\"\"\""

So we should be able to as well. To keep track of whether or not we’ve consumed a hat trick of quotes, my first attempt looked something like this:

(defparser long-string []
  (letfn [(middle-part [s]
            (let->> [c (lcharacter)]
              (if (= c \")
                (two-left s)
                (middle-part (concat s [c])))))
          (two-left [s]
            (let->> [c (lcharacter)]
              (if (= c \")
                (one-left s)
                (middle-part (concat s [\" c])))))
          (one-left [s]
            (let->> [c (lcharacter)]
              (if (= c \")
                (always s)
                (middle-part (concat s [\" \" c])))))]
    (>> (times 3 (char \"))
        (middle-part []))))

Which uses 3 local, mutually-recursive functions to “count” each consecutive double quote. And they all look a lot alike. I refactored to this:

(defparser long-string []
  (letfn [(middle-part [s n]
            (let->> [c (lcharacter)]
              (if (= c \")
                (case n
                      0 (middle-part s 1)
                      1 (middle-part s 2)
                      2 (always s))
                (middle-part (concat s (repeat n \") [c]) 0))))]
    (>> (times 3 (char \"))
        (middle-part [] 0))))

The above works well, but the problem arises in the first place because lcharacter and """ share the same single-character lookahead. By examining only the next character in the input, we can’t tell if it should belong to lcharacter or """. This suggests that we can look ahead 3 characters at a time, and if we see """, interpret that not as 3 lcharacters, but as a terminating triple double quote.

(defparser long-string []
  (between (times 3 (char \")) (times 3 (char \"))
           (many
            (let->> [cs (lookahead (times 3 (lcharacter)))]
              (if (= cs [\" \" \"])
                (never)
                (lcharacter))))))

I’m not quite sure which way to go, nor can I immediately see a way to make a higher-level lookahead parser that ensures that 2 parsers don’t stomp all over each other, though that would be quite ideal. If you can, chime in below in the comments.

If you’d like to follow the development of The Parsatron, it’s on github


guarding an expression

Posted in clojure by youngnh on 02.01.11

Oftentimes, I want to make some computation, apply the result to a predicate, and if it passes, return that result. If the predicate does not succeed, I usually want to return some other value. I’ve run into this situation before, with no satisfying resolution. My code usually ends up looking like so:

(let [value (some-computation x y z)]
  (if (check? value)
    value
    some-other-default-value))

Which is overly verbose, even for Clojure, if you ask me. There’s a let and value appears 3 times for a fairly straightforward idiom. I think what I’d like to write instead is:

(guard check?
  (some-computation x y z)
  some-other-default-value)

The following macro does the trick of expanding to the verbose form I’ve been writing:

(defmacro guard [pred then else]
  `(let [x# ~then]
     (if (~pred x#)
       x#
       ~else)))
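
To make that concrete, here’s a quick made-up usage (the values are only for illustration):

(guard even?
       (* 3 14)    ; evaluated once and bound; 42 is even, so guard returns 42
       :not-even)  ; returned instead if the predicate had failed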

Now, the awkward thing about the above is that unlike an if-statement, the order in which you read things is not the order in which they get executed. Reading it, the execution happens on line 2 first, line 1 second, and then possibly on line 3. My question to the blogosphere is, does Clojure already have something like this lurking in a lib somewhere? Or is there a blindingly obvious solution to this that I’m overlooking?

One fn to bind them

Posted in clojure, haskell by youngnh on 11.21.10

I had a chance to work on my parsec port a little this weekend. Say hello to one of the most important and ubiquitous parsers in the parsec arsenal, parser-bind.

The idea behind parser-bind is that it should squish two parsers together. It represents parsing one thing after another. The only other parser we’ve built that squishes two parsers together is parser-plus, which operates more like “or” in that if the first one fails, it tries the second. This parser will quit immediately whenever either fails. If this parser succeeds, it’s because each matched successive input.

(defn parser-bind [m n]
  (fn [state cok cerr eok eerr]
    (letfn [(mcok [item state]
              (n state cok cerr cok cerr))
            (meok [item state]
              (n state cok cerr eok eerr))]
      (m state mcok cerr meok eerr))))

If the first parser, m, consumes ok, but the second one, n, does not consume, our combined parser will still call the cok continuation. Conversely, if the first one is empty and ok, but the second one consumes, we will also escape via the cok continuation. parser-bind does not override any of the error handling continuations because if something goes wrong, we use them to exit immediately.

The useful part of parser-bind isn’t in the above implementation. It isn’t how parsec implements the idea. Parsec’s implementation does take the first parser, m, but for its second argument, it takes a function that, when executed, returns the second parser.

This is a neat idea because, unlike a parser that has to be fully specified at write-time, a function can bind intermediate, runtime results. Those intermediate results, once bound and named, can be used to create further parsers. It allows us to write let-like forms:

(p-let [c (one-of "abc")]
  (char c))

Where each binding form in the parser let has to be a destructuring form and parser pair. The above is a parser that parses a character, and then looks for a duplicate of what it just parsed, similar to capture groups in regular expressions. p-let uses parser-bind under the covers:

(defmacro p-let [[& bindings] & body]
  (let [[bind-form p] (take 2 bindings)]
    (if (= 2 (count bindings))
      `(parser-bind ~p (fn [~bind-form] ~@body))
      `(parser-bind ~p (fn [~bind-form] (p-let ~(drop 2 bindings) ~@body))))))

Given only a single binding pair, we make the parser in it the first argument to parser-bind, and wrap a function with its destructuring form as args, returning the body. In longer binding forms, we produce a recursive structure that macroexpand will continue to expand one binding form at a time.
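
For example, the p-let form from earlier expands, roughly, into a single parser-bind call (ignoring the namespace qualification that syntax-quote adds):

(parser-bind (one-of "abc")
             (fn [c] (char c)))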


symbol-macrolet and inside-out test fixtures

Posted in clojure, common lisp by youngnh on 11.15.10

Rolling right along with more ways to write fixtures, let’s say we start with a test that looks like this:

(deftest test-convoluted
  (with-connection hdb
    (try
      (do-commands "create table person (name varchar(255))")
      (insert-values :person [:name] ["bill"] ["joey"])
      (with-query-results results ["select name from person"]
    (is (= "bill" (:name (first results))))
    (is (= "joey" (:name (second results)))))
      (finally
       (do-commands "drop schema public cascade")))))

It does its setup and teardown inside of the deftest itself. The actual “testing” parts are the two (is) forms deeply nested inside. I’ve already written a post on how the clojure.test lib that comes with Clojure addresses this kind of complexity by providing fixtures. However, they’re a bit clumsy and aren’t particularly fine-grained. You can either run them once before (and after) all tests run, or once before and after each test runs.
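
For reference, a clojure.test fixture is just a fn that receives the test (or the whole test run) as a fn. Reusing the setup and teardown from the test above, it would look roughly like this sketch (the sql fns are the same clojure.contrib ones used throughout):

(defn person-schema-fixture [f]
  (with-connection hdb
    (try
      (do-commands "create table person (name varchar(255))")
      (insert-values :person [:name] ["bill"] ["joey"])
      (f)                               ; run the test(s)
      (finally
       (do-commands "drop schema public cascade")))))

;; :each wraps every deftest in the namespace, :once wraps the whole run
(use-fixtures :each person-schema-fixture)

Note that each test body still has to open its own with-query-results, since there’s no way to hand results over; that’s exactly the clumsiness I’m complaining about.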

The Common Lisp testing framework FiveAM has a different approach to defining and using fixtures. The FiveAM library defines two macros, def-fixture and with-fixture which define a fixture form and execute forms inside of a named fixture, respectively.

To define a fixture that takes care of the setup and teardown in the test above, we would write something like this:

(def-fixture person-schema []
  (with-connection hdb
    (try
      (do-commands "create table person (name varchar(255), age integer)")
      (insert-values :person [:name] ["bill"] ["joey"])
      (with-query-results results ["select name from person"]
        (test-body))
      (finally
       (do-commands "drop schema public cascade")))))

In the above, test-body is a special captured variable that will be replaced with whatever forms you later specify. It’s where the meat of your test will go. You specify what to run there as the body of the with-fixture macro, thusly:

(deftest test-ideal
  (with-fixture person-schema []
    (is (= "bill" (:name (first results))))
    (is (= "joey" (:name (second results)))))))

The with-fixture form names the fixture we want to use, and takes care of expanding it in such a way that the variables are still checked by the compiler. If our fixture didn’t declare results in the scope that our assertions were expanded to, the compiler would complain just as if we had written the whole thing out by hand.

We can make the fixture in our example even more flexible. def-fixture can declare arguments and with-fixture can provide them. Altering our setup slightly to add a second column, we can then pass in a specific query to be run per fixture:

(def-fixture person-schema [query]
  (with-connection hdb
    (try
      (do-commands "create table person (name varchar(255), age integer)")
      (insert-values :person [:name :age] ["bill" 25] ["joey" 35])
      (with-query-results results [query]
        (test-body))
      (finally
       (do-commands "drop schema public cascade")))))

And then we can get a lot more mileage out of our solitary fixture:

(deftest test-ideal
  (testing "name column"
   (with-fixture person-schema ["select name from person"]
     (is (= "bill" (:name (first results))))
     (is (= "joey" (:name (second results))))))

  (testing "age column"
    (with-fixture person-schema ["select age from person"]
      (is (= 25 (:age (first results))))
      (is (= 35 (:age (second results)))))))

This is a trick that defmacro can’t easily perform. If we wanted to define person-schema as a macro, in order to capture results, we’d have to put ~'results somewhere in a backtick form. It’ll work, but for any significant number of capturing symbols, there are tildes and ticks everywhere. For a feature that would ostensibly have users writing lots and lots of their own expansions, that’s a major drawback, in my opinion. Early versions of the newly re-written ClojureQL had users do this, I believe, with little snippets of macros sprinkled throughout their code. It turns out that we can have our cake and eat it too, and you’ve probably guessed how from the title of this post.

I copped the implementation straight from FiveAM. Common Lisp has an advantage here, as the language has built-in local macros (Clojure’s are global to a namespace) and symbol macros, which FiveAM uses to great effect. However, Konrad Hinsen (of clojure.contrib.monad fame) has implemented local and symbol macros in the clojure.contrib.macro-utils lib. I used that.

My Clojure implementation consists of the two macros and a global dynamic variable that holds an atom mapping fixture names to their arguments and bodies. The def-fixture macro takes care of assoc-ing them as they are defined. The with-fixture macro pulls that fixture definition and constructs an anonymous function from it, as well as making test-body a symbol macro that expands to the given body of with-fixture.

(def *fixtures* (atom {}))

(defmacro def-fixture [name args & body]
  `(swap! *fixtures* assoc '~name (cons '~args '~body)))

(defmacro with-fixture [name args & body]
  (let [[largs lbody] (get @*fixtures* name)]
    `(symbol-macrolet [~'test-body (fn [] ~@body)]
                  ((fn ~largs ~lbody)
                   ~@args))))

I could have used a local macro, by swapping the symbol-macrolet form for a macrolet form, but then I would have had to quote the body parameter passed to with-fixture. By using a symbol macro and asking that users treat it like a fn, I can avoid that. It’s a small thing and either way works. For 8 lines of code overall, these 2 macros add a lot of flexibility to how you can define test fixtures.
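
As a tiny, self-contained sanity check of the mechanism (the fixture below is made up and has nothing to do with databases), note that the body really does get expanded inside the fixture’s scope, so result is visible to it:

(def-fixture doubled [n]
  (let [result (* 2 n)]
    (test-body)))

(with-fixture doubled [21]
  (is (= 42 result)))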


Fixture Macros

Posted in clojure by youngnh on 11.13.10

A few months ago at Revelytix, we put together a pretty large test harness for putting Semantic Triplestores through their paces. The effort was highly dependent on setup and teardown of the external environment.

For instance, in order to run a single test that involves making an http request to a triplestore-backed web application, I had to write code to start the virtual machine my web application and triplestore images were installed on, restore that vm to a known-state snapshot, wait for the guest operating system to come online, log in to the web application and establish a session, and oh yeah, actually make the http request to see the results I was really interested in.

Ultimately, we went with a design that passed a lot of maps around, but tonight I found myself reviewing some of the other possibilities we could have pursued. Clojure maintains great separation of concerns by allowing you to pass functions around. For example, here’s the function I originally wrote to setup and teardown a virtual machine, running some arbitrary function f in-between:

(defn vm-setup [vm snapshot f]
  (restore-snapshot vm snapshot)
  (start-vm vm)
  (try
   (f)
   (catch Exception e
     (.printStackTrace e))
   (finally
    (save-vm-state vm))))

The function knows nothing about what f does and so f can be anything.
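
For example, any no-arg fn will do as the payload (this one is made up):

(vm-setup "WebAppVM1" "CleanSnapshot"
          (fn [] (println "vm restored and started")))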

After the call to start-vm, VirtualBox launches immediately, but it often takes 10-15 seconds for the guest OS to start running and communicating. Executing f immediately after starting the vm will often fail if you don’t first wait for the guest OS to come online. I could write that sort of functionality into the vm-setup function, but that’s mixing concerns, and I could just as easily write another setup function that tries to ping the guest OS, executing its payload once the machine has started communicating:

(defn host-reachable? [host timeout f]
  (if (ping host timeout)
    (f)
    (throw (Exception. (str "Could not reach host: " host " within " timeout " ms")))))

This pattern goes on. Conceptually, I ended up with a single test being expressed like this:

(vm-setup "WebAppVM1" "CleanSnapshot"
  (host-reachable? "hostname" (* 30 1000)
    (webapp-login "hostname" "username" "password"
      (times 30
        (timethis
          (http-request "/some/webapp/path"))))))

Which is pretty readable, and its structure matches its meaning. The test itself is the call to

(http-request "/some/webapp/path")

and it’s nested pretty deeply in an execution context.
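
As an aside, times and timethis follow the same convention of taking a payload fn. I haven’t shown their definitions, but they’d look something like these sketches (the bodies here are assumptions):

(defn times [n f]
  (dotimes [_ n]
    (f)))

(defn timethis [f]
  (let [start (System/nanoTime)]
    (try
      (f)
      (finally
       (println "elapsed ms:" (/ (- (System/nanoTime) start) 1000000.0))))))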

If we needed to reuse all of those steps, we could make it into a composite setup function:

(defn start-vm-and-login [f]
  (vm-setup "WebAppVM1" "CleanSnapshot"
    (host-reachable? "hostname" (* 30 1000)
      (webapp-login "hostname" "username" "password"
        (times 30
          (timethis
            (f)))))))

The code above, however, won’t run. The forms aren’t function objects, they’re function invocations that return values. We’d have to wrap each of them in (fn [] ) in order to lambda-ize them:

(vm-setup "WebAppVM1" "CleanSnapshot"
  (fn []
    (host-reachable? "hostname" (* 30 1000)
      (fn []
        (webapp-login "hostname" "username" "password"
          (fn []
            (times 30
              (fn []
                (timethis
                  (fn [] (http-request "/some/webapp/path")))))))))))

We can write our own macro to take the first version and expand into the second, and it helps if we notice how similar it is to the ->> macro. In fact, we could write:

(->> (http-request "/some/webapp/path")
     (fn [])
     (timethis)
     (fn [])
     (times 30)
     (fn [])
     (webapp-login "hostname" "username" "password")
     (fn [])
     (host-reachable? "hostname" (* 30 1000))
     (fn [])
     (vm-setup "WebAppVM1" "CleanSnapshot"))

Which reads a little bit backwards, but provides a great target for our macro to generate:

(defmacro fixture [& forms]
  `(->> ~@(interpose '(fn []) (reverse forms))))

Allowing us to write:

(fixture (vm-setup "WebAppVM1" "CleanSnapshot")
         (host-reachable? "hostname" (* 30 1000))
         (webapp-login "hostname" "username" "password")
         (times 30)
         (timethis)
         (http-request "/some/webapp/path"))

Not bad


Porting Haskell’s Parsec

Posted in clojure, haskell by youngnh on 11.11.10

Parsec is a parser-combinator library. Parser combinators are built around the idea of making a bunch of very small and focused parsers and combining them using operators that one would more usually see in regular expressions, ultimately leading to parsers that feel more like function calls in a program than a stiff declaration of a grammar. Parsec is king in Haskell-land. In Clojure, however, there are a number of libraries fully- and not-so-fully written that can be used to write parsing programs. fnparse, Amotoen, clarsec, parser, clj-peg are just a few, feel free to mention your favorites in the comments. I don’t mean to leave any out, but rather point out that what I’m doing here is not new. I do hope it’s illuminating for some.

Parsec, as I see it, boils down to 2 ideas:

  • A parser either consumes input or doesn’t. Consumed or Empty.
  • A parser either succeeds in parsing or it fails. Ok or Err.

These outcomes can be combined into 4 continuation functions that are passed to every parser:

  • cok – Consumed & Ok
  • cerr – Consumed & Err
  • eok – Empty & Ok
  • eerr – Empty & Err

As for errors, Parsec defines two types of them: those that we can say something about, and those that we can say nothing about. These are errors with messages and unknown errors, respectively. Of the errors that we can say something about, some are the result of not finding input that the parser was expecting, which lead to messages like “expected ‘a’ and found ‘b’”, and some are the result of not finding input where we expected to, which lead to messages like “unexpected end of input”.
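
A minimal sketch of the error values is enough for this post. The names match the constructors used in the code below, but the exact shape is an assumption on my part:

(defrecord ParseError [pos msgs])

(defn unexpect-error [msg pos]
  (ParseError. pos [(str "unexpected " msg)]))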

Finally, Parsec keeps tabs on the thing it’s parsing: it maintains state. The state is made up of 2 elements: the input stream itself, which a Clojure seq models nicely, and the current source position, itself made up of the name of the input and one’s current line and column location in it.
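
In code, that state is small. Something like this sketch, where InputState matches the constructor used further down and SourcePos is my guess at the position record:

(defrecord SourcePos [name line column])

(defrecord InputState [input pos])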

The Most Basic Parsers

The simplest parser is the one that no matter what, returns a constant value. This is called parserReturn in Haskell, but in Clojure, it’s more akin to the constantly function, so I’ve named it always, and here’s its simplified implementation:

(defn always [x]
  (fn [state cok cerr eok eerr]
    (eok x state)))

This implementation makes sense. No matter what, it returns a new parser. A parser is merely a fn that takes a state and 4 continuations. The always parser always calls the Empty & Ok continuation. Nothing was removed from the stream (hence the Empty part), and everything should continue on as normal (the Ok part).

Equally simple is the parser that always fails. This is called parserZero in Haskell, since it represents a “nothing” parser.

(defn parser-zero []
  (fn [state cok cerr eok eerr]
    (eerr (unknown-error state))))

(defn unknown-error [{:keys [pos] :as state}]
  (ParseError. pos []))

More Interesting Parsers

One of the more basic parsers in Parsec is tokenPrim, which processes a single element from the underlying stream. It unconses the first element from the head of the input, tests if it is supposed to be consumed and then updates the state’s current position in the input. To do this, it takes 3 functions.

  • nextpos calculates a new source position based on the item consumed and the old position
  • test takes a single element from the underlying stream and returns whether or not to consume it
  • showToken is used to create readable error messages by returning a string representation of stream elements

(defn token-prim [show-f nextpos-f consume?]
  (fn [{:keys [input pos] :as state} cok cerr eok eerr]
    (if-let [s (seq input)]
      (let [item (first s)
            rest-of-input (next s)]
        (if (consume? item)
          (let [newpos (nextpos-f pos item rest-of-input)
                newstate (InputState. rest-of-input newpos)]
            (cok item newstate))
          (eerr (unexpect-error (show-f item) pos))))
      (eerr (unexpect-error "" pos)))))

There are three ways the above function continues. Two are through eerr, one when there is nothing left in the seq when we were expecting to parse something, and one when we did parse something, but our test told us not to consume it. In the second case we can produce a decently readable description of the item so that we can later present it to the user. Finally, if our test tells us to go ahead and consume the item, we call cok passing it the item and a newly calculated state with a new position and the input without our consumed item on the front.
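
To make that concrete, here’s a sketch of a single-character parser built on token-prim. The way it bumps the column assumes the SourcePos shape from earlier and isn’t part of the port itself:

(defn char-parser [c]
  (token-prim str                             ; showToken
              (fn [pos item rest-of-input]    ; nextpos
                (assoc pos :column (inc (:column pos))))
              #(= c %)))                      ; test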

There are a lot of parsers we can implement on top of token-prim; however, it’s got no brain. You can only line up a number of token parsers one after another and let them tell you if the input matched in the order you thought it would. We can’t express the idea of “or” with it. For that, Parsec relies on the parserPlus parser. It’s called “plus” because it’s used to glue multiple parsers into a single one, analogous to how addition of numbers glues them all together into a new, single number (I never used to think about things like this. Haskell has made me re-understand everything I already knew).

The strategy for implementing parserPlus is that it will take 2 parsers and try the first one. If that succeeds, we’ll go with that. If it doesn’t, we try the second one, and if it succeeds we want our combined parser to be indistinguishable from that second parser. If neither work, then our parser didn’t work and we want to escape like any other parser would if it failed. Calling the first parser is easy. For the sake of staying close to the original Haskell, we’ll call this parser m. Parsers in Haskell and Clojure are simply functions, so in order to try it, we can invoke it and pass the current state and the 4 continuations it expects.

The continuations are our hook to intercept failures. We know that if m fails, it will call the fourth continuation we pass it. So to try the second parser, n, we’re going to wrap the eerr function (the 4th continuation) so that it tries that second parser before giving up and calling eerr. Here’s how it looks in Clojure:

(defn parser-plus [m n]
  (fn [state cok cerr eok eerr]
    (letfn [(meerr [err]
               (letfn [(neok [item state-prime]
                          (eok item state-prime))
                        (neerr [err-prime]
                          (eerr (merge-error err err-prime)))]
                 (n state cok cerr neok neerr)))]
      (m state cok cerr eok meerr))))

The locally nested functions aren’t exactly readable at a glance, but combined with the knowledge of what’s happening it’s a really elegant way to express the idea. Also, as a small note, there aren’t great names for some of the nested function parameters. state-prime and err-prime? Well, that’s a holdover from Haskell to express that the thing is an altered version of the thing it came from. In mathematics, this is expressed as a tick, state' and err'. Those aren’t legal Clojure 1.2 identifiers, so I opted to be verbose. Starting with the Clojure 1.3 alphas available now, tick is a legal constituent character, which means you can use it anywhere in an identifier except as the first character.
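
With the char-parser sketch from earlier, using parser-plus is as simple as:

(def a-or-b (parser-plus (char-parser \a) (char-parser \b)))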

The last parser I’d like to tackle in this blog post is manyAccum. This parser wraps behavior around an existing parser and so becomes a tangle of continuation functions just like parser-plus was, but unlike parser-plus, manyAccum only accepts one parser and attempts to apply it 0 or more times. This is the parser equivalent of the Kleene star operator.

Just like parser-plus, we’re going to invoke the parser manyAccum is given and create a new parser by manipulating the continuations we pass to it. Specifically, if the parser we’re given fails to consume any input (calls eerr), we’re going to hijack that and report that it was instead an eok with an empty list. If the parser succeeds in consuming input, we’re going to try to get it to do it again. And again. And again forever. Here’s what it looks like:

(defn many-accum [p]
  (fn [state cok cerr eok eerr]
    (letfn [(many-err [err]
              (throw (RuntimeException. "combinator '*' is applied to a parser that accepts an empty string")))
            (continue [coll item state-prime]
              (p state-prime (partial continue (cons item coll)) cerr many-err (fn [_] (cok (cons item coll) state-prime))))]
      (p state (partial continue (seq [])) cerr many-err (fn [_] (eok [] state))))))

We define many-err to immediately quit with an exception if the third continuation, eok, is called since that means that p accepts empty strings and would spin forever if we let it. The only other trick to many-accum is that we create continue to accumulate items by first calling it with an empty seq, (seq []), and then consing further consumed items onto the front. Haskell’s manyAccum takes a cons-like operator in addition to p as a more flexible way of creating a list of elements.

A final Note

I intentionally stayed away from Monads in this post (which is no easy task when porting Haskell), averting my eyes from Konrad Hinsen’s clojure.contrib.monad and trying wherever possible to make Clojure functions feel less like Haskell functions obsessed with parentheses. Not because Monads are particularly special or complex, but rather just the opposite. Monads fall out of designs that favor composability and uniformity. The first parser of this post, always, is half of an implementation of Monad. parser-zero and parser-plus are 100% of a smaller class of monads called MonadPlus. Reading clj-http’s source, I felt like it was such clean and idiomatic Clojure, with fantastic composability properties that made it easy to build on top of, but also like it would be very easy to express in Haskell and not feel forced or awkward. So it’ll be interesting to finish this port and see if I can succeed in doing the same in the opposite direction.


A Brief Note on How clj-http Works

Posted in clojure by youngnh on 11.07.10

It’s not Ring, but it’s close

Mark McGranaghan’s Ring library is probably better known than clj-http. It’s largely for serving and responding to HTTP requests. clj-http is the same idea on the opposite side of the protocol. It’s for making HTTP requests.

clj-http’s introduction page is at http://mmcgrana.github.com/2010/08/clj-http-clojure-http-client.html.

Poking at it, trying to make sense of the code, I couldn’t quite grasp what was going on.

Here’s the definition of clj-http.client/request, featured prominently on clj-http’s introduction page:

(def request
     (-> #'core/request
         wrap-redirects
         wrap-exceptions
         wrap-decompression
         wrap-input-coercion
         wrap-output-coercion
         wrap-query-params
         wrap-basic-auth
         wrap-accept
         wrap-accept-encoding
         wrap-content-type
         wrap-method
         wrap-url))

request is the building block of higher-level requests that most developers would worry themselves with. get and post and their ilk are implemented in terms of request.

The code flows both ways

As to what exactly request is doing, I think it’s safe to say that all of the things written above… well, happen; redirects are handled, the content is typed, input and output are compressed and/or coerced. But in what order? The -> operator that I maligned in a previous post turns a series of lexically-backwards function calls into a sequential listing, but in this particular case, that’s misleading as well: even though wrap-url is called last, its effects on the computation actually occur first. Wha?

It clicked for me when I realized that clj-http is written in a Continuation Passing Style. The only kind of continuation that clj-http worries itself with is the sending off of a request. All of these wrap- methods refer to their continuation as client, and when they call it, they expect to get back a HTTP response.

Let’s take a look at wrap-url:

(defn wrap-url [client]
  (fn [req]
    (if-let [url (:url req)]
      (client (-> req (dissoc :url) (merge (parse-url url))))
      (client req))))

It takes a client, and a client takes a request. A request is a map of various values that affect what gets included in the HTTP request. wrap-url checks if its request has a convenience key on it, :url, and if so, it breaks the url up into a bunch more specific parts using parse-url and then merges them with the given request map. Now, the really cool and ingenious part of this is that wrap-url doesn’t actually do any of this when called, but instead returns a fn that will. That fn — you guessed it — is a “client”, which means that the result of wrap-url can then be passed to other request-altering fns as their continuation. All of the wrap- methods modify the client you give them to produce one with the underlying client behavior and whatever new behavior they see fit to add.

So, back to the question of when the methods in request‘s long arrow chain actually take effect: clj-http doesn’t change the semantics of the -> operator, so wrap-url does indeed get called last. It is passed the client created by wrap-method, who was passed the client created by […insert your intelligence from your own explorations here..] which is passed the clj-http.core/request fn. clj-http.core/request is the only fn in the whole lot that actually knows how to make an actual http request (and even then it has Apache’s HttpClient do most of the heavy lifting for it).

So the last function in the arrow chain is the first function to get a crack at modifying the request. Conversely, it’s the last function to get a crack at modifying the response as it must get the response from the client passed to it from wrap-method, which must get its response from […oh god, not this again…] which gets it from clj-http.core/request.

It is up to each wrap- method along the way to decide whether or not it’s actually concerned with the request or the response (or both). There are plenty of fns in the stack that demonstrate each choice. wrap-url is a good example of a fn that modifies the request on the way down, but returns the response untouched.
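
Writing your own middleware in this style is straightforward. Here’s a sketch of one that touches the request on the way down and the response on the way back up (the keys it reads, :url and :status, are ones clj-http’s request and response maps already use):

(defn wrap-logging [client]
  (fn [req]
    (println "requesting" (:url req))
    (let [resp (client req)]
      (println "got status" (:status resp))
      resp)))

;; wraps the fully-built request fn shown earlier
(def logging-request (wrap-logging request))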

I hope this helps anybody who was thinking about writing stuff on top of clj-http, but couldn’t immediately figure out how to stop their code from automatically redirecting or how to gracefully check that they’ve logged into a website before making requests to pages behind a user account.


Reloading Multimethods

Posted in clojure by youngnh on 11.05.10

Replacing code in a running Lisp has always been a bit of a black art to me. I got wedged in a weird situation with multimethods today:

I started clojure via slime and my first test and implementation loaded and ran without any problems:

(ns lambda-reductions
  (:use clojure.test))

(deftest test-substitute
  (testing "variable substitution"
    (is (= 'r (substitute 'x ['x 'r])))
    (is (= 'y (substitute 'y ['x 'r])))))

(defmulti substitute (constantly 'variable))

(defmethod substitute 'variable [form [from to]]
  (if (= form from) to form))

I wrote my code and my tests in the same namespace, as I was just going for speed here.

My second test worked fine too, the only real code “replacement” being that I altered my dispatch function and added another method to the multimethod:

(testing "apply substitutions"
  (is (= '(foo r) (substitute '(foo x) ['x 'r]))))

(defmulti substitute (fn [form _]
                       (cond (list? form) 'apply
                             :otherwise 'variable)))

(defmethod substitute 'apply [[f & args] replace]
  (cons (substitute f replace) (map #(substitute % replace) args)))

I hit C-c C-k (slime-compile-and-load-file), reran my tests from the repl and they seemed to be picked up.

Then I added a third case, and for some reason I could not get Clojure to recognize that it had loaded it:

(testing "lambda substitutions"
  (is (= '(fn [x] 42) (substitute '(fn [x] 42) ['x 'r])))
  (is (= '(fn [x] x) (substitute '(fn [x] x) ['x 'r])))
  (is (= '(clojure.core/fn [y] r) (substitute '(fn [y] x) ['x 'r])))
  (is (= '(clojure.core/fn [y] (foo r)) (substitute '(fn [y] (foo x)) ['x 'r]))))

(defmulti substitute (fn [form _]
                       (cond (and (list? form) (= 'fn (first form))) 'lambda
                             (list? form) 'apply
                             :otherwise 'variable)))

(defmethod substitute 'lambda [[_ [arg] t :as form] [from to :as replace]]
  (if-not (= arg from)
    `(fn [~arg] ~(substitute t replace))
    form))

My tests failed. Which is a fair enough response when I’ve written my code wrong, but after tweaking things enough times, I was convinced I had the impl right. I C-c C-k‘d a couple more times, just to be sure and then tried a bunch of things to verify that Clojure had loaded the code I asked it to. By every indication, it had.
  • I used the methods function. (keys (methods substitute)) showed that my dispatch value had been registered.
  • I used the get-method function. ((get-method substitute 'lambda) '(foo [x] (foo x)) ['x 'r]) returned the expected result.

Clojure knew all about my method and the function itself was right. For some reason the dispatcher wasn’t sending my stuff to the right place (a small lesson learned: I’m going to define my dispatch functions as top-level fns instead of anonymous fns in the defmulti form, to make them more testable).
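
That lesson, applied to the example above, is just a named top-level fn handed to defmulti, something you can poke at directly from the repl (this is only about testability, it doesn’t explain the reloading weirdness):

(defn substitute-dispatch [form _]
  (cond (and (list? form) (= 'fn (first form))) 'lambda
        (list? form) 'apply
        :otherwise 'variable))

(defmulti substitute substitute-dispatch)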

I tried remove-method on each variation in turn, and then C-c C-k‘d again to try and re-add them. No joy.
I tried remove-all-methods, thinking perhaps I’d been too gentle before. Another C-c C-k and my tests still failed.

So I did what every programmer finally does when the magic in the machine stops responding to their incantations: I killed Clojure. M-x slime-quit-lisp and then restarted. A single C-c C-k and re-running my tests showed them all passing. A remove-ns might’ve saved me from having to quit my running lisp altogether, but I’m not sure. I’m still not sure what happened to cause the behavior in the first place.

If you know what I did to get on the bad side of multimethods, chime-in in the comments below.

CSS Selectors, Java Interop and Scraping

Posted in clojure by youngnh on 11.03.10

Building a DOM

Parsing HTML can be tricky, most of my naive attempts to parse real-world pages produced a lot of stack traces. The Validator.nu HTML parser has so far cleared those low hurdles. It’s implemented in Java and it has a maven artifact, which makes it easy to include in a leiningen project, so it’s my current weapon of choice.

:dependencies [[org.clojure/clojure "1.2.0"]
               [org.clojure/clojure-contrib "1.2.0"]
               [nu.validator.htmlparser/htmlparser "1.2.1"]]

It’s easy to get a DOM from a webpage using Validator.nu (api docs here): feed HtmlDocumentBuilder an InputSource, which you in turn feed a java.io.Reader, easily created via the reader fn from clojure.java.io:

(defn build-document [file-name]
  (.parse (HtmlDocumentBuilder.) (InputSource. (reader file-name))))

Converting the DOM to a seq

Clojure comes with a few very nice tree walking facilities. We can’t use them until we convert a dom with nodes of type, well, Node and branches of NodeList into seqs that Clojure is more adept at manipulating.

NodeList has two methods on it, getLength() and item(int index). One approach is to close over an index binding and recursively create the seq:

(defn nodelist-seq [node-list]
  (letfn [(internal [i]
            (lazy-seq
             (when (< i (.getLength node-list))
               (cons (.item node-list i) (internal (inc i))))))]
    (internal 0)))

Another is to keep the current index in an atom, and implement Iterator with it, which Clojure can make into a seq for you:

(defn nodelist-seq [node-list]
  (iterator-seq
   (let [i (atom 0)]
     (reify Iterator
       (hasNext [_]
         (< @i (.getLength node-list)))
       (next [_]
         (try
           (.item node-list @i)
           (finally
             (swap! i inc))))))))

Where I’m using try/finally as a replacement for Common Lisp’s prog1.

With that in place, it’s not hard to turn a DOM into a nested seq, which either zippers found in clojure.zip or Stuart Sierra’s clojure.walk should be able to navigate for you quite adeptly.

Selectors

I’d like to be able to select a node by:

  • id: #statTable1
  • tag name: table
  • class attribute: .class

And I’d like selectors to work from any node I give them. This way I can write a selector that will work at multiple places in a tree, making them more reusable. Being able to turn a DOM into a seq suggests that filtering it on a predicate would be a quick way to write the above selectors; here are supporting functions to inspect the nodes themselves:

(defn element-tagname [elt]
  (when (= Node/ELEMENT_NODE (.getNodeType elt))
    (.getNodeName elt)))

(defn get-attribute [elt attr]
  (.?. elt getAttributes (getNamedItem attr) getValue))

(defn hasclass? [elt class]
  (when-let [class-attr (get-attribute elt "class")]
    (some #(= class %) (split class-attr #" "))))

The .?. macro used in get-attribute is remarkably useful. It’s analogous to the .. operator in clojure.core for chaining method invocations on objects. As not all Node objects have attributes on them, and not all attributes have the one we’re looking for, in both cases a null value is returned by the method invoked. Trying to invoke a further method on that null would throw an NPE. .?. does the grunt-work of handling that and short-circuiting to return nil.

The Document object has two methods on it that are just too good to pass up, though. getElementById and getElementsByTagName might give better performance than scanning the entire tree, so if we’re selecting from the root, then we’d like to use them. Multimethods solve our dilemma nicely.

(defn doc-or-node [node & _]
  (if (instance? Document node)
    Document
    Node))

(defmulti id-sel doc-or-node)

(defmulti element-sel doc-or-node)

(defmethod id-sel Document [document id]
  (.getElementById document (.substring id 1)))

(defmethod id-sel Node [node id]
  (filter #(= (.substring id 1) (get-attribute % "id")) (dom-seq node)))

(defmethod element-sel Document [document elt-name]
  (.getElementsByTagName document elt-name))

(defmethod element-sel Node [node elt-name]
  (filter #(= elt-name (element-tagname %)) (dom-seq node)))
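
The class selector used in the chain below follows the same shape as the Node version of id-sel; a sketch would be (dom-seq is the same seq-building helper the other Node methods rely on):

(defn class-sel [node class]
  (filter #(hasclass? % (.substring class 1)) (dom-seq node)))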

Uniformity

Finally, if each selector takes a single Node and returns a list of Nodes, then I’ll note that you can “chain” selectors together with mapcat.

(->> (element-sel document "body")
     (mapcat #(element-sel % "table"))
     (mapcat #(element-sel % "tr"))
     (mapcat #(class-sel % ".odd")))

With this property, we’d need to make sure that the Document version of id-sel above wraps its single Node in a list. This sort of chaining ability, of taking a bunch of things and applying them in sequence to get a single thing, throws up the “use reduce” flags in my head. My first attempt nearly works out of the gate:

(defn $ [node & selectors]
  (reduce mapcat node selectors))

The problems with it being that mapcat takes its function argument first, while we’re passing our selector functions in second, and that mapcat takes a list, not a single item. Here’s how I fixed it:

(defn flip [f]
  (fn [& args]
    (apply f (reverse args))))

(defn $ [node & selectors]
  (reduce (flip mapcat) [node] selectors))

So now we have a new selector that composes the behavior of a bunch of selectors.
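
Rewriting the earlier chain with $, each selector becomes a one-argument fn:

($ document
   #(element-sel % "body")
   #(element-sel % "table")
   #(element-sel % "tr")
   #(class-sel % ".odd"))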

The ‘M’ Word

By now, you may have realized that this approach is the same as that suddenly ubiquitous and hip mathematical notion, the List monad. I won’t expound any further on the point, you’re either interested in monads or you’re not. I’m of the mind that they’re a remarkably useful construct, but a bit obtuse when approached from the narrow description of only their mathematical properties.

You can find a larger working example expanding upon all the code in this post on my github


Starting a Cluster on EC2 with Pallet

Posted in clojure by youngnh on 11.01.10

Pallet is a node provisioning, configuration and administration tool. It is designed to make small to midsize deployments simple.

Pallet is built on top of the java library jclouds. There is a Clojure API for jclouds, written by the same contributors that have put in so much work on Pallet, and it’s what Pallet itself uses, for the most part. jclouds seems to be more focused on provisioning and controlling compute instances, while Pallet is a configuration and administration tool that just happens to need that provisioning capability.

Getting started

The Pallet docs are pretty good, but they left me with some questions and produced some behavior that I had to investigate further to understand. In that way, Pallet is currently a bit of a narrow path on top of steep cliffs, wander off the trail too far and things get choppy. To get started with Pallet, I cloned Hugo Duncan’s repo at git://github.com/hugoduncan/pallet.git and started a swank server with lein swank. Easy as could be.

Node Types

Pallet has a notion of nodes: identically configured machines. A node is identified by a tag, and you specify up front how it’s supposed to be configured. A plain vanilla node with no configuration or other special customizations can be created via defnode:


(defnode vanilla {})

You specify a node’s configuration via phases. :bootstrap and :configure are the two major phases of a node. :bootstrap is run exactly once when a node is started for the first time, useful for setting up users, basic services and settings to grant access to the box. :configure is a bit more general and can be thought of as the steps needed to bring a node to a baseline configuration.

I wanted to start a 5 machine CouchDB cluster, so I defined a couchdb node type:


(defnode couchnode {}
  :bootstrap (phase (automated-admin-user))
  :configure (phase (java :openjdk)
                    (couchdb)))

Node names can’t have hyphens in them, so scrunch everything together. The phase calls are macro invocations that run packaged configurations called crates. The above adds my public key (~/.ssh/id_rsa.pub) as authorized to log in to a couchdb node, and then installs the Open JDK and CouchDB. The cool thing about this is that we don’t specify, and it doesn’t matter, what flavor of Linux is running on the node; these crates are agnostic.

Logging into the Box

jclouds (not Pallet, as far as I can tell), when targeted at EC2, creates a new security group per tag (nodetype) and a new key pair per box. This is good security practice, but it wasn’t obvious to me how to get the actual key so that I could log into a node and poke around.

Poking around on individual nodes is what Pallet is supposed to save you from, so later on, once you’re more comfortable with the tool, I’m sure this sort of thing doesn’t matter, but it’s an issue I encountered.

The automated-admin-user crate is incredibly useful in this regard. It authorizes a public key for login in addition to the one that jclouds creates. By default it authorizes ~/.ssh/id_rsa.pub, your public key, but you can supply your own along with a few more advanced options defined in the pallet.crate.ssh-key namespace.

Beefier Boxes

The couchdb node we defined above by default uses the smallest instance available. On EC2, this is the t1.micro instance. If we need more memory or want more cores, we need to specify that in our node definition.


(defnode couchnode {:min-ram (* 7 1024) :min-cores 2}
  :bootstrap (phase (automated-admin-user))
  :configure (phase (java :openjdk)
                    (couchdb)))

The Pallet docs don’t do a stellar job of outlining what these options are, but that may be because the options themselves are specified in the jclouds Clojure API source, in jclouds/compute/src/main/clojure/org/jclouds/compute.clj:


os-family 
location-id 
architecture 
image-id 
hardware-id
os-name-matches 
os-version-matches
os-description-matches
os-64-bit
image-version-matches 
image-name-matches
image-description-matches 
min-cores 
min-ram
run-script 
install-private-key 
authorize-public-key
inbound-ports 
smallest 
fastest 
biggest 
any 
destroy-on-error

On EC2, with only the RAM and Cores specified, the Amazon Machine Image (AMI) is left to the library to choose. I think jclouds (I think Pallet offloads this logic) chose poorly. The machine is Ubuntu 9.02, whose ssh daemon takes forever to start (consistently more than 10 minutes for me), and once it is started, jclouds itself seemed to be locked out, all of its operations failing with an “Auth failure” message. So as a final step, I found a fresh Ubuntu 10.04 image and specified that along with the other information:


(defnode couchnode {:image-id "us-east-1/ami-da0cf8b3" :min-ram (* 7 1024) :min-cores 2}
  :bootstrap (phase (automated-admin-user))
  :configure (phase (java :openjdk)
                    (couchdb)))

Converge and Lift

With the node type specified, it’s time to start up a few instances. This is the job of converge. First, we create a compute-service object and then we’re good to go:


(def ec2-service (compute-service "ec2" :identity ec2-access-id :credential ec2-secret-key))

(converge {couchnode 5} :compute ec2-service)

converge is kind of neat, as it will stop or start nodes as necessary to bring the total number up or down to what’s specified per nodetype. It runs :bootstrap for newly started nodes and :configure for all nodes, to ensure that when it completes you have a homogeneous cluster of machines at your disposal.
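
Going by that description, scaling back down should just be another converge call; converging the count to zero ought to tear the whole cluster down:

(converge {couchnode 0} :compute ec2-service)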

There is a similar operation, lift that applies phases to all of the nodes of a certain type. It looks very similar in form to converge, but it doesn’t guarantee to run :configure unless you explicitly specify that it should. This makes lift a good candidate for applying configurations to a set of nodes after they’ve already been started:


(lift couchnode
      :compute ec2-service
      :phase (phase (jetty)))

There are a couple of namespace imports that you need to make to be able to run the code as it exists above; you can get your hands on the whole thing here.