Two Guys Arguing

Porting Haskell’s Parsec

Posted in clojure, haskell by youngnh on 11.11.10

Parsec is a parser-combinator library. Parser combinators are built around the idea of making a bunch of very small and focused parsers and combining them using operators that one would more usually see in regular expressions. Ultimately leading to parsers that feel more like function calls in a program that a stiff declaration of a grammar. Parsec is king in Haskell-land. In Clojure, however, there are a number of libraries fully- and not-so-fully written that can be used to write parsing programs. fnparse, Amotoen, clarsec, parser, clj-peg are just a few, feel free to mention your favorites in the comments. I don’t mean to leave any out, but rather point to out that what I’m doing here is not new. I do hope it’s illuminating for some.

Parsec, as I see it, boils down to 2 ideas.
A parser either consumes input or doesn’t. Consumed or Empty.
A parser either succeeds in parsing or it fails. Ok or Err.

These outcomes can be combined into 4 continuation functions that are passed to every parser:

  • cok – Consumed & Ok
  • cerr – Consumed & Err
  • eok – Empty & Ok
  • eerr – Empty & Err

As for errors, Parsec defines two types of them. Those that we can say something about, and those that we can say nothing about. These are errors with messages and unknown errors, respectively. Of the errors that we can say something about, some are the result of not finding input that the parser was expecting, which lead to messages like “expected ‘a’ and found ‘b'”, and some are the result of not finding input where we expected to, which lead to messages like “unexpected end of input”.

Finally, Parsec keeps tabs on the thing it’s parsing, it maintains state. The state is made up of 2 elements, the input stream itself, of which a Clojure seq models nicely and the current source position, itself made up of the name of the input, and one’s current line and column location in it.

The Most Basic Parsers

The simplest parser is the one that no matter what, returns a constant value. This is called parserReturn in Haskell, but in Clojure, it’s more akin to the constantly function, so I’ve named it always, and here’s it’s simplified implementation:

(defn always [x]
  (fn [state cok cerr eok eerr]
    (eok x state)))

This implementation makes sense. No matter what, it returns a new parser. A parser is merely a fn that takes a state and 4 continuations. The always parser always calls the Empty & Ok continuation. Nothing was removed from the stream (hence the Empty part), and everything should continue on as normal (the Ok part).

Equally simple is the parser that always fails. This is called parserZero in Haskell, since it represents a “nothing” parser.

(defn parser-zero []
  (fn [state cok cerr eok eerr]
    (eerr (unknown-error state))))

(defn unknown-error [{:keys [pos] :as state}]
  (ParseError. pos []))

More Interesting Parsers

One of the more basic parsers in Parsec is tokenPrim, which processes a single element from the underlying stream. It unconses the first element from the head of the input, tests if it is supposed to be consumed and then updates the state’s current position in the input. To do this, it takes 3 functions.

nextpos calculates a new source position based on the item consumed and the old position.
test takes a single element from the underlying stream and returns whether or not to consume it
showToken is used to create readable error messages by returning a string representation of stream elements

(defn token-prim [show-f nextpos-f consume?]
  (fn [{:keys [input pos] :as state} cok cerr eok eerr]
    (if-let [s (seq input)]
      (let [item (first s)
            rest-of-input (next s)]
        (if (consume? item)
          (let [newpos (nextpos-f pos item rest-of-input)
                newstate (InputState. rest-of-input newpos)]
            (cok item newstate))
          (eerr (unexpect-error (show-f item) pos))))
      (eerr (unexpect-error "" pos)))))

There are three ways the above function continues. Two are through eerr, one when there is nothing left in the seq when we were expecting to parse something, and one when we did parse something, but our test told us not to consume it. In the second case we can produce a decently readable description of the item so that we can later present it to the user. Finally, if our test tells us to go ahead and consume the item, we call cok passing it the item and a newly calculated state with a new position and the input without our consumed item on the front.

There’s a lot of parsers we can implement on top of token-prim, however, it’s got no brain. You can only line up a number of token parsers one after another and let them tell you if the input matched in the order you thought it would. We can’t express the idea of “or” with it. For that, Parsec relies on the parserPlus parser. It’s called “plus” because it’s used to glue multiple parsers into a single one, analagous to how addition of numbers glues them all together into a new, single number (I never used to think about things like this. Haskell has made me re-understand everything I already knew).

The strategy for implementing parserPlus is that it will take 2 parsers and try the first one. If that succeeds, we’ll go with that. If it doesn’t, we try the second one, and if it succeeds we want our combined parser to be indistinguishable from that second parser. If neither work, then our parser didn’t work and we want to escape like any other parser would if it failed. Calling the first parser is easy. For the sake of staying close to the original Haskell, we’ll call this parser m. Parsers in Haskell and Clojure are simply functions, so in order to try it, we can invoke it and pass the current state and the 4 continuations it expects.

The continuations are our hook to intercept failures. We know that if m fails, it will call the fourth continuation we pass it. So to try the second parser, n second we’re going to wrap the eerr function (the 4th continuation) by trying that second parser before giving up and calling eerr. Here’s how it looks in Clojure:

(defn parser-plus [m n]
  (fn [state cok cerr eok eerr]
    (letfn [(meerr [err]
               (letfn [(neok [item state-prime]
                          (eok item state-prime))
                        (neerr [err-prime]
                          (eerr (merge-error err err-prime)))]
                 (n state cok cerr neok neerr)))]
      (m state cok cerr eok meerr))))

The loacally nested functions aren’t exactly readable at a glance, but combined with the knowledge of what’s happening it’s a really elegant way to express the idea. Also, as a small note, there aren’t great names for some of the nested function parameters. state-prime and err-prime? Well, that’s a holdover from Haskell to express that the thing is an altered version of the thing it came from. In mathematics, this is expressed as a tick, state' and err'. Those aren’t legal Clojure 1.2 identifiers, so I opted to be verbose. Starting with the Clojure 1.3 alphas available now, tick is a legal constituent character, which means you can use it anywhere in an identifier except as the first character.

The last parser I’d like to tackle in this blog post is manyAccum. This parser wraps behavior around an existing parser and so becomes a tangle of continuation functions just like parser-plus was, but unlike parser-plus, manyAccum only accepts one parser and attempts to apply it 0 or more times. This is the Parser equivalent of the Kleene operator.

Just like parser-plus, we’re going to invoke the parser manyAccum is given and create a new parser by manipulating the continuations we pass to it. Specifically, if the parser we’re given fails to consume any input (calls eerr), we’re going to hijack that and report that it was instead an eok with an empty list. If the parser succeeds in consuming input, we’re going to try to get it to do it again. And again. And again forever. Here’s what it looks like:

(defn many-accum [p]
  (fn [state cok cerr cok eerr]
    (letfn [(many-err [err] (throw (RuntimeException. "combinator '*' is applied to a parser that accepts an empty string")))
             (continue [coll item state-prime]
               (p state-prime (partial continue (cons item coll)) cerr many-err (fn [_] (cok (cons item coll) state-prime))))]
      (p state (partial continue (seq [])) cerr many-err (fn [_] (eok [] state))))))

We define many-err to immediately quit with an exception if the third continuation, eok, is called since that means that p accepts empty strings and would spin forever if we let it. The only other trick to many-accum is that we create continue to accumulate items by first calling it with an empty seq, (seq []) and then consing further consumed items onto the front. Haskell’s many-accum takes a cons-like operator in addition to p as a more flexible way of creating a list of elements.

A final Note

I intentionally stayed away from Monads in this post (which is no easy task when porting Haskell), averting my eyes from Konrad Hinsen’s clojure.contrib.monad and trying wherever possible to make Clojure functions feel less like Haskell functions obsessed with parentheses. Not because Monads are particularly special or complex, but rather just the opposite. Monads fall out of designs that favor composability and uniformity. The first parser of this post, always, is half of an implementation of Monad. parser-zero and parser-plus are 100% of an smaller class of monads called MonadPlus. Reading clj-http’s source, I felt like it was such clean and idiomatic Clojure, with fantastic composablity properties that made it easy to build on top of, but also like it would be very easy to expess in Haskell and not feel forced or awkward. So it’ll be interesting to finish this port and see if I can succeed in doing the same in the opposite direction.

Tagged with: