Two Guys Arguing

Competing Parsers

Posted in clojure, haskell by youngnh on 09.28.11

So, I’m writing a parser for the Turtle serialization format for RDF. In addition to being a format we use all the time at Revelytix, its a decently compact grammar, giving me a good chance to implement it using The Parsatron and suss out some of the library’s rough edges.

I hit my first rough edge with the longString production:

longString ::= """" lcharacter* """

but, having already implemented the lcharacter parser already, I didn’t see the subtleties in this production and plowed ahead with this straightforward definition:

(defparser long-string []
  (between (times 3 (char \")) (times 3 (char \"))
           (many (lcharacter))))

Which looks great and compiles without complaint, but when you feed it input, it immediately complains:

> (run (long-string) "\"\"\"roughshod\"\"\"")

Unexpected end of input at line: 1 column: 16
[Thrown class java.lang.RuntimeException]

The message here could be better, and I’ll work on that. I would want it to say Unexpected end of input, expected '"""', because what happened was the (many (lcharacter)) parser consumed too much.

Turns out, lcharacter is defined in the grammar to include double quotes, so (many (lcharacter)) ate as many as it could until it literally ran out of input.

A good regex can handle this:

> (re-find #"\"\"\".*\"\"\"" "\"\"\"roughshod\"\"\"")
"\"\"\"roughshod\"\"\""

So we should be able to as well. To keep track of whether or not we’ve consumed a hat trick of quotes, my first attempt looked something like this:

(defparser long-string []
  (letfn [(middle-part [s]
            (let->> [c (lcharacter)]
              (if (= c \")
                (two-left s)
                (middle-part (concat s [c])))))
          (two-left [s]
            (let->> [c (lcharacter)]
              (if (= c \")
                (one-left s)
                (middle-part (concat s [\" c])))))
          (one-left [s]
            (let->> [c (lcharacter)]
              (if (= c \")
                (always s)
                (middle-part (concat s [\" \" c])))))]
    (>> (times 3 (char \"))
        (middle-part []))))

Which uses 3 local, mutually-recursive functions to “count” each consecutive double quote. And they all look a lot alike. I refactored to this:

(defparser long-string []
  (letfn [(middle-part [s n]
            (let->> [c (lcharacter)]
              (if (= c \")
                (case n
                      0 (middle-part s 1)
                      1 (middle-part s 2)
                      2 (always s))
                (middle-part (concat s (repeat n \") [c]) 0))))]
    (>> (times 3 (char \"))
        (middle-part [] 0))))

The above works well, but the problem arises in the first place because lcharacter and """ share the same single-character lookahead. By examining only the next character in the input, we can’t
tell if it should belong to lcharacter or """. This suggests that we can lookahead 3 characters at a time and if we receive """, then we can interpret that not as 3 lcharacters, but as a terminating triple double quote.

(defparser long-string []
  (between (times 3 (char \")) (times 3 (char \"))
           (many
            (let->> [cs (lookahead (times 3 (lcharacter)))]
              (if (= cs [\" \" \"])
                (never)
                (lcharacter))))))

I’m not sure quite which way to go, nor can I immediately see a way to make a higher-level lookahead parser that ensures that 2 parsers don’t stomp all over each other, though that would be quite ideal. If you can, chime in below in the comments.

If you’d like to follow the development of The Parsatron, it’s on github

Tagged with:

One fn to bind them

Posted in clojure, haskell by youngnh on 11.21.10

I had a chance to work on my parsec port a little this weekend. Say hello to one of the most important and ubiquitous parsers in the parsec arsenal, parser-bind.

The idea behind parser-bind is that it should squish two parsers together. It represents parsing one thing after another. The only other parser we’ve built that squishes two parsers together is parser-plus, which operates more like “or” in that if the first one fails, it tries the second. This parser will quit immediately whenever either fails. If this parser succeeds, it’s because each matched successive input.

(defn parser-bind [m n]
  (fn [state cok cerr eok eerr]
    (letfn [(mcok [item state]
              (n state cok cerr cok cerr)))
            (meok [item state]
              (n state cok cerr eok eerr)))]
      (m state mcok cerr meok eerr))))

If the first parser, m, consumes ok, but the second one, n, does not consume, our combined parser will still call the cok continuation. Conversely, if the first one is empty and ok, but the second one consumes, we will also escape via the cok continuation. parser-bind does not override any of the error handling continuations because if something goes wrong, we use them to exit immediately.

The useful part of parser-bind isn’t in the the above implementation. It isn’t how parsec implements the idea. Parsec’s implementation does take the first parser, m, but for it’s second argument, it takes a function that, when executed, returns the second parser.

This is a neat idea because the unlike a parser that has to be fully specified at write-time, a function can bind intermediate, runtime results. Those intermediate results, once bound and named can be used to create further parsers. It allows us to write let-like forms:

(p-let [c (one-of "abc")]
  (char c))

Where each binding form in the parser let has to be a destructuring form and parser pair. The above is a parser that parses a character, and then looks for a duplicate of what it just parsed, similar to capture groups in regular expressions. p-let uses parser-bind under the covers:

(defmacro p-let [[& bindings] & body]
  (let [[bind-form p] (take 2 bindings)]
    (if (= 2 (count bindings))
      `(parser-bind ~p (fn [~bind-form] ~@body))
      `(parser-bind ~p (fn [~bind-form] (p-let ~(drop 2 bindings) ~@body))))))

Given only a single binding pair, we make the parser in it the first argument to parser-bind, and wrap a function with it’s destructing form as args, returning the body. In longer binding forms, we produce a recursive structure that macroexpand will continue to expand one binding form at a time.

Tagged with:

Porting Haskell’s Parsec

Posted in clojure, haskell by youngnh on 11.11.10

Parsec is a parser-combinator library. Parser combinators are built around the idea of making a bunch of very small and focused parsers and combining them using operators that one would more usually see in regular expressions. Ultimately leading to parsers that feel more like function calls in a program that a stiff declaration of a grammar. Parsec is king in Haskell-land. In Clojure, however, there are a number of libraries fully- and not-so-fully written that can be used to write parsing programs. fnparse, Amotoen, clarsec, parser, clj-peg are just a few, feel free to mention your favorites in the comments. I don’t mean to leave any out, but rather point to out that what I’m doing here is not new. I do hope it’s illuminating for some.

Parsec, as I see it, boils down to 2 ideas.
A parser either consumes input or doesn’t. Consumed or Empty.
A parser either succeeds in parsing or it fails. Ok or Err.

These outcomes can be combined into 4 continuation functions that are passed to every parser:

  • cok – Consumed & Ok
  • cerr – Consumed & Err
  • eok – Empty & Ok
  • eerr – Empty & Err

As for errors, Parsec defines two types of them. Those that we can say something about, and those that we can say nothing about. These are errors with messages and unknown errors, respectively. Of the errors that we can say something about, some are the result of not finding input that the parser was expecting, which lead to messages like “expected ‘a’ and found ‘b'”, and some are the result of not finding input where we expected to, which lead to messages like “unexpected end of input”.

Finally, Parsec keeps tabs on the thing it’s parsing, it maintains state. The state is made up of 2 elements, the input stream itself, of which a Clojure seq models nicely and the current source position, itself made up of the name of the input, and one’s current line and column location in it.

The Most Basic Parsers

The simplest parser is the one that no matter what, returns a constant value. This is called parserReturn in Haskell, but in Clojure, it’s more akin to the constantly function, so I’ve named it always, and here’s it’s simplified implementation:

(defn always [x]
  (fn [state cok cerr eok eerr]
    (eok x state)))

This implementation makes sense. No matter what, it returns a new parser. A parser is merely a fn that takes a state and 4 continuations. The always parser always calls the Empty & Ok continuation. Nothing was removed from the stream (hence the Empty part), and everything should continue on as normal (the Ok part).

Equally simple is the parser that always fails. This is called parserZero in Haskell, since it represents a “nothing” parser.

(defn parser-zero []
  (fn [state cok cerr eok eerr]
    (eerr (unknown-error state))))

(defn unknown-error [{:keys [pos] :as state}]
  (ParseError. pos []))

More Interesting Parsers

One of the more basic parsers in Parsec is tokenPrim, which processes a single element from the underlying stream. It unconses the first element from the head of the input, tests if it is supposed to be consumed and then updates the state’s current position in the input. To do this, it takes 3 functions.

nextpos calculates a new source position based on the item consumed and the old position.
test takes a single element from the underlying stream and returns whether or not to consume it
showToken is used to create readable error messages by returning a string representation of stream elements

(defn token-prim [show-f nextpos-f consume?]
  (fn [{:keys [input pos] :as state} cok cerr eok eerr]
    (if-let [s (seq input)]
      (let [item (first s)
            rest-of-input (next s)]
        (if (consume? item)
          (let [newpos (nextpos-f pos item rest-of-input)
                newstate (InputState. rest-of-input newpos)]
            (cok item newstate))
          (eerr (unexpect-error (show-f item) pos))))
      (eerr (unexpect-error "" pos)))))

There are three ways the above function continues. Two are through eerr, one when there is nothing left in the seq when we were expecting to parse something, and one when we did parse something, but our test told us not to consume it. In the second case we can produce a decently readable description of the item so that we can later present it to the user. Finally, if our test tells us to go ahead and consume the item, we call cok passing it the item and a newly calculated state with a new position and the input without our consumed item on the front.

There’s a lot of parsers we can implement on top of token-prim, however, it’s got no brain. You can only line up a number of token parsers one after another and let them tell you if the input matched in the order you thought it would. We can’t express the idea of “or” with it. For that, Parsec relies on the parserPlus parser. It’s called “plus” because it’s used to glue multiple parsers into a single one, analagous to how addition of numbers glues them all together into a new, single number (I never used to think about things like this. Haskell has made me re-understand everything I already knew).

The strategy for implementing parserPlus is that it will take 2 parsers and try the first one. If that succeeds, we’ll go with that. If it doesn’t, we try the second one, and if it succeeds we want our combined parser to be indistinguishable from that second parser. If neither work, then our parser didn’t work and we want to escape like any other parser would if it failed. Calling the first parser is easy. For the sake of staying close to the original Haskell, we’ll call this parser m. Parsers in Haskell and Clojure are simply functions, so in order to try it, we can invoke it and pass the current state and the 4 continuations it expects.

The continuations are our hook to intercept failures. We know that if m fails, it will call the fourth continuation we pass it. So to try the second parser, n second we’re going to wrap the eerr function (the 4th continuation) by trying that second parser before giving up and calling eerr. Here’s how it looks in Clojure:

(defn parser-plus [m n]
  (fn [state cok cerr eok eerr]
    (letfn [(meerr [err]
               (letfn [(neok [item state-prime]
                          (eok item state-prime))
                        (neerr [err-prime]
                          (eerr (merge-error err err-prime)))]
                 (n state cok cerr neok neerr)))]
      (m state cok cerr eok meerr))))

The loacally nested functions aren’t exactly readable at a glance, but combined with the knowledge of what’s happening it’s a really elegant way to express the idea. Also, as a small note, there aren’t great names for some of the nested function parameters. state-prime and err-prime? Well, that’s a holdover from Haskell to express that the thing is an altered version of the thing it came from. In mathematics, this is expressed as a tick, state' and err'. Those aren’t legal Clojure 1.2 identifiers, so I opted to be verbose. Starting with the Clojure 1.3 alphas available now, tick is a legal constituent character, which means you can use it anywhere in an identifier except as the first character.

The last parser I’d like to tackle in this blog post is manyAccum. This parser wraps behavior around an existing parser and so becomes a tangle of continuation functions just like parser-plus was, but unlike parser-plus, manyAccum only accepts one parser and attempts to apply it 0 or more times. This is the Parser equivalent of the Kleene operator.

Just like parser-plus, we’re going to invoke the parser manyAccum is given and create a new parser by manipulating the continuations we pass to it. Specifically, if the parser we’re given fails to consume any input (calls eerr), we’re going to hijack that and report that it was instead an eok with an empty list. If the parser succeeds in consuming input, we’re going to try to get it to do it again. And again. And again forever. Here’s what it looks like:

(defn many-accum [p]
  (fn [state cok cerr cok eerr]
    (letfn [(many-err [err] (throw (RuntimeException. "combinator '*' is applied to a parser that accepts an empty string")))
             (continue [coll item state-prime]
               (p state-prime (partial continue (cons item coll)) cerr many-err (fn [_] (cok (cons item coll) state-prime))))]
      (p state (partial continue (seq [])) cerr many-err (fn [_] (eok [] state))))))

We define many-err to immediately quit with an exception if the third continuation, eok, is called since that means that p accepts empty strings and would spin forever if we let it. The only other trick to many-accum is that we create continue to accumulate items by first calling it with an empty seq, (seq []) and then consing further consumed items onto the front. Haskell’s many-accum takes a cons-like operator in addition to p as a more flexible way of creating a list of elements.

A final Note

I intentionally stayed away from Monads in this post (which is no easy task when porting Haskell), averting my eyes from Konrad Hinsen’s clojure.contrib.monad and trying wherever possible to make Clojure functions feel less like Haskell functions obsessed with parentheses. Not because Monads are particularly special or complex, but rather just the opposite. Monads fall out of designs that favor composability and uniformity. The first parser of this post, always, is half of an implementation of Monad. parser-zero and parser-plus are 100% of an smaller class of monads called MonadPlus. Reading clj-http’s source, I felt like it was such clean and idiomatic Clojure, with fantastic composablity properties that made it easy to build on top of, but also like it would be very easy to expess in Haskell and not feel forced or awkward. So it’ll be interesting to finish this port and see if I can succeed in doing the same in the opposite direction.

Tagged with: